Submission runtime usage and timeout status

Hi,

I am working on the Whale-do challenge and my solution uses nearly all of the 3 hours available for a submission.

I would like to maximize resource usage. Is there any way to know the GPU RAM / GPU / RAM / CPU usage?

Also, when I get a timeout I do not know at which point it happened: was it in the first scenarios or in the last ones?

With the information provided in the problem statement (the number of scenarios and the number of query and database images), I can get a rough idea of whether my model will infer in time, but this is not very precise.

Also, could we know the model of the GPU? I would like to see if inference would benefit from fp16.

Any way to obtain this info? Thanks! :smile:

Hi @VictorCallejas,

On the Code Execution Status page, you will see a link to the submission logs. Perhaps you could find a way to output the information you need to the logs? In addition to the standard library logging module, the runtime also has the loguru library available. When logging, please remember that you are allowed to log progress, but you are not allowed to print out the test data contents.
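
For example, here is a minimal sketch of what such progress logging could look like; it assumes the psutil package is available in your environment (you may need to add it), while the torch and loguru calls are standard:

# Sketch: periodically log CPU/RAM and GPU memory usage to the submission logs.
# Assumes psutil is installed in the runtime environment.
import psutil
import torch
from loguru import logger

def log_resource_usage(step):
    vm = psutil.virtual_memory()
    logger.info(f'step={step} cpu={psutil.cpu_percent():.0f}% '
                f'ram={vm.used / 1e9:.1f}/{vm.total / 1e9:.1f} GB')
    if torch.cuda.is_available():
        logger.info(f'gpu_mem allocated={torch.cuda.memory_allocated() / 1e9:.1f} GB '
                    f'reserved={torch.cuda.memory_reserved() / 1e9:.1f} GB')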

The GPU model is NVIDIA Tesla K80. You can find more details here for the "Standard_NC6" size.
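
As a quick sanity check from inside the runtime, a sketch like this would confirm the device; note that the K80 has compute capability 3.7, which predates tensor cores, so fp16 gains are typically limited:

# Sketch: confirm the GPU model and compute capability from inside your code
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # e.g. "Tesla K80"
    print(torch.cuda.get_device_capability(0))  # (3, 7) for a K80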

I hope this helps!


Sure it helps! Thanks!

Since conda is being run with the --no-capture-output flag (stdout/stderr), I do not think any logging option is viable.

Anyway, since we can see errors when something fails, I found a way: basically raising a ValueError containing whatever data I want, for example:

import time

START = time.time()
LOG = ''
...
if STEP % 500 == 0:
    LOG += f'{EPOCH} {STEP} {time.time() - START:.1f}s / {len(dataloader)}\n'
raise ValueError(LOG)  # the error message shows up in the submission output

I am using this to see if my model will infer in time or not.

I am also using it to tune the batch_size against the maximum CUDA memory, but for that I will make a pull request to update PyTorch to the latest version so I can use this:
torch.cuda.mem_get_info(device=0)
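
For reference, in recent PyTorch versions mem_get_info returns a (free, total) tuple in bytes, so a small sketch of the check (variable names are just illustrative) would be:

# Sketch: check free GPU memory while tuning batch_size (needs a recent PyTorch)
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info(device=0)
print(f'free: {free_bytes / 1e9:.2f} GB / total: {total_bytes / 1e9:.2f} GB')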

Use it well :slightly_smiling_face:

For me, an approximation of the scenarios took about 2:20h on Colab, but here it takes roughly 4h. I will try to convert the trained model from PyTorch to TensorRT and maximize the batch_size.
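
A typical first step for the TensorRT conversion is exporting to ONNX. Here is a rough sketch; the model, input shape, and file name are placeholders, not my actual pipeline:

# Sketch: export a trained model to ONNX as a first step toward TensorRT.
import torch

model = torch.nn.Conv2d(3, 8, 3).cuda()  # placeholder; use the real trained model
model.eval()
dummy = torch.randn(1, 3, 224, 224, device='cuda')
torch.onnx.export(
    model, dummy, 'model.onnx',
    input_names=['input'], output_names=['output'],
    dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},  # keep batch size dynamic
    opset_version=13,
)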

Hi @VictorCallejas,

--no-capture-output controls whether conda buffers stdout, and should not prevent stdout or stderr from being printed to the logs.

Our benchmark example uses loguru and tqdm and we can see the expected outputs in the logs.
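
The pattern is roughly like this sketch (the names are illustrative, not copied from the benchmark):

from loguru import logger
from tqdm import tqdm

batches = range(100)  # stand-in for the real dataloader
logger.info('Starting inference')
for batch in tqdm(batches, total=len(batches)):
    pass  # run the model on the batch here
logger.info('Finished inference')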


Thanks for the answer!

That’s a far better approach indeed :sweat_smile: