I am working on the Whale-do challenge and my solution uses nearly all of the 3 hours available for a submission.
I would like to maximize resource usage. Is there any way to monitor GPU memory, GPU utilization, RAM, and CPU usage during execution?
Also, when I get a timeout I do not know at which point it occurred, i.e. whether it was in the first scenarios or in the last ones.
With the information provided in the problem statement (the number of scenarios and the number of query and database images), I can get a rough idea of whether my model will finish inference in time, but this is not very precise.
Also, could you tell us the GPU model? I would like to know whether it would benefit from fp16 during inference.
On the Code Execution Status page, you will see a link to the submission logs. Perhaps you could find a way to output the information you need to the logs? In addition to the standard library logging module, the runtime also has the loguru library available. When logging, please remember that you are allowed to log progress, but you are not allowed to print out the test data contents.
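To illustrate, here is a minimal sketch of what such logging could look like, assuming only the standard library, loguru, and torch (all mentioned above); the scenario loop and step names are hypothetical, and nothing from the test data is printed:

```python
# Log progress and resource usage to the submission logs without exposing test data.
import resource

import torch
from loguru import logger


def log_resource_usage(step: str) -> None:
    """Log current GPU memory and peak process RAM."""
    if torch.cuda.is_available():
        allocated_mb = torch.cuda.memory_allocated() / 2**20
        reserved_mb = torch.cuda.memory_reserved() / 2**20
        logger.info(f"{step}: GPU memory allocated={allocated_mb:.0f} MiB, reserved={reserved_mb:.0f} MiB")
    # ru_maxrss is reported in kilobytes on Linux
    peak_ram_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    logger.info(f"{step}: peak process RAM={peak_ram_mb:.0f} MiB")


# Calling this at the start of each scenario means a timeout in the logs
# shows how far inference got before the limit was hit.
for scenario_idx in range(10):  # placeholder loop over scenarios
    log_resource_usage(f"scenario {scenario_idx}")
    # ... run inference for this scenario ...
```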
The GPU model is NVIDIA Tesla K80. You can find more details here for the "Standard_NC6" size.
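If it helps, you can also confirm the device from inside the runtime; a small sketch using standard PyTorch calls:

```python
# Check which GPU the runtime actually provides and its compute capability,
# which is useful when deciding whether fp16 is worth trying.
import torch
from loguru import logger

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    logger.info(f"GPU: {name}, compute capability {major}.{minor}")
```

Note that the Tesla K80 is a Kepler-generation card (compute capability 3.7), so fp16 is likely to help mainly by reducing memory use rather than by speeding up the math itself.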
I am using this to see whether my model will finish inference in time or not.
I also want to optimize the batch_size against the available CUDA memory; for that I will make a pull request to update PyTorch to the latest version and use torch.cuda.mem_get_info(device=0).
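For reference, a rough sketch of how the free memory reported by that call could drive the batch size, assuming a PyTorch version that exposes torch.cuda.mem_get_info (1.11 or newer); the per-sample memory estimate is a placeholder you would measure for your own model:

```python
# Pick a batch size from free CUDA memory with a safety margin.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info(device=0)

    # Hypothetical per-image memory estimate, e.g. measured with
    # torch.cuda.max_memory_allocated() after running a single sample.
    bytes_per_sample = 50 * 2**20  # 50 MiB, placeholder

    safety_margin = 0.8  # leave headroom for the CUDA context and fragmentation
    batch_size = max(1, int(free_bytes * safety_margin // bytes_per_sample))
    print(f"free={free_bytes / 2**30:.1f} GiB of {total_bytes / 2**30:.1f} GiB, "
          f"using batch_size={batch_size}")
```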
Use it well
For me, an approximation of the scenarios took about 2:20h in Colab, but here it takes roughly 4h. I will try to convert the trained model from PyTorch to TensorRT and maximize the batch_size.
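One common route for that conversion is exporting the model to ONNX and then building a TensorRT engine with trtexec. A sketch under those assumptions, with a placeholder network, file names, and input shape standing in for the real model; whether the fp16 engine actually speeds things up depends on the GPU the runtime provides, and the engine has to be built for (or on) that same GPU architecture:

```python
# Export a PyTorch model to ONNX as the first step toward a TensorRT engine.
import torch
import torch.nn as nn

# Placeholder network standing in for the trained re-identification model.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 16),
)
model.eval()

dummy = torch.randn(1, 3, 224, 224)  # placeholder input shape
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["images"],
    output_names=["embeddings"],
    dynamic_axes={"images": {0: "batch"}},  # allow variable batch size
    opset_version=13,
)

# Then, outside Python, something like:
#   trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
# builds the engine that the inference code would load at run time.
```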