Submission fails even though smoke test passes

Hey everyone.

I am having trouble getting my submission through. The smoke test gets evaluated just fine and local testing also works, but the full submission says there is an error generating the file, which is a pretty generic response.

How can I debug this to figure out what is wrong? I am submitting the same submission.zip that I use for the smoke test. What can possibly go wrong that isn’t captured by the smoke test?

Submission id: id-271072

Thanks for the help in advance.

Probably due to memory usage, if the smoke test passes fine.

Which memory, though? RAM, GPU, or disk?

@aiva00 I looked into your submission, and I believe you are running out of GPU memory. The relevant snippet of the error message is below:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.72 GiB. GPU 0 has a total capacty of 15.56 GiB of which 5.59 GiB is free. Process 238127 has 9.96 GiB memory in use. Of the allocated memory 9.83 GiB is allocated by PyTorch, and 13.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
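To make the numbers in that message easier to read, here is a minimal sketch that pulls out the requested, total, and free figures and computes the shortfall. The function name `parse_oom` is hypothetical, and the regular expressions assume the message wording shown above; other PyTorch versions may phrase it differently.

```python
import re

def parse_oom(message):
    """Extract GiB figures from a CUDA out-of-memory message (hypothetical helper).

    Assumes the wording quoted in this thread; returns None if it does not match.
    """
    requested = re.search(r"Tried to allocate ([\d.]+) GiB", message)
    # "capac\w*" tolerates both "capacity" and the "capacty" spelling in the log.
    total = re.search(r"total capac\w* of ([\d.]+) GiB", message)
    free = re.search(r"of which ([\d.]+) GiB is free", message)
    if not (requested and total and free):
        return None
    requested_gib = float(requested.group(1))
    free_gib = float(free.group(1))
    return {
        "requested_gib": requested_gib,
        "total_gib": float(total.group(1)),
        "free_gib": free_gib,
        # How much more memory the allocation needed than was available.
        "shortfall_gib": round(requested_gib - free_gib, 2),
    }

# Excerpt copied from the error message above.
excerpt = (
    "CUDA out of memory. Tried to allocate 11.72 GiB. "
    "GPU 0 has a total capacty of 15.56 GiB of which 5.59 GiB is free."
)
result = parse_oom(excerpt)
print(result)
```

Here the single allocation of 11.72 GiB is larger than the 5.59 GiB that was free, so the request falls short by about 6.13 GiB no matter how the remaining memory is arranged.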

Thanks for your reply. It helped me debug this issue. Can I just ask if there is any limitation on the size of the zip we are submitting?

Hi @aiva00, the disk storage on our node is about 300GB, but we don’t recommend uploading extremely large files: the upload might take a while (the maximum time allowed for upload is 20 hours) and, depending on what you’re doing with your uploaded files in your submission, you might end up running into memory constraints (16GB VRAM and 56GB CPU RAM). Good luck!
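If it helps to see how close a run gets to the 16GB VRAM limit before submitting, one option is to log GPU memory from inside the submission code. This is only a sketch: the function name `gpu_memory_report` is hypothetical, and it assumes PyTorch is what the submission uses (as the error above suggests). It returns None when PyTorch or a CUDA device is not available.

```python
def gpu_memory_report():
    """Return free, total, and peak-allocated GPU memory in GiB, or None.

    Hypothetical helper; assumes PyTorch. Call it after the heaviest step of
    inference to see the peak allocation reached so far in this process.
    """
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return None  # no CUDA device visible
    gib = 2 ** 30
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return {
        "free_gib": free_bytes / gib,
        "total_gib": total_bytes / gib,
        "peak_allocated_gib": torch.cuda.max_memory_allocated() / gib,
    }

report = gpu_memory_report()
print(report)
```

Comparing `peak_allocated_gib` from a local run on full-size inputs against the 16GB limit gives a rough idea of the headroom. Local hardware may differ from the evaluation node, so treat the figure as an estimate.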
