I am having trouble getting my submission through. The smoke test evaluates just fine and local testing also works, but the actual submission fails with "there is an error generating the file", which is a pretty generic message.
How can I debug this and figure out what is wrong? I am submitting the same submission.zip that I used for the smoke test. What could possibly go wrong that isn't caught by the smoke test?
@aiva00 I looked into your submission, and I believe you are running out of GPU memory. The relevant snippet of the error message is below:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.72 GiB. GPU 0 has a total capacty of 15.56 GiB of which 5.59 GiB is free. Process 238127 has 9.96 GiB memory in use. Of the allocated memory 9.83 GiB is allocated by PyTorch, and 13.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
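If it helps, here is a minimal sketch of the usual ways to bring peak GPU usage down in a PyTorch inference loop. It is only an illustration under the assumption that your submission runs batched inference; `model`, `batches`, and `device` are placeholders, not names from your code:

```python
import os
import torch

# Optional: reduce allocator fragmentation, as the error message suggests.
# This must be set before the first CUDA allocation is made.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

def run_inference(model, batches, device="cuda"):
    """Run inference with the common memory-saving tricks applied.

    `batches` is assumed to be an iterable of input tensors.
    """
    model = model.to(device).eval()
    # Half precision roughly halves weight/activation memory on a 16 GiB GPU.
    model = model.half()

    outputs = []
    with torch.inference_mode():           # no autograd graph -> far less memory
        for batch in batches:
            batch = batch.to(device).half()
            out = model(batch)
            outputs.append(out.cpu())      # move results off the GPU right away
            del batch, out

    torch.cuda.empty_cache()               # release cached blocks back to the driver
    return outputs
```

If that is still not enough, reducing the batch size is usually the simplest lever.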
Hi @aiva00, the disk storage on our node is roughly 300 GB, but we don't recommend uploading extremely large files: the upload can take a while (the maximum time allowed for an upload is 20 hours) and, depending on what your submission does with the uploaded files, you may still run into the memory constraints (16 GB VRAM and 56 GB CPU RAM). Good luck!
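If you want to check locally whether your code fits under the 16 GB VRAM limit before submitting, here is a minimal sketch using PyTorch's built-in memory statistics. `run_fn` stands in for whatever your prediction step is; it is a placeholder, not part of the evaluation API:

```python
import torch

def report_peak_vram(run_fn, *args, **kwargs):
    """Run `run_fn` once and report how much GPU memory it peaked at."""
    torch.cuda.reset_peak_memory_stats()
    result = run_fn(*args, **kwargs)
    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"Peak GPU memory allocated: {peak_gib:.2f} GiB (node limit is ~16 GiB)")
    return result
```

Keep in mind your local GPU may behave slightly differently from the evaluation node, so leave some headroom below the limit.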