Submission Jobs failing because of CUDA out of memory

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 11.17 GiB total capacity; 10.17 GiB already allocated; 235.88 MiB free; 10.53 GiB reserved in total by PyTorch)
ERROR conda.cli.main_run:execute(33): Subprocess for ‘conda run [‘python’, ‘main.py’]’ command failed. (See above for error)

How do I fix this? My last two submissions failed because of it.

Try reducing the model's inference batch size to avoid the GPU OOM. You can estimate the maximum batch size by running the cloud-cover-runtime Docker image locally and monitoring GPU memory. The K80 GPU has about 11 GB at most.
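One way to search for a batch size that fits is to start high and halve it whenever an out-of-memory error appears. The sketch below is only an illustration of that idea: `find_max_batch_size` and `fake_run_batch` are hypothetical names, and `run_batch` stands for whatever function runs your model on one batch of the given size. It assumes the OOM surfaces as a `RuntimeError` whose message contains "out of memory", as in the log above.

```python
def find_max_batch_size(run_batch, start=64, minimum=1):
    """Halve the batch size until run_batch(batch_size) no longer hits an OOM error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_batch(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only treat out-of-memory errors as a signal to shrink the batch;
            # anything else is a real bug and is re-raised unchanged.
            if "out of memory" not in str(err).lower():
                raise
            # With PyTorch you would typically also call torch.cuda.empty_cache()
            # here before retrying (assumption: the model runs on a CUDA device).
            batch_size //= 2
    raise RuntimeError(f"even batch size {minimum} does not fit in memory")


def fake_run_batch(batch_size, limit=12):
    """Stand-in for real inference: pretends anything above `limit` runs out of memory."""
    if batch_size > limit:
        raise RuntimeError("CUDA out of memory.")


if __name__ == "__main__":
    print(find_max_batch_size(fake_run_batch, start=64))  # prints 8
```

In practice you would leave some headroom below the value this finds, since memory use on the submission GPU may differ from your local run.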

Thanks @MPWARE for your advice. I was training on a larger GPU and wasn’t paying attention. After reducing the batch size, it seems to run much faster and uses significantly less GPU memory.

Planning to resubmit, will update.

Hello, I am also facing a similar error when trying to run the code on the GPU of my local machine. Any help on how I can troubleshoot it? @ajijohn @MPWARE

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 4.00 GiB total capacity; 2.17 GiB already allocated; 0 bytes free; 2.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@kica22, yes, I had the same issue. Here is what I understand from your logs: your local GPU has only 4 GiB in total and none of it is free, so even the 16 MiB request fails. I don’t know precisely what can be run on a 4 GB GPU, but I can suggest a few things to try:

  • Try reducing the batch size to something small, maybe between 5 and 8. On your machine, and possibly on Google Colab, you might be able to make it work with a resnet34 backbone.
  • Don't use the GPU. That will slow things down, but you can try it for a limited number of epochs. I think @Max_Schaefer has a post about that.
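To compare what was requested with what the GPU actually has, the figures in the OOM message can be pulled out with a few regular expressions. This is a minimal sketch, not an official tool: `parse_oom_message` is a hypothetical helper, and the patterns assume the message format shown in the logs in this thread, which may differ in other PyTorch versions.

```python
import re

# Bytes per unit, as used in the error message.
_UNITS = {"bytes": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}
_NUM = r"([\d.]+) (bytes|KiB|MiB|GiB)"

# One pattern per figure; these match the message format quoted in this thread.
_PATTERNS = {
    "requested": r"Tried to allocate " + _NUM,
    "total": _NUM + r" total capacity",
    "allocated": _NUM + r" already allocated",
    "free": _NUM + r" free",
    "reserved": _NUM + r" reserved",
}


def parse_oom_message(message):
    """Return the memory figures found in a CUDA OOM message, converted to MiB."""
    figures = {}
    for name, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            figures[name] = float(value) * _UNITS[unit] / _UNITS["MiB"]
    return figures


if __name__ == "__main__":
    log = (
        "RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB "
        "(GPU 0; 4.00 GiB total capacity; 2.17 GiB already allocated; "
        "0 bytes free; 2.23 GiB reserved in total by PyTorch)"
    )
    for name, mib in parse_oom_message(log).items():
        print(f"{name}: {mib:.2f} MiB")
```

For the 4 GB log above, this shows a small 16 MiB request failing because nothing is free, which points to the overall memory footprint (model plus batch) rather than one large allocation.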

Also, @Max_Schaefer has proposed some pointers on OOM that can help; see below.