Submission Jobs failing because of CUDA out of memory

ajijohn · January 2, 2022, 8:59pm

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 11.17 GiB total capacity; 10.17 GiB already allocated; 235.88 MiB free; 10.53 GiB reserved in total by PyTorch)
ERROR conda.cli.main_run:execute(33): Subprocess for ‘conda run [‘python’, ‘main.py’]’ command failed. (See above for error)

How do I go about fixing this? My last two submissions failed because of it.

MPWARE · January 2, 2022, 10:11pm

Try reducing the model's inference batch size to avoid GPU OOM. You can estimate the maximum batch size by running the cloud-cover-runtime Docker image locally and monitoring GPU memory. The K80 GPU has 11 GB at most.
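One rough way to turn the monitored numbers into a batch size is a two-point linear extrapolation. This is only a sketch: it assumes peak memory grows roughly linearly with batch size, and the function name, the 10% headroom, and the example measurements are made up for illustration. They do not come from the competition runtime.

```python
import math


def estimate_max_batch_size(
    bs_small: int,
    peak_small_gib: float,
    bs_large: int,
    peak_large_gib: float,
    capacity_gib: float,
    headroom: float = 0.10,
) -> int:
    """Extrapolate the largest batch size that should fit on the GPU.

    Assumes peak memory = fixed cost + per-sample cost * batch size,
    which is only an approximation for real models.
    """
    if bs_large <= bs_small:
        raise ValueError("bs_large must be greater than bs_small")

    # Slope: extra memory needed for each additional sample in the batch.
    per_sample = (peak_large_gib - peak_small_gib) / (bs_large - bs_small)
    if per_sample <= 0:
        raise ValueError("peak memory did not grow with batch size")

    # Intercept: memory used by weights, CUDA context, etc.
    fixed = peak_small_gib - per_sample * bs_small

    # Keep some memory free so fragmentation does not trigger an OOM.
    budget = capacity_gib * (1.0 - headroom)
    return max(0, math.floor((budget - fixed) / per_sample))


# Hypothetical measurements: 3.0 GiB peak at batch size 4, 5.0 GiB at batch size 8,
# extrapolated to the 11.17 GiB capacity reported in the error message.
print(estimate_max_batch_size(4, 3.0, 8, 5.0, capacity_gib=11.17))
```

The peak values would come from whatever you use to watch GPU memory while the container runs. Treat the result as a starting point and verify it with an actual run.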

ajijohn · January 4, 2022, 10:51pm

Thx @MPWARE for your advice. I was training on a larger GPU and wasn’t paying attention. With a reduced batch size, it seems to run much faster and uses significantly less GPU memory.

Planning to resubmit, will update.

kica22 · January 6, 2022, 7:23pm

Hello, I am also facing a similar error when trying to run the code on the GPU on my local machine. Any help on how I can troubleshoot it? @ajijohn @MPWARE

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 4.00 GiB total capacity; 2.17 GiB already allocated; 0 bytes free; 2.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

ajijohn · January 6, 2022, 7:46pm

@kica22, yes, I had the same issue. From what I understand of your logs, your local GPU has only 4 GiB in total, and it was already full (0 bytes free) when PyTorch tried to allocate another 16 MiB. I don’t know precisely what can be run on a 4 GB GPU, but I can suggest a few things to try:

Try reducing the batch size to something low, maybe between 5 and 8. On your machine, and possibly on Google Colab, you might be able to get it working with a resnet34 backbone.
Don’t use the GPU. That will slow things down, but you can try it out for a limited number of epochs. I think @Max_Schaefer has a post about that.
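If you are not sure which batch size will fit, one option is to start with a guess and halve it whenever an out-of-memory error shows up. The sketch below is only an illustration: `predict_with_backoff` and the `predict_batch` callable are hypothetical names, and the OOM check relies on the error being a `RuntimeError` whose message contains "out of memory", as in the logs above.

```python
from typing import Callable, List, Sequence, Tuple


def is_oom(err: RuntimeError) -> bool:
    """Return True if the error message looks like a CUDA out-of-memory error."""
    return "out of memory" in str(err).lower()


def predict_with_backoff(
    items: Sequence,
    predict_batch: Callable[[Sequence], List],
    batch_size: int = 8,
    min_batch_size: int = 1,
) -> Tuple[List, int]:
    """Run predict_batch over items, halving the batch size on OOM.

    Returns the collected predictions and the batch size that finally worked.
    """
    results: List = []
    start = 0
    while start < len(items):
        chunk = items[start:start + batch_size]
        try:
            results.extend(predict_batch(chunk))
        except RuntimeError as err:
            # Re-raise anything that is not an OOM, or if we cannot shrink further.
            if not is_oom(err) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same chunk start with the smaller batch size
        start += len(chunk)
    return results, batch_size


# Stand-in for a real model call: pretends that batches larger than 3 do not fit.
def fake_predict(chunk: Sequence) -> List:
    if len(chunk) > 3:
        raise RuntimeError("CUDA out of memory. Tried to allocate 16.00 MiB")
    return [x * 2 for x in chunk]


preds, final_bs = predict_with_backoff(list(range(10)), fake_predict, batch_size=8)
print(final_bs, preds)
```

With a real PyTorch model you would probably also want to call `torch.cuda.empty_cache()` before retrying, and run inference under `torch.no_grad()` so that no gradients are kept in memory.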

Also, @Max_Schaefer has proposed some pointers on OOM which can help, see below
