Submission Jobs failing because of CUDA out of memory

ajijohn · January 6, 2022, 7:46pm

@kica22, yes, I had the same issue. From what I understand of your logs, your local machine has far less GPU memory than is being requested: 4GB vs 16GB. I don't know precisely what can be run on a 4GB GPU machine, but I can suggest a few things to try:

Try reducing the batch size to something low, maybe between 5 and 8. On your machine, and possibly on Google Colab, you might be able to get it working with a resnet34 backbone.
Don't use the GPU at all. That will slow things down, but you can try it out for a limited number of epochs. I think @Max_Schaefer has a post about that.
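
To make the batch size suggestion a bit more concrete, here is a rough sketch (not taken from the competition code) that keeps halving the batch size until a step stops failing with an out-of-memory error. The names `find_workable_batch_size`, `run_step` and `fake_step` are made up for illustration; `run_step` stands in for whatever runs one training or inference step in your own script. It relies on PyTorch reporting CUDA OOM as a `RuntimeError` whose message contains "out of memory".

```python
def find_workable_batch_size(run_step, start=32, minimum=1):
    """Halve the batch size until run_step(batch_size) no longer runs out of memory.

    run_step is a hypothetical callable that performs one step at the
    given batch size and raises RuntimeError if it does not fit.
    """
    minimum = max(1, minimum)  # guard against looping forever at 0
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2
    raise RuntimeError(f"out of memory even at batch size {minimum}")


def fake_step(batch_size):
    # Stand-in for a real step: pretends anything above 8 does not fit.
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")


print(find_workable_batch_size(fake_step, start=32))  # prints 8
```

In a real PyTorch script you would probably also want to call `torch.cuda.empty_cache()` after catching the error before retrying, and for the CPU option you can pick the device with `torch.device("cuda" if torch.cuda.is_available() else "cpu")`.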

Also, @Max_Schaefer has posted some pointers on OOM errors that may help; see below.

