Submission Jobs failing because of CUDA out of memory

@kica22, yes, I had the same issue. From what I understand of your logs, your local machine has far less GPU memory than what is being requested: 4GB vs 16GB. I don't know precisely what can be run on a 4GB GPU machine, but I can suggest a few things to try:

  • Try reducing the batch size to something low, maybe between 5 and 8. On your machine, and possibly on Google Colab, you might be able to make it work with a resnet34 backbone.
  • Don't use the GPU. That will slow things down, but you can try it for a limited number of epochs. I think @Max_Schaefer has a post about that.
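
If you would rather not guess the batch size by hand, here is a rough sketch of the first suggestion: keep halving the batch size until one training step fits in memory. It assumes your framework reports the problem as a RuntimeError whose message contains "out of memory" (that is how I have seen PyTorch report it, but check your own logs). `find_workable_batch_size` and `fake_step` are names I made up for illustration; you would replace `fake_step` with a function that runs one real training step at the given batch size.

```python
def find_workable_batch_size(run_one_step, start=32, minimum=1):
    """Halve the batch size until run_one_step(batch_size) stops raising OOM.

    Returns the first batch size that worked, or None if none did.
    """
    batch_size = start
    floor = max(minimum, 1)  # never try a batch size below 1
    while batch_size >= floor:
        try:
            run_one_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Assumption: OOM shows up as a RuntimeError mentioning "out of memory".
            # Any other error is a real bug, so let it propagate.
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2
    return None


def fake_step(batch_size):
    """Hypothetical stand-in for one training step: pretends only 8 samples fit."""
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")


if __name__ == "__main__":
    print(find_workable_batch_size(fake_step, start=32))
```

With the simulated step above, the search tries 32, then 16, then settles on 8. In a real run you may also need to release memory between attempts, and a smaller batch size can change training behaviour, so treat the result as a starting point only.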

Also, @Max_Schaefer has posted some pointers on OOM that can help; see below.