Resource issue when submitting for Track B federated

Hi, I’m running into the following errors with my federated submission for Track B Pandemic:

...
Error: No available node types can fulfill resource request {'CPU': 5.0, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

The submission has been running for several hours now. Is this error expected? In case it is due to the submission queue: if our solution has an option to not use a GPU, would that help speed up the evaluation?

Thanks!

Hi @kzliu,

This is not related to the submission queue, and probably not related to whether your solution involves a GPU.

The entire job runs inside one container. When it says “cluster” in the log, there is not actually any cluster; Ray is just doing multiprocessing on one node. This error log says that inside this container—which is supposed to have 6 CPUs and 1 GPU—it’s not able to spawn a new process that has access to 5 CPUs and 1 GPU.
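To illustrate what that message means, here is a minimal sketch of the kind of check involved. This is not Ray's actual scheduler; the `can_fulfill` helper and the example resource dictionaries are hypothetical, and only the request `{'CPU': 5.0, 'GPU': 1.0}` and the 6 CPU / 1 GPU container size come from this thread.

```python
def can_fulfill(request: dict, available: dict) -> bool:
    """Return True if every requested resource is currently available.

    Hypothetical helper for illustration only; Ray's real logic is more involved.
    """
    return all(available.get(name, 0.0) >= amount for name, amount in request.items())


# The request taken from the error log.
request = {"CPU": 5.0, "GPU": 1.0}

# Nominal container size: the request fits.
print(can_fulfill(request, {"CPU": 6.0, "GPU": 1.0}))

# If no GPU is visible to the runtime, the same request can never be satisfied.
print(can_fulfill(request, {"CPU": 6.0, "GPU": 0.0}))

# Likewise if other processes already hold some of the CPUs.
print(can_fulfill(request, {"CPU": 4.0, "GPU": 1.0}))
```

In other words, the request can fail either because a resource is hidden from the runtime or because something else has already claimed part of it.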

I’m not seeing this error for other running jobs, so there is something specific about your solution that makes the system think that there are not enough CPUs or GPUs available. Are there other processes you’ve run from the server/strategy that are using significant resources?


I see. I was hoping to try CPU-only execution by putting os.environ['CUDA_VISIBLE_DEVICES'] = '' at the top of my solution files. Would this be relevant? If so, is there any other recommended way to disable GPUs? Thanks!

If you set the environment variable CPU_OR_GPU to "cpu" (or in reality, any value that is not "gpu"), it should not request a GPU when spawning new processes for running Client methods. This is the code where that configuration happens.
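As a rough sketch of the behavior described above (not the runtime's actual code): the function name `client_resource_request`, the 5-CPU figure reused from the error log, and the default of "gpu" when the variable is unset are all assumptions for illustration.

```python
import os
from typing import Mapping, Optional


def client_resource_request(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the per-client resource request from CPU_OR_GPU.

    Hypothetical illustration: a GPU is requested only when the variable is
    exactly "gpu". The unset default of "gpu" is an assumption.
    """
    if env is None:
        env = os.environ
    use_gpu = env.get("CPU_OR_GPU", "gpu") == "gpu"
    return {"CPU": 5.0, "GPU": 1.0 if use_gpu else 0.0}


print(client_resource_request({"CPU_OR_GPU": "gpu"}))
print(client_resource_request({"CPU_OR_GPU": "cpu"}))
# Any value other than "gpu" is treated the same way as "cpu".
print(client_resource_request({"CPU_OR_GPU": "anything-else"}))
```

The point is only that the variable is compared against "gpu", so any other value results in a request with no GPU.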


Please also note that if you’re seeing that message at all, then that’s an error that likely won’t recover over time. You should just cancel that job in order to free up the shared resources for other participants.

It may be the case that Ray is unable to spawn a new process because the number of CPUs available is not enough. I’m taking a look at this to see if there is a fix I can make for this.


I have cancelled the job and resubmitted another one with os.environ['CPU_OR_GPU'] = 'cpu' instead for now. Thanks for looking into this!