I’ve been trying to run “make test-submission”, but it fails with this error:
could not select device driver "" with capabilities: [[gpu]]
I have reinstalled my CUDA driver (12.4.1), nvidia-utils (550.120-1), as well as nvidia-container-toolkit (1.16.1-3) but nothing so far seems to work.
So I’m wondering whether the problem is my CUDA version, because the runtime GitHub repo states that CUDA 11 is required. Can someone confirm this for me? I’d really appreciate it.
System Information:
Operating System: Manjaro Linux
KDE Plasma Version: 6.1.5
KDE Frameworks Version: 6.6.0
Qt Version: 6.7.2
Kernel Version: 6.11.2-4-MANJARO (64-bit)
Graphics Platform: X11
GPU: Nvidia RTX 2060 Super
Based on my understanding of CUDA drivers and the NVIDIA Container Toolkit, they should be backwards compatible, in the sense that the relatively recent versions you have installed on the host should support a CUDA runtime library in the container that is an older version like 11.8.
Can you provide more logs or more information about what is writing out the error message that you’ve shown?
Can you also confirm that the image you are using is the GPU version of the image? Is it a locally built image, or is it from make pull? When you run make test-submission, I believe it should print out the name of the image before it starts the container.
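To check whether Docker itself can reach the GPU, independently of the competition image, a quick smoke test along these lines may help. This is only a sketch: the helper name gpu_smoke_test, the DOCKER override variable, and the nvidia/cuda:11.8.0-base-ubuntu22.04 image tag are assumptions for illustration and are not part of the competition runtime.

```shell
# Hypothetical helper: run nvidia-smi inside a CUDA 11.8 container.
# DOCKER may be set to another compatible CLI; it defaults to "docker".
gpu_smoke_test() {
    # The default image tag is an assumption; any CUDA 11.x base image should do.
    local image="${1:-nvidia/cuda:11.8.0-base-ubuntu22.04}"
    "${DOCKER:-docker}" run --rm --gpus all "$image" nvidia-smi
}

# Usage (not run automatically):
#   gpu_smoke_test
```

If this prints the usual nvidia-smi table, the host driver and the NVIDIA Container Toolkit are working with an older CUDA runtime in the container. If it fails with the same "could not select device driver" error, the problem is likely in the host's Docker/NVIDIA setup rather than in the competition image; restarting the Docker daemon after installing nvidia-container-toolkit (for example with sudo systemctl restart docker) is one thing worth trying.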
The fact that you did not get a GPU error when using the pulled cdcnarratives.azurecr.io/cdc-narratives-competition:gpu-latest image is good: it means that something specific to the first case is not working, but your overall setup should be fine.
Since you have a local image built, it’ll default to using that when you use the Makefile commands. You’ll need to use the SUBMISSION_IMAGE environment variable to specify a different image, like:
SUBMISSION_IMAGE=cdcnarratives.azurecr.io/cdc-narratives-competition:gpu-latest make test-submission
(This is a long command; make sure you copy the whole line.)
Alternatively, you can delete your local image.
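To see which image tags exist locally before overriding or deleting anything, something like the following could be used. The helper name, the DOCKER override variable, and the grep filter on cdc-narratives-competition are assumptions based on the image name above; a locally built image may be tagged differently.

```shell
# Hypothetical helper: list local images whose name mentions the competition repo.
# DOCKER may be set to another compatible CLI; it defaults to "docker".
list_competition_images() {
    "${DOCKER:-docker}" image ls --format '{{.Repository}}:{{.Tag}}' \
        | grep 'cdc-narratives-competition'
}

# Usage (not run automatically):
#   list_competition_images
```

Any local tag it prints can then be removed with docker image rm followed by that tag, if you prefer deleting the local build over setting SUBMISSION_IMAGE each time.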
You’re getting a different error because the image expects several mounted directories, which your sudo docker run command does not provide. These are the lines you see in the make test-submission printout:
I had hoped to get vLLM and maybe a couple of other useful packages into the Docker image. But after investigating the competition platform through smoke-test submissions, I came to the conclusion that it would be fruitless to try to mess with the official Docker configuration.
The hard constraint is that the competition server runs a CUDA 11 driver. This limits any software running on top of it in Docker to CUDA 11, which means that only a very early version of vLLM will be compatible. It would be easy for you to replace CUDA on your own machine, but the organizers are very unlikely to be able to carry that change over to the competition servers.
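For comparison on your own machine, the header of nvidia-smi reports the highest CUDA version that the installed driver supports (not the toolkit version installed in any container). A minimal sketch for pulling that number out follows; the helper name cuda_version_from_smi is made up for illustration.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the
# "CUDA Version" value from its header line (e.g. 12.4).
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage (not run automatically):
#   nvidia-smi | cuda_version_from_smi
```

A container whose CUDA runtime is newer than the version reported by the driver is the case that generally will not work, which is why a CUDA 11 driver on the server rules out builds that need CUDA 12.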
I ended up working happily with the transformers library. I’m grateful that the server is compatible with my local models trained with transformers 4.44.1 and recent versions of CUDA and torch.