Is CUDA 11 absolutely required?

Kaungkhantko · November 4, 2024, 10:24pm

Hello,

I’ve been trying to run “make test-submission”, but it fails with:

could not select device driver "" with capabilities: [[gpu]]

I have reinstalled my CUDA driver (12.4.1), nvidia-utils (550.120-1), as well as nvidia-container-toolkit (1.16.1-3) but nothing so far seems to work.

So I’m wondering whether the problem is my CUDA version, because the runtime GitHub repo states that CUDA 11 is required. Can someone confirm this for me? I’d really appreciate it.

System Information:
Operating System: Manjaro Linux
KDE Plasma Version: 6.1.5
KDE Frameworks Version: 6.6.0
Qt Version: 6.7.2
Kernel Version: 6.11.2-4-MANJARO (64-bit)
Graphics Platform: X11
GPU: Nvidia RTX 2060 Super

jayqi · November 4, 2024, 10:55pm

Hi @Kaungkhantko,

As I understand it, CUDA drivers and the NVIDIA Container Toolkit are backwards compatible: the relatively recent versions you have installed on the host should support an older CUDA runtime library in the container, such as 11.8.
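The rule of thumb above can be sketched as a small shell check. This is only an illustration: cuda_runtime_supported is a hypothetical helper name, and it compares just the major.minor parts of the two versions. That is a conservative approximation, since CUDA's minor-version compatibility can allow some combinations that this check rejects.

```shell
# Hypothetical helper (not part of any NVIDIA or Docker tool).
# Returns 0 when a host driver that supports CUDA <host> should be able to
# run a container built against CUDA <container>, i.e. host >= container.
# Assumption: versions look like MAJOR.MINOR or MAJOR.MINOR.PATCH.
cuda_runtime_supported() {
    host="$1"
    container="$2"

    # Split off the major part, then take the first component of the rest.
    host_major="${host%%.*}"
    container_major="${container%%.*}"
    host_rest="${host#*.}"
    container_rest="${container#*.}"
    host_minor="${host_rest%%.*}"
    container_minor="${container_rest%%.*}"

    if [ "$host_major" -gt "$container_major" ]; then return 0; fi
    if [ "$host_major" -lt "$container_major" ]; then return 1; fi
    if [ "$host_minor" -ge "$container_minor" ]; then return 0; fi
    return 1
}
```

For example, cuda_runtime_supported 12.4 11.8 succeeds, while cuda_runtime_supported 11.0 12.4 fails.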

Can you provide more logs or more information about what is writing out the error message that you’ve shown?

Can you also confirm that the image you are using is the GPU version of the image? Is it a locally built image, or is it from make pull? When you run make test-submission, I believe it should print out the name of the image before it starts the container.

Kaungkhantko · November 5, 2024, 12:05am

This is what “make test-submission” prints out:

Using image: cdc-narratives-competition:gpu-local (3d25cf77c6ba)
┏
┃ NAME(S)
┃ cdc-narratives-competition:gpu-local
└

Available official images:
┏
┃ REPOSITORY TAG IMAGE ID CREATED SIZE
┃ cdcnarratives.azurecr.io/cdc-narratives-competition gpu-latest 081ea9282b0f 5 days ago 22.5GB
└

Available local images:
┏
┃ REPOSITORY TAG IMAGE ID CREATED SIZE
┃ cdc-narratives-competition gpu-local 3d25cf77c6ba 3 days ago 18.9GB
└

mkdir -p submission/
chmod -R 0777 submission/
docker run
-it
--gpus all
--network none
-e LOGURU_LEVEL=INFO
-e IS_SMOKE_TEST=true
--mount type=bind,source=/home/kaung/youth-mental-health-runtime/data,target=/code_execution/data,readonly
--mount type=bind,source="/home/kaung/youth-mental-health-runtime/submission",target=/code_execution/submission
--shm-size 8g
--pid host
--name cdc-narratives-competition
--rm
3d25cf77c6ba
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
make: *** [Makefile:198: test-submission] Error 125

I also recently tried running "sudo docker run --rm --gpus all cdcnarratives.azurecr.io/cdc-narratives-competition:gpu-latest" and got a different message:

main

tee /code_execution/submission/log.txt
tee: /code_execution/submission/log.txt: No such file or directory

expected_filename=main.py

cd /code_execution
++ zip -sf ./submission/submission.zip

I suspect that using the online image fixed the issue of the GPU not being detected, and that’s why I’m seeing a different error here.

jayqi · November 5, 2024, 2:40am

Hi @Kaungkhantko,

The fact that you did not get a GPU error when using the pulled cdcnarratives.azurecr.io/cdc-narratives-competition:gpu-latest image is good. It means that something specific to the first case is not working, but that your overall setup should be fine.

Since you have a locally built image, the Makefile commands will default to using it. To use a different image, set the SUBMISSION_IMAGE environment variable, like:

SUBMISSION_IMAGE=cdcnarratives.azurecr.io/cdc-narratives-competition:gpu-latest make test-submission

(This is a long command; make sure you grab the whole line.)

Alternatively, you can delete your local image.

The reason you’re getting a different error is that the image expects several mounted directories, which your sudo docker run command does not provide. These are the lines you see in the make test-submission printout:

--mount type=bind,source=/home/kaung/youth-mental-health-runtime/data,target=/code_execution/data,readonly
--mount type=bind,source="/home/kaung/youth-mental-health-runtime/submission",target=/code_execution/submission
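For reference, a manual equivalent of the printed command can be sketched as a small shell function. This is only a sketch: build_run_command and the REPO_DIR variable are hypothetical names, the default path is the one from the printout, and the flags are copied from that printout rather than from the Makefile itself. The function only prints the command, so it can be checked before anything is run.

```shell
# Hypothetical helper: prints a docker run command that includes the two bind
# mounts the image expects. Flags mirror the make test-submission printout.
# REPO_DIR is an assumed variable pointing at the runtime repository checkout.
REPO_DIR="${REPO_DIR:-/home/kaung/youth-mental-health-runtime}"

build_run_command() {
    image="$1"
    printf '%s' "docker run -it --gpus all --network none"
    printf ' %s' "-e LOGURU_LEVEL=INFO" "-e IS_SMOKE_TEST=true"
    printf ' %s' "--mount type=bind,source=${REPO_DIR}/data,target=/code_execution/data,readonly"
    printf ' %s' "--mount type=bind,source=${REPO_DIR}/submission,target=/code_execution/submission"
    printf ' %s' "--shm-size 8g --pid host --name cdc-narratives-competition --rm"
    printf ' %s\n' "${image}"
}
```

Calling build_run_command cdcnarratives.azurecr.io/cdc-narratives-competition:gpu-latest prints a single-line command that can be copied into a terminal. Using the Makefile with SUBMISSION_IMAGE remains the simpler route.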

wdong · November 11, 2024, 3:09pm

I hoped to get vLLM and maybe a couple of other useful packages into the Docker image. But after investigating the competition platform through smoke test submissions, I concluded that it would be fruitless to try to change the official Docker configuration.

The hard constraint is that the competition server runs a CUDA 11 driver. This limits any software running on top of it in Docker to CUDA 11, which means that only a very early version of vLLM will be compatible. It would be easy to replace CUDA on your own machine, but the organizers are very unlikely to be able to carry that change over to the competition servers.
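One way to check which CUDA version a driver supports, wherever nvidia-smi is available, is to read the "CUDA Version" field in its header. A minimal sketch, assuming the usual header layout; driver_cuda_version is a hypothetical helper name, and it takes the nvidia-smi text as an argument so that it can also be applied to output saved in a log.

```shell
# Hypothetical helper: extracts the value of the "CUDA Version" field from
# nvidia-smi header text passed in as the first argument.
# Prints nothing if the field is not present.
driver_cuda_version() {
    printf '%s\n' "$1" \
        | sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' \
        | head -n 1
}

# On a machine with the NVIDIA driver installed (not executed here):
#   driver_cuda_version "$(nvidia-smi)"
```

The value reported there is the highest CUDA version the driver supports, not the version of any toolkit installed on the host or in the container.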

I ended up working happily with the transformers library. I’m grateful that the server is compatible with my local models trained with transformers 4.44.1 and recent versions of CUDA and torch.
