Version consistency between CUDA, PyTorch and other related modules

ryokawa · January 19, 2023, 2:18am

Hi,
While checking the versions of CUDA and PyTorch in the Docker container used by pets-prize-challenge-runtime, I noticed that the version numbers reported through various interfaces seem to be inconsistent. This topic is somewhat similar to a past question:
CUDA 11.0 and cuDNN / NVIDIA driver versions

PyTorch: 1.12.1.post201
CUDA version by PyTorch: 11.2
CUDA version by nvcc: V12.0.76
CUDA runtime version: 11.0.x

I used the commands shown at the bottom of this message to collect the version information.
I would expect these version numbers to be consistent.
The last part of the PyTorch version number usually corresponds to a CUDA version, as in 1.12.1+cu116. However, post201 does not obviously map to any CUDA version.
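
As a minimal sketch of that naming convention, the helper below extracts the CUDA version from a local version label such as +cu116. The function name is hypothetical, and the rule "last digit is the minor version" is an assumption based on the tags mentioned here (cu102, cu113, cu116).

```python
import re


def cuda_from_local_label(version):
    """Return the CUDA version encoded in a '+cuXYZ' suffix, or None if absent."""
    # Assumption: the tag is the major version followed by a single-digit minor.
    match = re.search(r"\+cu(\d{3,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    for v in ("1.12.1+cu116", "1.12.1.post201"):
        print(v, "->", cuda_from_local_label(v))
```

For 1.12.1.post201 this returns None, so the CUDA version has to be read from torch.version.cuda or from the conda build string instead.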

The cause of this inconsistency seems to be the choice of the base image and the Python packages.
In Dockerfile

FROM nvidia/cuda:11.0.3-base-ubuntu20.04

In environment-gpu.yml

  - nvidia::cuda-nvcc=12.0.76
  - pytorch-gpu=1.12.1

The exact binary package of PyTorch is selected automatically by conda from the conda-forge channel, and it resolved to 1.12.1.post201, which seems to be built for CUDA 11.2. However, that differs from the CUDA bundled in the base image (CUDA 11.0.x).
Separately, nvcc of a different version is installed.

I am not sure whether this mix of versions still works at the moment, but at least I can say that it is not a popular or “standard” configuration. According to the official PyTorch website, the PyTorch 1.12.1 binary packages are meant to be used with CUDA 10.2, 11.3, or 11.6. These are distributed in the pytorch channel.

As a result, other PyTorch-related packages in the ecosystem assume one of the CUDA and PyTorch combinations listed above.
I was trying to use PyG (torch-geometric, a graph neural network library), but I could not get it to work in the current runtime environment because it does not support the combination of CUDA 11.0 and PyTorch 1.12.1. The installation of PyG succeeds, but it causes a runtime error.

Is there any way to change the CUDA version to 10.2, 11.3, or 11.6? I think I can submit a pull request to change the base image and update the versions of the related packages accordingly.
(I know it is almost too late to propose this …)

I used the following commands to collect the version numbers within the container.

$ conda run -n condaenv python -c "import torch; print(torch.__version__)"
$ conda run -n condaenv python -c "import torch; print(torch.version.cuda)"
$ conda run -n condaenv nvcc -V
$ ls -la /usr/local/cuda-11.0/targets/x86_64-linux/lib/
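
The same information can also be gathered in one pass from Python. This is only a sketch: the function name is hypothetical, it assumes nvcc prints a version of the form V12.0.76, and it reports None for anything that is not installed.

```python
import importlib
import re
import shutil
import subprocess


def collect_versions():
    """Return the versions reported by PyTorch and nvcc (None where unavailable)."""
    info = {"torch": None, "torch_cuda": None, "nvcc": None}

    # PyTorch may be missing; treat that as "unknown" rather than failing.
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        torch = None
    if torch is not None:
        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds

    # nvcc prints its release on a line ending in e.g. "V12.0.76".
    nvcc = shutil.which("nvcc")
    if nvcc:
        out = subprocess.run([nvcc, "-V"], capture_output=True, text=True).stdout
        match = re.search(r"V(\d+\.\d+\.\d+)", out)
        info["nvcc"] = match.group(1) if match else None
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Running it with conda run -n condaenv python makes sure the environment's own torch and nvcc are the ones inspected.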

jayqi · January 20, 2023, 8:44pm

Hi @ryokawa (and also @kzliu given the other thread),

We’re doing some further investigation to make sure the packages currently installed in the runtime image work correctly.

However, to give an initial response regarding the base image version (CUDA version 11.0.3): this version shouldn’t have any impact on the runtime because a different version of the CUDA runtime will be installed in the conda environment via cudatoolkit. We get drivers and other things from this base image version, but the CUDA runtime that comes with the base image is not used during evaluation. Accordingly, the base image version should not need to be changed, and the managing of CUDA runtime version compatibility happens entirely within what is installed in the conda environment.

We’re taking a look at the packages in the environment and will follow up.

jayqi · January 21, 2023, 6:18am

After testing, we are making the following changes to the runtime environment in this pull request to standardize on CUDA runtime 11.2, which we believe should still work for folks depending on the current versions of packages but also resolve errors that were seen.

All of the GPU-utilizing libraries like PyTorch, TensorFlow, JAX, or XGBoost in the environment are being pinned to CUDA 11.2 builds. This does not actually change any of these packages’ versions: they were previously resolving to cuda112 builds already, and the new PR is just making this explicit and transparent.

We are pinning cudatoolkit to 11.2. Previously, cudatoolkit was not pinned and resolving to version 11.8. This turned out to be mostly compatible with the aforementioned cuda112 builds of other packages, but appeared to sometimes break (torch-geometric was reported). Downgrading to 11.2 should ensure the best and most consistent compatibility across packages.

Finally, we are installing NVCC via cudatoolkit-dev rather than the cuda-nvcc package from the nvidia channel. We were previously installing nvidia::cuda-nvcc to meet the requirements of JAX, per the JAX documentation. However, the important part is that JAX requires ptxas, and this is also available from conda-forge::cudatoolkit-dev. By installing conda-forge::cudatoolkit-dev, we can get versions that are compatible with CUDA runtime 11.2. We have tested the jit example from the JAX documentation in this setup and find that it seems to work.
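
One way to spot-check the pinning described above is to look at the build strings of the installed packages. The sketch below is an illustration, not part of the runtime repository: the function names are hypothetical, and it assumes conda list --json returns entries with name and build_string fields.

```python
import json
import subprocess


def builds_missing_tag(packages, names, tag="cuda112"):
    """Return the sorted names (from `names`) whose build string lacks `tag`."""
    wanted = set(names)
    return sorted(
        pkg["name"]
        for pkg in packages
        if pkg["name"] in wanted and tag not in pkg.get("build_string", "")
    )


def installed_packages(env="condaenv"):
    """Load the package list of a conda environment as a list of dicts."""
    out = subprocess.run(
        ["conda", "list", "-n", env, "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


if __name__ == "__main__":
    gpu_packages = ["pytorch", "tensorflow", "jaxlib", "xgboost"]  # adjust as needed
    print(builds_missing_tag(installed_packages(), gpu_packages))
```

A missing tag is only a hint to look closer, since not every build string carries a CUDA tag.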

@ryokawa @kzliu Please test out the new image and let us know if you have any issues.

@ryokawa : We did some testing with pytorch_geometric=2.2.0 and found that this environment seemed to work with it. If you would like to add pytorch_geometric to the runtime environment, please let us know or open a PR.

kzliu · January 21, 2023, 7:22am

Hi @jayqi, thanks a lot for looking into this!

I ran a centralized smoke test around 20 minutes ago and am observing an unexpected error:

...
Traceback (most recent call last):
  File "/opt/conda/envs/condaenv/lib/python3.9/site-packages/graph_tool/__init__.py", line 373, in __del__
    def __del__(self):
  File "/opt/conda/envs/condaenv/lib/python3.9/site-packages/ray/_private/worker.py", line 1618, in sigterm_handler
    sys.exit(signum)
SystemExit: 15

This suggests that the error happens when importing graph_tool, which imported fine in my previous submissions. Could this be related to the recent changes?

Thanks!

UPDATE: I believe it has something to do with passing a graph_tool.Graph object between processes with the multiprocess package. The same multiprocessing code runs fine in my local environment. Is multiprocess (or Python’s native multiprocessing) allowed in the official runtime? Thanks!

jayqi · January 21, 2023, 4:07pm

Hi @kzliu,

I don’t think there should be any limitations regarding the use of multiprocessing specific to our runtime.

Based on the traceback message and the error, it appears that the system sent your process a SIGTERM to kill it.

It’s possible that this is an out-of-memory error. The pandemic datasets are relatively large, and if you spawned 5 multiprocessing workers and the data gets copied to each of them, you could easily run out of memory on the system.
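
As a rough back-of-the-envelope sketch of that arithmetic: if the parent and every worker each end up holding one full copy of the data, the number of workers that fits is bounded as below. The function name and the numbers in the example are hypothetical, and real memory use depends on how the data is shared or pickled.

```python
def max_workers(data_gib, available_gib):
    """Largest worker count that fits if the parent and each worker hold one full copy."""
    if data_gib <= 0:
        raise ValueError("data_gib must be positive")
    # Reserve one copy for the parent process, then count how many more copies fit.
    return max(0, int((available_gib - data_gib) // data_gib))


if __name__ == "__main__":
    # Hypothetical figures: a 10 GiB in-memory dataset on a machine with 56 GiB free.
    print(max_workers(10, 56))
```

With those hypothetical figures only 4 workers fit, so 5 workers would already exceed the available memory.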

ryokawa · January 22, 2023, 1:51pm

@jayqi Thank you for the updates!
I will test the new environment and check whether my torch-geometric program can be executed or not.

ryokawa · January 23, 2023, 8:47am

@jayqi

I submitted a pull request to add pytorch_geometric=2.2.0.
Could you merge this PR if it is not too late?
Sorry about this last-minute change.

I have confirmed that a PyG program can be executed with the latest environment + this change. Thank you for the update.

(FYI) However, I could not run it when I installed pytorch_geometric manually with conda install -n condaenv -c conda-forge from inside the container after building the official runtime environment. (The installation itself succeeded, but the following error occurred when importing torch_geometric.)

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

ryokawa · January 23, 2023, 2:51pm

@jayqi
To prepare for the case that the PR is not merged, I think I can manage to install pytorch_geometric locally and offline by pre-downloading the required binary packages as explained in the following URL.
https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/installing-with-conda.html

FYI: The std::bad_alloc error above has been fixed. Now I can install pytorch_geometric from inside the container. The root cause seems to be that slightly different versions of pytorch_scatter, pytorch_sparse, and pyg-lib are installed when pytorch_geometric is installed separately. I needed to pin the versions precisely.

conda install -c conda-forge --freeze-installed pytorch_geometric=2.2.0 pyg-lib=0.1.0=cuda112py39h83a068c_1 pytorch_scatter=2.1.0=cuda112py39h83a068c_0 pytorch_sparse=0.6.15=py39h83a068c_0

Then, we can get a complete list of the dependent packages during the installation. We can find those .bz2 and .conda files in /opt/conda/pkgs or elsewhere.

jayqi · January 23, 2023, 3:40pm

Hi @ryokawa,

This is merged into main and the CI build is currently pushing the new image to the container registry. Once this build is successfully completed, the new image with pytorch_geometric should be used by new submissions.

ryokawa · January 24, 2023, 3:30am

Thank you, @jayqi . It looks good to me.
