PyFR 1.13.0: CUDAOSError

Hi,

I get the following error message when I run my case:

Traceback (most recent call last):
  File "/home/x_zhyua/.conda/envs/pyfr13_env/bin/pyfr", line 33, in <module>
    sys.exit(load_entry_point('pyfr==1.13.0', 'console_scripts', 'pyfr')())
  File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/__main__.py", line 117, in main
  File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/__main__.py", line 250, in process_run
  File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/__main__.py", line 229, in _process_common
  File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/backends/__init__.py", line 12, in get_backend
  File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/backends/cuda/base.py", line 20, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/backends/cuda/driver.py", line 209, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/ctypesutil.py", line 33, in _errcheck
pyfr.backends.cuda.driver.CUDAOSError

This is the case I have been running for the last two months, and everything has worked fine. It is probably something related to the cluster configuration, since I get the very same problem even when running the example cases. I cannot get immediate help from cluster support; maybe you have an idea what this error is. Thank you in advance.

Best wishes,
Zhenyang

It is hard to say; CUDA OS errors are extremely generic. I would suggest seeing if you can successfully run any other CUDA applications on the cluster; ideally those which make use of the driver API.

The error is being raised here:

which really should not fail.

Regards, Freddie.

Hi Freddie,

I searched on the internet and tried a simple piece of code to test the CUDA driver:

#include <iostream>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>

int main(void)
{
    CUresult a;
    CUcontext pctx;
    CUdevice device;
    a = cuInit(0);
    std::cout << "Init : " << a << std::endl;
    assert(a == CUDA_SUCCESS);
    a = cuDeviceGet(&device, 0);
    std::cout << "DeviceGet : " << a << std::endl;
    assert(a == CUDA_SUCCESS);
    a = cuCtxCreate(&pctx, CU_CTX_SCHED_AUTO, device); // explicit context here
    std::cout << "CtxCreate : " << a << std::endl;
    assert(a == CUDA_SUCCESS);
    a = cuCtxPopCurrent(&pctx);
    std::cout << "cuCtxPopCurrent : " << a << std::endl;
    assert(a == CUDA_SUCCESS);
    std::cout << "Initialized CUDA" << std::endl;

    return 0;
}

which works:

[screenshot of successful output]

Is that adequate to test the function?

Best wishes,
Zhenyang

Given the error that you are getting, I think PyFR is able to find the CUDA library; I think you would get OSError: Unable to load cuda if it couldn't.

Given that it fails on the init, it might be that when you ran PyFR no GPUs were allocated to the job? Do you see this problem if you try running one of the examples on an instance where your above test program works?
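One quick way to sanity-check this is to look at the GPU-related environment variables the scheduler sets inside the job; the variable names below are common on Slurm clusters but may differ on yours, so treat this as a hedged sketch:

```python
import os


def visible_gpus():
    """Report GPU-related environment variables a scheduler typically sets.

    On many Slurm clusters, an empty or absent CUDA_VISIBLE_DEVICES inside
    the job often means no GPUs were allocated to it.
    """
    gpu_vars = ('CUDA_VISIBLE_DEVICES', 'SLURM_JOB_GPUS', 'SLURM_GPUS_ON_NODE')
    return {var: os.environ.get(var) for var in gpu_vars}
```

Running this inside the batch job (rather than on a login node) shows what the PyFR process would actually see.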

I tried it; the same problem still occurs.

I also found a Python script that uses the init function to check CUDA device information: Simple python script to obtain CUDA device information · GitHub

The result is that I can initialise the CUDA devices and query everything:

[x_zhyua@node036 CUDA]$ python cuda_check.py 
Found 8 device(s).
Device: 0
  Name: NVIDIA A100-SXM4-40GB
  Compute Capability: 8.0
  Multiprocessors: 108
  CUDA Cores: 6912
  Concurrent threads: 221184
  GPU clock: 1410 MHz
  Memory clock: 1215 MHz
  Total Memory: 40536 MiB
  Free Memory: 40122 MiB
......

It seems the init function works well. I called this function in the same environment and on the same device, but the PyFR code is still not working.
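To rule out a mismatch, I also compared which libcuda the dynamic linker resolves, since both programs should be picking up the same library (a small illustrative check, not part of either script):

```python
import ctypes.util


def resolve_cuda_library():
    # Returns the name/path the dynamic linker would use for libcuda,
    # or None if no CUDA driver library can be found on this machine
    return ctypes.util.find_library('cuda')
```

If this returns a different path in the environment where PyFR runs, the two processes are not loading the same driver library.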

There will be something different about the environment. Can you confirm whether your MPI library is CUDA aware or not?

Regards, Freddie.

I ran this command to check:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

and have the feedback:

mca:mpi:base:param:mpi_built_with_cuda_support:value:true

This is most likely your issue. If OpenMPI has been built against (or tries to load) a different version of CUDA to the one PyFR loads, then you will likely experience an error.
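On Linux you can see such a clash directly by inspecting which shared objects a running process has mapped in; if OpenMPI pulls in one libcuda and PyFR another, both paths will appear. A hedged sketch (Linux-only, relies on /proc):

```python
import os


def loaded_libraries(pattern, pid='self'):
    """List distinct shared-object paths mapped into a process whose
    basename contains `pattern` (reads /proc/<pid>/maps, Linux-only)."""
    libs = set()
    with open(f'/proc/{pid}/maps') as f:
        for line in f:
            parts = line.split()
            # Lines with a pathname have at least six whitespace-separated fields
            if len(parts) >= 6 and pattern in os.path.basename(parts[5]):
                libs.add(parts[5])
    return sorted(libs)
```

Calling `loaded_libraries('libcuda')` from inside a hung or failing PyFR run would show whether more than one driver library is loaded.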

Regards, Freddie.

I am a bit confused. When I built PyFR at the beginning, I only used two modules:

module load Anaconda/2021.05-nsc1
module load buildenv-gcccuda/11.2-8.3.1-bare

and that worked until a few days ago. Although CUDA is not required for the installation, this is the only version of CUDA that PyFR can use. I checked when this version of CUDA was built with nvcc --version; it was almost a year ago.

In order to try to figure this out, I reinstalled PyFR and ran it under different versions of CUDA, all of which give the very same error.

Best wishes,
Zhenyang

PyFR is not ‘built’ and it does not make any use of nvcc. Rather, we load the active CUDA shared library at runtime. As I said before, the issue is likely to do with your MPI library. Please recompile this either without support for CUDA or against your desired version of CUDA.

Regards, Freddie.

Ah, I see; thanks for the explanation. I will try to do this.

Best wishes,
Zhenyang

I recompiled the MPI library without CUDA support this time.

[x_zhyua@berzelius001 inc_cylinder_2d]$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:false

But still the same problem occurs.

Can you confirm that mpi4py has also been recompiled against this new MPI library? (And similarly, HDF5 if you are compiling it with parallel I/O support.)
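One way to verify this is to ask mpi4py which MPI it was built against; `mpi4py.get_config()` reports the compiler wrappers and paths recorded at build time. A small sketch that degrades gracefully when mpi4py is absent:

```python
def mpi4py_build_info():
    """Return mpi4py's build configuration (the MPI compiler wrappers and
    paths it was built with), or None if mpi4py is not installed."""
    try:
        import mpi4py
    except ImportError:
        return None

    # get_config() returns a dict recorded when mpi4py was compiled
    return mpi4py.get_config()
```

If the paths in this dict still point at the old (CUDA-aware) OpenMPI installation, mpi4py needs to be reinstalled from source against the new one.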

But I should stress that if a CUDA API is returning an error then this is a CUDA problem, and should probably be directed towards the NVIDIA support forums (who can likely better explain the circumstances which would lead to cuInit returning the value it does).

Regards, Freddie.

Ah, I have recompiled mpi4py against the new MPI library and now everything is working!

Thanks a lot for the patient explanation and guidance.

Best wishes,
Zhenyang
