Traceback (most recent call last):
File "/home/x_zhyua/.conda/envs/pyfr13_env/bin/pyfr", line 33, in <module>
sys.exit(load_entry_point('pyfr==1.13.0', 'console_scripts', 'pyfr')())
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/__main__.py", line 117, in main
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/__main__.py", line 250, in process_run
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/__main__.py", line 229, in _process_common
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/backends/__init__.py", line 12, in get_backend
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/backends/cuda/base.py", line 20, in __init__
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/backends/cuda/driver.py", line 209, in __init__
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/ctypesutil.py", line 33, in _errcheck
pyfr.backends.cuda.driver.CUDAOSError
This is the case I have been running for the last two months and everything has worked very well. It must be something related to the cluster settings, since I have the very same problem even when running the example cases. I cannot get immediate help from cluster support, so perhaps you have an idea what this error is. Thank you in advance.
It is hard to say; CUDA OS errors are extremely generic. I would suggest seeing if you can successfully run any other CUDA applications on the cluster; ideally those which make use of the driver API.
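For example, a minimal stand-in for such an application (just a sketch, assuming a Linux node where the driver library is available as libcuda.so.1) is to call the driver API directly from Python with ctypes, which is essentially what PyFR does at start-up:

import ctypes

# Load the CUDA driver library; adjust the name if your system differs
lib = ctypes.CDLL('libcuda.so.1')

# CUresult cuInit(unsigned int Flags); 0 means CUDA_SUCCESS
err = lib.cuInit(0)
print('cuInit returned', err)

# CUresult cuDeviceGetCount(int *count)
count = ctypes.c_int()
err = lib.cuDeviceGetCount(ctypes.byref(count))
print('cuDeviceGetCount returned', err, 'with count =', count.value)

If cuInit already fails here, the problem lies with the node or driver rather than with PyFR itself.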
Given the error that you are getting, I think PyFR is able to find the CUDA library; you would get OSError: Unable to load cuda if it couldn't.
Given that it fails on the init, it might be that no GPUs were allocated to the job when you ran PyFR. Do you see this problem if you try running one of the examples on an instance where your above test program works?
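One quick sanity check (a sketch only; the SLURM variable names are an assumption about your scheduler) is to print what the job can actually see:

import os
import subprocess

# Environment variables typically set for GPU jobs by the driver/scheduler
for var in ('CUDA_VISIBLE_DEVICES', 'SLURM_JOB_GPUS', 'SLURM_GPUS_ON_NODE'):
    print(var, '=', os.environ.get(var))

# List the GPUs visible to this process (requires nvidia-smi on the PATH)
subprocess.run(['nvidia-smi', '-L'], check=False)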
This is most likely your issue. If OpenMPI has been built against (or tries to load) a different version of CUDA from the one PyFR loads, then you will likely experience an error.
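A rough way to check this from inside a job (Linux only, and it may show nothing if the MPI library loads CUDA lazily) is to initialise MPI via mpi4py and then look at which CUDA libraries end up mapped into the process:

# Importing mpi4py.MPI initialises MPI, which is when a CUDA-aware
# Open MPI would normally pull in its CUDA libraries
import mpi4py.MPI  # noqa: F401

# Inspect the memory maps of this process for CUDA libraries
with open('/proc/self/maps') as f:
    cuda_libs = {line.split()[-1] for line in f
                 if 'libcuda' in line or 'libcudart' in line}

print('\n'.join(sorted(cuda_libs)) or 'no CUDA libraries mapped')

The paths which show up here should all belong to the same CUDA installation that PyFR is picking up.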
and that worked some days ago. Although CUDA is not required when building, it is the only version of CUDA that can be used by PyFR. I noticed the build time of this CUDA version by checking nvcc --version; it was almost a year ago.
In order to try to figure this out, I rebuilt PyFR and ran it under different versions of CUDA, all of which give the very same error.
PyFR is not ‘built’ and it does not make any use of nvcc. Rather, we load the active CUDA shared library at runtime. As I said before, the issue is likely to do with your MPI library. Please recompile this either without support for CUDA or against your desired version of CUDA.
Can you confirm that mpi4py has also been recompiled against this new MPI library? (And similarly, HDF5 if you are compiling it with parallel I/O support.)
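A quick way to check (just a sketch) is to query the build configuration that mpi4py and h5py report:

import mpi4py
from mpi4py import MPI

# Compiler/library settings mpi4py was built with
print(mpi4py.get_config())

# Version string of the MPI library that is actually loaded at runtime
print(MPI.Get_library_version())

# Whether h5py was built with parallel (MPI) I/O support
import h5py
print('h5py MPI support:', h5py.get_config().mpi)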
But I should stress that if a CUDA API call is returning an error, this is a CUDA problem, and it should probably be directed towards the NVIDIA support forums (who can likely better explain the circumstances which would lead to cuInit returning the value it does).
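To get a more descriptive message than the generic CUDAOSError, one option (again a sketch, assuming libcuda.so.1 and a driver new enough to provide cuGetErrorName/cuGetErrorString) is to ask the driver itself what the return value from cuInit means:

import ctypes

lib = ctypes.CDLL('libcuda.so.1')

err = lib.cuInit(0)
if err != 0:
    # CUresult cuGetErrorName/cuGetErrorString(CUresult, const char **)
    name = ctypes.c_char_p()
    desc = ctypes.c_char_p()
    lib.cuGetErrorName(err, ctypes.byref(name))
    lib.cuGetErrorString(err, ctypes.byref(desc))
    print('cuInit failed:', name.value, desc.value)
else:
    print('cuInit succeeded')

The resulting error name and description are exactly the kind of detail the NVIDIA forums will want to see.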