Traceback (most recent call last):
File "/home/x_zhyua/.conda/envs/pyfr13_env/bin/pyfr", line 33, in <module>
sys.exit(load_entry_point('pyfr==1.13.0', 'console_scripts', 'pyfr')())
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/__main__.py", line 117, in main
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/__main__.py", line 250, in process_run
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/__main__.py", line 229, in _process_common
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/backends/__init__.py", line 12, in get_backend
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/backends/cuda/base.py", line 20, in __init__
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/backends/cuda/driver.py", line 209, in __init__
File "/home/x_zhyua/.conda/envs/pyfr13_env/lib/python3.9/site-packages/pyfr-1.13.0-py3.9.egg/pyfr/ctypesutil.py", line 33, in _errcheck
pyfr.backends.cuda.driver.CUDAOSError
This is the case I have been running for the last two months and everything has worked very well. It must be something related to the cluster settings, since I have the very same problem even when running the example cases. I cannot get immediate help from cluster support, so perhaps you have an idea what this error is. Thank you in advance.
It is hard to say; CUDA OS errors are extremely generic. I would suggest seeing if you can successfully run any other CUDA applications on the cluster; ideally those which make use of the driver API.
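For example, a minimal stand-in for such an application (just a sketch, assuming a Linux node where the driver library is available as libcuda.so.1) is to call the driver API directly from Python with ctypes, which is essentially what PyFR does at start-up:

import ctypes

# Load the CUDA driver library; adjust the name if your system differs
lib = ctypes.CDLL('libcuda.so.1')

# CUresult cuInit(unsigned int Flags); 0 means CUDA_SUCCESS
err = lib.cuInit(0)
print('cuInit returned', err)

# CUresult cuDeviceGetCount(int *count)
count = ctypes.c_int()
err = lib.cuDeviceGetCount(ctypes.byref(count))
print('cuDeviceGetCount returned', err, 'with count =', count.value)

If cuInit already fails here, the problem lies with the node or driver rather than with PyFR itself.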
Given the error that you are getting, I think PyFR is able to find the CUDA library; you would get OSError: Unable to load cuda if it couldn't.
Given that it fails on the init, it might be that no GPUs were allocated to the job when you ran PyFR. Do you see this problem if you try running one of the examples on an instance where your above test program works?
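One quick sanity check (a sketch only; the SLURM variable names are an assumption about your scheduler) is to print what the job can actually see:

import os
import subprocess

# Environment variables typically set for GPU jobs by the driver/scheduler
for var in ('CUDA_VISIBLE_DEVICES', 'SLURM_JOB_GPUS', 'SLURM_GPUS_ON_NODE'):
    print(var, '=', os.environ.get(var))

# List the GPUs visible to this process (requires nvidia-smi on the PATH)
subprocess.run(['nvidia-smi', '-L'], check=False)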
This is most likely your issue. If OpenMPI has been built against (or tries to load) a different version of CUDA from the one PyFR loads, then you will likely experience an error.
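A rough way to check this from inside a job (Linux only, and it may show nothing if the MPI library loads CUDA lazily) is to initialise MPI via mpi4py and then look at which CUDA libraries end up mapped into the process:

# Importing mpi4py.MPI initialises MPI, which is when a CUDA-aware
# Open MPI would normally pull in its CUDA libraries
import mpi4py.MPI  # noqa: F401

# Inspect the memory maps of this process for CUDA libraries
with open('/proc/self/maps') as f:
    cuda_libs = {line.split()[-1] for line in f
                 if 'libcuda' in line or 'libcudart' in line}

print('\n'.join(sorted(cuda_libs)) or 'no CUDA libraries mapped')

The paths which show up here should all belong to the same CUDA installation that PyFR is picking up.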
and that worked some days ago. Although CUDA is not required when building, it is the only version of CUDA that can be used by PyFR. I noticed the build time of this CUDA version by checking nvcc --version; it was almost a year ago.
In order to try to figure this out, I rebuilt PyFR and ran it under different versions of CUDA, all of which give the very same error.
PyFR is not ‘built’ and it does not make any use of nvcc. Rather, we load the active CUDA shared library at runtime. As I said before, the issue is likely to do with your MPI library. Please recompile this either without support for CUDA or against your desired version of CUDA.
Can you confirm that mpi4py has also been recompiled against this new MPI library? (And similarly, HDF5 if you are compiling it with parallel I/O support.)
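A quick way to check (just a sketch) is to query the build configuration that mpi4py and h5py report:

import mpi4py
from mpi4py import MPI

# Compiler/library settings mpi4py was built with
print(mpi4py.get_config())

# Version string of the MPI library that is actually loaded at runtime
print(MPI.Get_library_version())

# Whether h5py was built with parallel (MPI) I/O support
import h5py
print('h5py MPI support:', h5py.get_config().mpi)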
But I should stress that if a CUDA API call is returning an error, this is a CUDA problem, and it should probably be directed towards the NVIDIA support forums (who can likely better explain the circumstances which would lead to cuInit returning the value it does).
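To get a more descriptive message than the generic CUDAOSError, one option (again a sketch, assuming libcuda.so.1 and a driver new enough to provide cuGetErrorName/cuGetErrorString) is to ask the driver itself what the return value from cuInit means:

import ctypes

lib = ctypes.CDLL('libcuda.so.1')

err = lib.cuInit(0)
if err != 0:
    # CUresult cuGetErrorName/cuGetErrorString(CUresult, const char **)
    name = ctypes.c_char_p()
    desc = ctypes.c_char_p()
    lib.cuGetErrorName(err, ctypes.byref(name))
    lib.cuGetErrorString(err, ctypes.byref(desc))
    print('cuInit failed:', name.value, desc.value)
else:
    print('cuInit succeeded')

The resulting error name and description are exactly the kind of detail the NVIDIA forums will want to see.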