Problem with CUDA when installing PyFR on Windows10

Dear all,

I was trying to install PyFR 1.12.2 on a new workstation equipped with an AMD Ryzen Threadripper 3970X 32-core processor and an Nvidia GeForce RTX 2070 SUPER, running Windows 10. I could have installed it on Linux inside a virtual machine such as VMware, but without PCIe passthrough I couldn’t get the full performance of the GPU. So I decided to try a native Windows install, following the installation steps that Nolan_Dyck posted. I built a virtual environment under Anaconda and debugged the code with PyCharm, using pip install -e . as the post describes. Everything went well until step 13, "Test the couette_flow_2d example in pyfr!". The error said that cuda.dll couldn’t be found, and when I searched the whole CUDA installation directory I found no file called cuda.dll either.

I had installed all the dependencies PyFR needs. CUDA Toolkit 11.1 was installed and its lib directory was included in the path. So it’s quite confusing that the CUDA module didn’t work.

Note that I had modified the [backend-cuda] section of the .ini file as follows:

[backend-cuda]
device-id = local-rank
mpi-type = cuda-aware
block-1d = 64
block-2d = 128

The error was as follows:

PS G:\PyFR\pyfr\examples\couette_flow_2d> pyfr run -b cuda -p .\couette.pyfrm .\couette_flow_2d.ini
Traceback (most recent call last):
  File "g:\pyfr\pyfr\pyfr\ctypesutil.py", line 57, in load_library
    return ctypes.CDLL(lname)
  File "G:\software\envs\pyfr_tf\lib\ctypes\__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'cuda.dll' (or one of its dependencies). Try using the full path with constructor syntax.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:\software\envs\pyfr_tf\Scripts\pyfr-script.py", line 33, in <module>
    sys.exit(load_entry_point('pyfr', 'console_scripts', 'pyfr')())
  File "g:\pyfr\pyfr\pyfr\__main__.py", line 117, in main
    args.process(args)
  File "g:\pyfr\pyfr\pyfr\__main__.py", line 250, in process_run
    _process_common(
  File "g:\pyfr\pyfr\pyfr\__main__.py", line 226, in _process_common
    backend = get_backend(args.backend, cfg)
  File "g:\pyfr\pyfr\pyfr\backends\__init__.py", line 12, in get_backend
    return subclass_where(BaseBackend, name=name.lower())(cfg)
  File "g:\pyfr\pyfr\pyfr\backends\cuda\base.py", line 20, in __init__
    self.cuda = CUDA()
  File "g:\pyfr\pyfr\pyfr\backends\cuda\driver.py", line 233, in __init__
    self.lib = CUDAWrappers()
  File "g:\pyfr\pyfr\pyfr\ctypesutil.py", line 14, in __init__
    lib = load_library(self._libname)
  File "g:\pyfr\pyfr\pyfr\ctypesutil.py", line 66, in load_library
    raise OSError(f'Unable to load {name}')
OSError: Unable to load cuda

Has anyone met a similar problem when running PyFR on Windows 10? Thanks for any advice!

Regards, Thatcher

Locate the directory containing cuda.dll. Then, before running PyFR, set the environment variable

set PYFR_LIBRARY_PATH=C:\path\to\dir

and PyFR will search this directory for cuda.dll. Alternatively, you can give the exact location of the library (cmd set notation, value left unquoted)

set PYFR_CUDA_LIBRARY_PATH=C:\path\to\exact\location\cuda.dll

Note that you may also need to specify the path to cuBLAS. This can be done via PYFR_CUBLAS_LIBRARY_PATH.
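
The lookup order described above can be sketched in a few lines of Python. This is a hypothetical illustration, not PyFR's actual loader: the function name resolve_library_path, the env parameter and the assumption that the directory holds a file called <name>.dll are all made up here. Only the variable names PYFR_LIBRARY_PATH and PYFR_<NAME>_LIBRARY_PATH come from this thread.

```python
import os


def resolve_library_path(name, env=None):
    """Return a candidate path for library `name`, or None.

    Hypothetical helper: an exact-file override wins, then a search
    directory. The real PyFR loader may differ in its details.
    """
    env = os.environ if env is None else env

    # 1. Exact file given via PYFR_<NAME>_LIBRARY_PATH
    exact = env.get(f'PYFR_{name.upper()}_LIBRARY_PATH')
    if exact:
        return exact

    # 2. Directory given via PYFR_LIBRARY_PATH (assumed here to hold
    #    a file called <name>.dll)
    libdir = env.get('PYFR_LIBRARY_PATH')
    if libdir:
        candidate = os.path.join(libdir, f'{name}.dll')
        if os.path.isfile(candidate):
            return candidate

    return None
```

Note that the per-library variable is expected to name a file, while PYFR_LIBRARY_PATH names a directory.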

Regards, Freddie.

Hi Freddie,

Thanks for your kind reply! I tried setting the environment variable both system-wide and inside the virtual environment. However, the error persists, and the module it fails to load has changed to this:

FileNotFoundError: Could not find module 'G:\NVIDIA_CUDA\development\bin\' (or one of its dependencies). Try using the full path with constructor syntax.

'G:\NVIDIA_CUDA\development\bin\' is the directory that contains both cublas64_11.dll (which I manually renamed to cublas.dll) and cudart64_110.dll (which I manually renamed to cuda.dll). When I set the variable to the DLL file itself, a different error came up saying that cuinj.dll couldn’t be found.

After searching for a while, I found that the DLL search mechanism changed in Python 3.8, so that relying on the PATH environment variable no longer works, which could be the cause. See this post (windows - PyWin32 and Python 3.8.0 - Stack Overflow) and the docs (ctypes — A foreign function library for Python — Python 3.9.6 documentation) for more details. I also tried adding the following line at line 49 of ctypesutil.py, but with no luck.

os.add_dll_directory(lpath)
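
For reference, a minimal sketch of how os.add_dll_directory is normally called. The function exists only on Windows with Python 3.8 or later, expects an absolute path to an existing directory (not to a DLL file), and returns a handle that can be closed again. The helper name register_dll_dir is made up for illustration.

```python
import os


def register_dll_dir(path):
    """Add `path` to the DLL search path where that is supported.

    Returns the handle from os.add_dll_directory, or None on platforms
    and Python versions that do not provide it.
    """
    if not hasattr(os, 'add_dll_directory'):
        return None

    # The directory must exist and be given as an absolute path
    return os.add_dll_directory(os.path.abspath(path))
```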

Regards, Thatcher

You want to get the following snippet of code working.

import ctypes

hllDll = ctypes.CDLL("C:\\Path\\To\\CUDA\\v10.0\\bin\\cudart64_xyz.dll")

Change the path as appropriate. Once you know which path works, you can do

set PYFR_CUDA_LIBRARY_PATH="..."

and you should not need to rename anything.
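
If several candidate locations are plausible, the same test can be looped. A small sketch, assuming nothing beyond ctypes itself; the helper name first_loadable is hypothetical.

```python
import ctypes


def first_loadable(paths):
    """Return the first path that ctypes.CDLL can load, else None."""
    for path in paths:
        try:
            ctypes.CDLL(path)
        except OSError as exc:
            # FileNotFoundError is a subclass of OSError
            print(f'failed: {path} ({exc})')
        else:
            return path

    return None
```

Pass it the full paths you want to try, and use whichever one it returns as the value of the environment variable.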

Regards, Freddie.

What a prompt response, thanks! I filled in the path to get the snippet working, and I added both the CUDA and cuBLAS paths as environment variables. Note that typing set PYFR_CUDA_LIBRARY_PATH = ... in PyCharm just doesn’t work, so I had to set them through System Properties. The DLL file can now be found, but another error has come up:

Traceback (most recent call last):
  File "G:\software\envs\pyfr_tf\Scripts\pyfr-script.py", line 33, in <module>
    sys.exit(load_entry_point('pyfr', 'console_scripts', 'pyfr')())
  File "g:\pyfr\pyfr\pyfr\__main__.py", line 117, in main
    args.process(args)
  File "g:\pyfr\pyfr\pyfr\__main__.py", line 250, in process_run
    _process_common(
  File "g:\pyfr\pyfr\pyfr\__main__.py", line 226, in _process_common
    backend = get_backend(args.backend, cfg)
  File "g:\pyfr\pyfr\pyfr\backends\__init__.py", line 12, in get_backend
    return subclass_where(BaseBackend, name=name.lower())(cfg)
  File "g:\pyfr\pyfr\pyfr\backends\cuda\base.py", line 20, in __init__
    self.cuda = CUDA()
  File "g:\pyfr\pyfr\pyfr\backends\cuda\driver.py", line 233, in __init__
    self.lib = CUDAWrappers()
  File "g:\pyfr\pyfr\pyfr\ctypesutil.py", line 17, in __init__
    fn = getattr(lib, fname)
  File "G:\software\envs\pyfr_tf\lib\ctypes\__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "G:\software\envs\pyfr_tf\lib\ctypes\__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'cuInit' not found

Regards, Thatcher

Yes, that is because we are loading the CUDA runtime DLL rather than the driver DLL. The DLL we want typically comes with the NVIDIA driver rather than with the CUDA toolkit. It should be called something like cuda.dll.
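
A quick way to tell the two apart is to check for the symbol named in the traceback. A minimal sketch, with the helper name exports_symbol being hypothetical; it relies only on the fact that ctypes raises AttributeError for a function the library does not export.

```python
def exports_symbol(lib, symbol):
    """Return True if the loaded library `lib` exposes `symbol`."""
    try:
        getattr(lib, symbol)
    except AttributeError:
        return False

    return True


# Example use (the path is a placeholder, adjust to your system):
#   import ctypes
#   lib = ctypes.CDLL(r'C:\path\to\candidate.dll')
#   print(exports_symbol(lib, 'cuInit'))
```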

Regards, Freddie.

Hi Thatcher,

On Win10, the cuda.dll we want seems to be C:/Windows/System32/nvcuda.dll

To get started, in the file /pyfr/backends/cuda/driver.py (lines 29 to 31), I change 'cuda' to 'nvcuda':

class CUDAWrappers(LibWrapper):
    # _libname = 'cuda'
    _libname = 'nvcuda'

Then in the Win10 [Environment Variables], I add this to “User variables”:

[variable name] PYFR_NVCUDA_LIBRARY_PATH
[variable value] C:/Windows/System32/nvcuda.dll

Next we need nvrtc.dll and cublas.dll

I installed the CUDA SDK in [C:/VS/CUDA], so I also add these Environment variables:

[variable name] PYFR_NVRTC_LIBRARY_PATH
[variable value] C:/VS/CUDA/bin/nvrtc.dll

[variable name] PYFR_CUBLAS_LIBRARY_PATH
[variable value] C:/VS/CUDA/bin/cublas.dll

To simplify things, like you I made copies to help Python find the right DLLs:

copy cublas64_11.dll → cublas.dll
copy nvrtc64_112_0.dll → nvrtc.dll

To check these paths, type set in a console window. I see the following:

PYFR_CUBLAS_LIBRARY_PATH=C:/VS/CUDA/bin/cublas.dll
PYFR_NVCUDA_LIBRARY_PATH=C:/Windows/System32/nvcuda.dll
PYFR_NVRTC_LIBRARY_PATH=C:/VS/CUDA/bin/nvrtc.dll

With the above 3 settings, I have the current git develop version of PyFR running on a fresh 3975WX CPU and a classic Kepler Titan GPU.
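
The same check can be done from Python, which also confirms that each variable points at a file that exists. A minimal sketch; the function name check_pyfr_library_paths is hypothetical and it only inspects the environment, without loading anything.

```python
import os


def check_pyfr_library_paths(env=None):
    """Map each PYFR_*_LIBRARY_PATH variable to (value, file_exists)."""
    env = os.environ if env is None else env
    report = {}
    for key, value in env.items():
        if key.startswith('PYFR_') and key.endswith('_LIBRARY_PATH'):
            report[key] = (value, os.path.isfile(value))

    return report


if __name__ == '__main__':
    for key, (value, ok) in sorted(check_pyfr_library_paths().items()):
        print(f'{key}={value} [{"found" if ok else "MISSING"}]')
```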

PS: if you’re running PyFR as a Python project inside Visual Studio, you’ll need to restart Visual Studio to get new environment variables.

good luck!
Nigel

Hi Nigel,

Thanks for your kind reply! Funnily enough, I received it JUST as I was typing. What worked for me matches what you describe: three environment variables are needed (NVCUDA/CUBLAS/NVRTC), each pointing to the locations you listed. It’s also worth noting that the snippet fdw shared is very useful for checking whether a given path can actually be loaded.

After setting the environment variables correctly, I can run the example simulation, apart from a problem with the progress bar. Have you met a similar problem before, and if so, how did you solve it?

The progress bar output looks like this:

?[2K?[G   0.0% [>                                                                                                                                                                                          ] 0.00/4.00 ela: 00:00:00 rem: 1
0:26:47?[2K?[G   0.0% [>                                                                                                                                                                                          ] 0.00/4.00 ela: 00:00:00
 rem: 00:41:50?[2K?[G   0.1% [>                                                                                                                                                                                          ] 0.00/4.00 ela: 0
0:00:00 rem: 00:18:31?[2K?[G   0.1% [>                                                                                                                                                                                          ] 0.00/4.00
 ela: 00:00:00 rem: 00:13:19?[2K?[G   0.1% [>                                                                                                                                                                                          ] 0.
00/4.00 ela: 00:00:00 rem: 00:11:02?[2K?[G   0.2% [>
   ] 0.01/4.00 ela: 00:00:00 rem: 00:09:45?[2K?[G   0.2% [>
          ] 0.01/4.00 ela: 00:00:00 rem: 00:08:55?[2K?[G   0.2% [>
                 ] 0.01/4.00 ela: 00:00:01 rem: 00:08:20?[2K?[G   0.3% [>
                        ] 0.01/4.00 ela: 00:00:01 rem: 00:07:54?[2K?[G   0.3% [=>
                               ] 0.01/4.00 ela: 00:00:01 rem: 00:07:36?[2K?[G   0.3% [=>
...

PS. I tried adding the following lines to progress_bar.py, but the output still kept scrolling.

if sys.platform == 'win32':
    sys.stdout.write('\b' *80)
    sys.stdout.write('\r')
    sys.stdout.write(s)
    sys.stdout.flush()
else:
    sys.stderr.write('\x1b[2K\x1b[G')
    sys.stderr.write(s)
    sys.stderr.flush()

Regards, Thatcher

That progress bar remains a fun challenge. When setting this up yesterday, I found that reducing the width of the text block allows the expected behavior (at least for the current incarnation of the Win10 console window).

In file: /pyfr/progress_bar.py, line 37, I subtract 2 from self._ncol:

    # self._ncol = shutil.get_terminal_size()[0] or 80
    self._ncol = shutil.get_terminal_size()[0]-2 or 80

For me, this stops the scrolling and line-wrapping.
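
One caveat on the expression above: shutil.get_terminal_size()[0]-2 or 80 is parsed as (width - 2) or 80, so the fallback of 80 only applies when the width is exactly 2. A hedged sketch of a variant that applies the fallback first; the function name progress_columns and its margin parameter are hypothetical and not part of PyFR.

```python
import shutil


def progress_columns(margin=2, fallback=80):
    """Return the number of columns to draw, leaving `margin` spare."""
    cols = shutil.get_terminal_size(fallback=(fallback, 24)).columns

    # Apply the fallback first, then subtract, so a width of 0 cannot
    # turn into a negative (and therefore truthy) value
    return max((cols or fallback) - margin, 1)
```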

For interest, at the end of the file where we update the progress bar, I find the following one-liner is giving a satisfactory output on Win10:

if sys.platform == 'win32':
    sys.stdout.write('\r' + s)
else:
    sys.stderr.write('\x1b[2K\x1b[G')
    sys.stderr.write(s)
    sys.stderr.flush()
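
To see why the carriage return is enough once the line is narrower than the console, here is a self-contained sketch of the same idea. It is not the code from progress_bar.py: render_bar, draw and the bar layout are made up for illustration.

```python
import sys


def render_bar(frac, ncol):
    """Return a progress bar string that is `ncol` characters wide."""
    frac = min(max(frac, 0.0), 1.0)
    prefix = f'{100*frac:5.1f}% ['
    inner = max(ncol - len(prefix) - 1, 0)
    filled = int(inner*frac)

    # '=' for the completed part, '>' as the head, spaces for the rest
    body = ('='*filled + '>')[:inner].ljust(inner)
    return f'{prefix}{body}]'


def draw(frac, ncol=60, stream=sys.stdout):
    # '\r' returns to column 0 so the next write overprints the bar;
    # this only works while the string is narrower than the console
    stream.write('\r' + render_bar(frac, ncol))
    stream.flush()
```

Calling draw(i/100) in a loop redraws the bar on one line. If the string is as wide as the console, the cursor wraps to the next line before the '\r' arrives, which would explain the scrolling seen earlier.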

Cool! It works as expected once I shorten self._ncol and add the extra line at the end of the file as you suggested. The progress bar now looks like this:

0.3% [=>                               ] 40.03/50.00 ela: 00:02:56 rem: 15:14:06

I can now experience the dramatic strength of GPUs: compared with more than 200 hours on a CPU with 8 threads, the run time is shorter by a factor of around 14! Maybe I can speed things up a little more by combining GPUs and CPU.

Regards, Thatcher

I am unsure why you need to rename the libraries or change any code in PyFR. If you set

set PYFR_CUDA_LIBRARY_PATH=C:/Windows/System32/nvcuda.dll

it should pick it up fine.

Regards, Freddie.

Hi Nigel,

I was trying to compare the performance of two backends, CUDA and OpenMP. Running with the CUDA backend went well, following your reply. However, when I ran the simulation with the OpenMP backend, this error came up:

Traceback (most recent call last):
  File "e:\pyfr_test\pyfr\util.py", line 33, in __call__
    res = cache[key]
KeyError: (<function OpenMPKernelProvider._build_kernel at 0x0000020D34FE7EE0>, b'\x80\x04\x955\x03\x00\x00\x00\x00\x00\x00\x8c\nbatch_gemm\x94X\xf3\x02\x00\x00\n\n#include <omp.h>\n#include <stdlib.h
>\n#include <tgmath.h>\n\n#define SOA_SZ 8\n#define BLK_SZ 8\n\n#define min(a, b) ((a) < (b) ? (a) : (b))\n#define max(a, b) ((a) > (b) ? (a) : (b))\n\n// Typedefs\ntypedef double fpdtype_t;\n\n\n\n//
 libxsmm prototype\ntypedef void (*libxsmm_xfsspmdm_execute)(void *, const fpdtype_t *,\n                                         fpdtype_t *);\n\n// gimmik prototype\ntypedef void (*gimmik_execute)(i
nt, const fpdtype_t *, int, fpdtype_t *, int);\n\nvoid\nbatch_gemm(gimmik_execute exec, int bldim,\n           int nblocks,\n           const fpdtype_t *b, int bblocksz, fpdtype_t *c, int cblocksz)\n{
\n    #pragma omp parallel for\n    for (int ib = 0; ib < nblocks; ib++)\n        exec(bldim, b + ib*bblocksz, bldim, c + ib*cblocksz, bldim);\n}\n\n\x94]\x94(\x8c\x05numpy\x94\x8c\x05int64\x94\x93\x9
4h\x03\x8c\x05int32\x94\x93\x94h\x07h\x05h\x07h\x05h\x07e\x87\x94.', b'\x80\x04}\x94.')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\Anaconda\envs\pyfr_tf\Scripts\pyfr-script.py", line 33, in <module>
    sys.exit(load_entry_point('pyfr', 'console_scripts', 'pyfr')())
  File "e:\pyfr_test\pyfr\__main__.py", line 117, in main
    args.process(args)
  File "e:\pyfr_test\pyfr\__main__.py", line 250, in process_run
    _process_common(
  File "e:\pyfr_test\pyfr\__main__.py", line 232, in _process_common
    solver = get_solver(backend, rallocs, mesh, soln, cfg)
  File "e:\pyfr_test\pyfr\solvers\__init__.py", line 16, in get_solver
    return get_integrator(backend, systemcls, rallocs, mesh, initsoln, cfg)
  File "e:\pyfr_test\pyfr\integrators\__init__.py", line 36, in get_integrator
    return integrator(backend, systemcls, rallocs, mesh, initsoln, cfg)
  File "e:\pyfr_test\pyfr\integrators\std\controllers.py", line 13, in __init__
    super().__init__(*args, **kwargs)
  File "e:\pyfr_test\pyfr\integrators\std\base.py", line 27, in __init__
    self.system = systemcls(backend, rallocs, mesh, initsoln,
  File "e:\pyfr_test\pyfr\solvers\base\system.py", line 68, in __init__
    self._gen_kernels(eles, int_inters, mpi_inters, bc_inters)
  File "e:\pyfr_test\pyfr\solvers\base\system.py", line 187, in _gen_kernels
    kernels[pn, kn].append(kgetter())
  File "e:\pyfr_test\pyfr\solvers\baseadvec\elements.py", line 45, in <lambda>
    kernels['disu'] = lambda: self._be.kernel(
  File "e:\pyfr_test\pyfr\backends\base\backend.py", line 163, in kernel
    return kern(*args, **kwargs)
  File "e:\pyfr_test\pyfr\backends\openmp\gimmik.py", line 48, in mul
    batch_gemm = self._build_kernel('batch_gemm', src, argt)
  File "e:\pyfr_test\pyfr\util.py", line 35, in __call__
    res = cache[key] = self.func(*args, **kwargs)
  File "e:\pyfr_test\pyfr\backends\openmp\provider.py", line 13, in _build_kernel
    mod = SourceModule(src, self.backend.cfg)
  File "e:\pyfr_test\pyfr\backends\openmp\compiler.py", line 65, in __init__
    self.mod = self._cache_set_and_loadlib(lpath)
  File "e:\pyfr_test\pyfr\backends\openmp\compiler.py", line 130, in _cache_set_and_loadlib
    return CDLL(clpath)
  File "E:\Anaconda\envs\pyfr_tf\lib\ctypes\__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'C:\Users\Thatcher\AppData\Local\pyfr\pyfr\Cache\c8188542ba03ceef8f44731c24cb91566c3cecd078454acb4da2577a34372d9f.dll' (or one of its dependencies). Try using the
full path with constructor syntax.

Actually, I searched this forum and found several questions about similar problems, but all of them seem to concern older versions. After years of updates, I assumed such bugs would have been fixed. For the OpenMP backend, I installed MS-MPI, GCC (via MinGW) and mpi4py, and added all of them to the environment. (Ref. Installation Steps)

I have to admit that libxsmm.dll is absent; I still have some problems making it available. But as far as I can tell, the above error has nothing to do with libxsmm, since it is only used for certain dense and sparse matrix operations, which does not seem to be the situation here. So am I missing something? Or is there anything to be careful about in these steps?
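
Since the message reads "Could not find module ... (or one of its dependencies)", it may help to separate the two cases. A minimal sketch, assuming only os and ctypes; the helper name diagnose_dll and its three return values are hypothetical.

```python
import ctypes
import os


def diagnose_dll(path):
    """Classify a DLL load attempt as 'missing', 'load-failed' or 'ok'."""
    if not os.path.isfile(path):
        # The file itself is absent (e.g. it was never written)
        return 'missing'

    try:
        ctypes.CDLL(path)
    except OSError:
        # The file exists but could not be loaded, for example because
        # a library it depends on cannot be found
        return 'load-failed'

    return 'ok'
```

Running it on the cached .dll path from the traceback would show whether the compiler produced the file at all or whether the load itself is what fails.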

PS. I think it would be better to rename this topic, since the problem now extends to how to compile and run PyFR on Windows 10 with both CUDA and OpenMP.

I think the critical point is that the name in the environment variable must match the _libname assigned in driver.py. It works as long as the two names match.

_libname = '{name}'
------------
set PYFR_{name}_LIBRARY_PATH=the/path/to/nvcuda.dll

Once all three key variables are set correctly, everything runs fine.
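
The naming rule can be written out as a one-line function. This is a sketch inferred from the variable names used in this thread (the upper-casing in particular), not a quotation of PyFR's code; library_env_var is a hypothetical name.

```python
def library_env_var(libname):
    """Return the override variable name for a given `_libname`."""
    return f'PYFR_{libname.upper()}_LIBRARY_PATH'


for libname in ('cuda', 'nvcuda', 'cublas', 'nvrtc'):
    print(f'{libname:>7} -> {library_env_var(libname)}')
```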

Regards, Thatcher

Yes, so the default in PyFR for CUDA is _libname = 'cuda'. Hence, modifying this variable in PyFR (which requires changes to the code) is unnecessary and just adds complexity compared with simply setting PYFR_CUDA_LIBRARY_PATH=.../nvcuda.dll.

Regards, Freddie.

Thanks Freddie, agreed! Now that we know which cuda dll’s we need, and given the challenge of hunting for dll’s on a Windows system, maybe there could be a note in the docs advising Windows users to set (appropriately) these three environment variables:

PYFR_CUDA_LIBRARY_PATH=C:/Windows/System32/nvcuda.dll
PYFR_CUBLAS_LIBRARY_PATH=X:/path/to/CUDA/bin/cublas64_11.dll
PYFR_NVRTC_LIBRARY_PATH=X:/path/to/CUDA/bin/nvrtc64_112_0.dll
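
When an IDE does not pick up system-wide variables, another option is to build the environment explicitly and hand it to the PyFR process. A minimal sketch; the helper name pyfr_env is hypothetical, and the three paths passed in are whatever works on your machine.

```python
import os


def pyfr_env(cuda, cublas, nvrtc, base=None):
    """Return a copy of the environment with the three overrides set."""
    env = dict(os.environ if base is None else base)
    env['PYFR_CUDA_LIBRARY_PATH'] = cuda
    env['PYFR_CUBLAS_LIBRARY_PATH'] = cublas
    env['PYFR_NVRTC_LIBRARY_PATH'] = nvrtc
    return env
```

The result can be passed as the env argument of subprocess.run when launching pyfr run -b cuda -p couette.pyfrm couette_flow_2d.ini from a script, which leaves the system settings untouched.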

To Thatcher: regarding setup for the OpenMP backend, I’ve just got this working, and the couette_flow_2d example gives the expected result.

However, I use the native Microsoft tools (Visual Studio) to build the xsmm library, plus the xsmm and gimmik kernels; so this proved to be non-trivial. Given the way VC exports functions from dll’s, and its openmp peculiarities, there are a handful of issues (easily fixed with some adjustments to the mako kernels).

If you’d like to start a thread about problems you’ve found setting up the OpenMP backend on Windows, I’ll try to assist with possible solutions. Together, we might be able to streamline all this for the Win64 community!

Nigel

Hi Nigel,

I’ve done the first step. But I can’t find a CUDA folder in [C:/VS]. Could you please tell me how to install the CUDA SDK in [C:/VS/CUDA]? I tried reinstalling CUDA, but there is still no CUDA folder in [C:/VS].

Regards, Edmund.

Hi Edmund,

I think I can answer your question. Since you have done the first step, nvcuda.dll is already available, so the next step is to download and install the CUDA Toolkit. During the installation you can change the directories of three components: development, documentation and samples. The directory you choose is where you will find the other two DLL files you need (default path: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4).

Note: check your driver version before choosing the newest suitable toolkit version. (See the Release Notes :: CUDA Toolkit Documentation.)
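
To find the versioned file names in whichever directory you installed to, a short search may help. A sketch using only pathlib; the function name find_cuda_dlls is hypothetical, and the glob patterns are based on the names mentioned in this thread (cublas64_11.dll, nvrtc64_112_0.dll), so other toolkit versions may use different suffixes.

```python
from pathlib import Path


def find_cuda_dlls(bin_dir):
    """Return versioned cuBLAS and NVRTC DLLs found in `bin_dir`."""
    bin_dir = Path(bin_dir)
    return {
        'cublas': sorted(bin_dir.glob('cublas64_*.dll')),
        'nvrtc': sorted(bin_dir.glob('nvrtc64_*.dll')),
    }
```

Point it at the bin subfolder of your toolkit directory and use the full paths it returns as the values of PYFR_CUBLAS_LIBRARY_PATH and PYFR_NVRTC_LIBRARY_PATH.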

Regards, Thatcher

Hi Thatcher,

Thank you for the prompt reply. I tried this method, but Python still cannot find the DLLs after I added these environment variables. This may be a problem with my settings. Thank you for the help.

Traceback (most recent call last):
  File "E:\pythonfile\couette_flow_2d\111.py", line 6, in <module>
    www = ctypes.CDLL('cublas64_11.dll')
  File "E:\python\lib\ctypes\__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'cublas64_11.dll' (or one of its dependencies). Try using the full path with constructor syntax.

Regards, Edmund.

Hi Edmund,

Assuming you can actually see the cublas*.dll on your system, and that the PYFR_CUBLAS_LIBRARY_PATH environment variable is set correctly, there’s one more step:

If you are running pyfr from a command line window, after setting environment variables you need to close and start a fresh console window to pick up the new environment variables.

Likewise if running pyfr inside Visual Studio, restart Visual Studio for it to discover the new environment variables.
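
The reason is that a process receives a copy of its parent's environment at the moment it starts, so a console or IDE opened before the change keeps the old values. A small sketch that demonstrates this with child Python processes; PYFR_DEMO_VAR and child_sees are made-up names used only for the demonstration.

```python
import os
import subprocess
import sys


def child_sees(name, env):
    """Return what a freshly started Python process sees for `name`."""
    code = f'import os; print(os.environ.get({name!r}))'
    out = subprocess.run([sys.executable, '-c', code], env=env,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


# Environment as it was before the variable was defined ...
before = dict(os.environ)
before.pop('PYFR_DEMO_VAR', None)

# ... and after it has been added
after = dict(before, PYFR_DEMO_VAR='C:/some/path.dll')

print(child_sees('PYFR_DEMO_VAR', before))
print(child_sees('PYFR_DEMO_VAR', after))
```

Only the second child, started with the updated environment, sees the value.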

PS: I can recommend running PyFR inside Visual Studio; this allows you to step through the python code line by line, to see exactly what happens when. Which can be scary, but valuable if ever you want to adjust the code.

Nigel

Hi Nigel,

Thank you for your help. After I restarted the window, the error was resolved.

Edmund