CUBLASInternalError when using multiple GPUs

Dear all,

I got an error when trying to use two different GPUs.

The command I used was mpiexec -n 2 pyfr restart -b cuda -p solutions_01/sd7003.pyfrm solutions_01/sd7003_145.00.pyfrs sd7003_01.ini.

Traceback (most recent call last):
  File "/home/tang/anaconda3/envs/pyfr1.15.0/bin/pyfr", line 33, in <module>
    sys.exit(load_entry_point('pyfr==1.15.0', 'console_scripts', 'pyfr')())
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/", line 118, in main
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/", line 270, in process_restart
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/", line 247, in _process_common
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/integrators/", line 115, in run
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/integrators/std/", line 181, in advance_to
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/integrators/std/", line 190, in step
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/solvers/base/", line 265, in rhs
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/", line 40, in newmeth
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/solvers/baseadvecdiff/", line 82, in _rhs_graphs
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/backends/base/", line 332, in add
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/backends/cuda/", line 35, in add_to_graph
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/backends/cuda/", line 35, in <listcomp>
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/backends/cuda/", line 103, in add_to_graph
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/backends/cuda/", line 111, in run
  File "/home/tang/anaconda3/envs/pyfr1.15.0/lib/python3.10/site-packages/pyfr-1.15.0-py3.10.egg/pyfr/", line 33, in _errcheck
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

Here are the details of my devices:


The configuration file I used:


precision = single

device-id = local-rank
gimmik-max-nnz = 1024
mpi-type = standard

gamma = 1.4
mu = 3.94405318873308E-6
Pr = 0.72
M = 0.2

scheme = rk45
controller = pi
tstart = 0.0
dt = 0.00001
atol = 0.000001
rtol = 0.000001
; safety-fact = 0.5
; min-fact = 0.3
; max-fact = 1.2
tend = 185.0

nsteps = 50

dt-out = 5.0
basedir = ./solutions_01/
basename = sd7003_{t:.2f}

nsteps = 5
file = sd7003_01-wall-forces.csv
header = true

system = navier-stokes
order = 4
anti-alias = flux

riemann-solver = rusanov
ldg-beta = 0.5
ldg-tau = 0.1

flux-pts = gauss-legendre
quad-deg = 11
quad-pts = gauss-legendre

flux-pts = gauss-legendre
quad-deg = 11
quad-pts = gauss-legendre

soln-pts = gauss-legendre
quad-deg = 11
quad-pts = gauss-legendre

type = char-riem-inv
rho = 1.0
u = 0.2366431913
v = 0.0
w = 0.0
p = 1.0

type = char-riem-inv
rho = 1.0
u = 0.2366431913
v = 0.0
w = 0.0
p = 1.0

type = no-slp-adia-wall
cpTw = 3.5

rho = 1.0
u = 0.2366431913
v = 0.001
w = 0.001*cos(x)*cos(y)
p = 1.0

I have tested the 2D Euler vortex example with two GPUs, and things are fine. A single GPU also works with the same configuration file. So I cannot figure out what I did wrong. Could you give me some suggestions, please?

OS: ubuntu 20.04
PyFR version: 1.15.0

If PyFR works on both GPUs individually, the first thing I would check is whether you might be running out of memory on one of the GPUs. This may not raise an internal error, but it is something we want to rule out.
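As a sketch of how to check this (assuming nvidia-smi is available), you can query per-GPU memory with nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits and flag devices that are close to full. The helper below is illustrative, and the sample figures are made up for demonstration:

```python
# Sketch: flag GPUs whose memory usage is close to capacity.
# Feed it the output of:
#   nvidia-smi --query-gpu=index,name,memory.used,memory.total \
#              --format=csv,noheader,nounits

def nearly_full_gpus(csv_text, threshold=0.9):
    """Return (index, name, fraction_used) for GPUs above the threshold."""
    flagged = []
    for line in csv_text.strip().splitlines():
        index, name, used, total = [f.strip() for f in line.split(',')]
        frac = int(used) / int(total)
        if frac >= threshold:
            flagged.append((int(index), name, frac))
    return flagged

# Made-up sample output for illustration
sample = """0, Tesla K40c, 11200, 11441
1, GeForce GTX 1080 Ti, 9093, 11019"""
print(nearly_full_gpus(sample))  # flags GPU 0, which is ~98% full here
```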

Regards, Freddie.

Hi Freddie,

Thank you for your reply. I ran the simulation on each GPU individually and found the error is caused by GPU 0 (a Tesla K40c). The simulation runs successfully on GPU 1, where it takes 9093 MB of memory, so neither GPU is running out of memory.

I also did a test using PyFR 1.12.3. Surprisingly, the simulation runs successfully with two GPUs, and it is also fine on either GPU alone.

The simulation is the sd7003 case obtained from this paper.

BTW, I have a novice question:
if a single GPU has enough memory for a simulation, can the simulation be accelerated with multiple GPUs? The answer seems to be yes, if I have not misunderstood these discussions. And could you also recommend some GPU programming materials for beginners, please?

Best regards.

I imagine the issue is that the graph API functionality which has been added is not supported on K40 GPUs, which are now quite old. I know, for example, that the newer NVIDIA profilers do not support them.
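For reference, the K40 is a Kepler-class part (compute capability 3.5). If you can read the compute capability of your devices (e.g. from deviceQuery), a quick illustrative check might look like the sketch below; the architecture table is a standard mapping, not anything PyFR-specific:

```python
# Illustrative: map compute capability major version to architecture name
# and flag Kepler parts, which newer CUDA tooling no longer supports.
ARCH = {3: "Kepler", 5: "Maxwell", 6: "Pascal", 7: "Volta/Turing",
        8: "Ampere", 9: "Hopper"}

def arch_name(major):
    """Architecture family for a given compute capability major version."""
    return ARCH.get(major, "unknown")

def is_kepler_or_older(major):
    # Kepler (3.x) and earlier predate much of the current CUDA tooling
    return major <= 3

for name, (major, minor) in [("Tesla K40c", (3, 5)), ("RTX 2080 Ti", (7, 5))]:
    tag = "may lack newer feature support" if is_kepler_or_older(major) else "OK"
    print(f"{name}: sm_{major}{minor} ({arch_name(major)}) -> {tag}")
```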

In answer to your second question: generally, strong scaling will always speed up your calculation, unless you take it much too far. Each GPU needs enough work to occupy its warps sufficiently, for the overhead of launching kernels not to become the dominant cost, and to fill the time while communication is happening.

NVIDIA’s blogs are a good source of material, as is this paper, although as Volta is now a couple of generations old it is less relevant.

OK, I get it. Thank you very much.