Error: CUDAOutofMemory

I was trying to simulate the flow around an airfoil at a Reynolds number of 2.32 million using 32 GPUs (8 nodes) on an HPC cluster. The cluster's compute nodes are 32 Dell C4140s, each equipped with 4 NVIDIA V100 GPUs, 2 second-generation Intel Xeon Gold CPUs, and 2 × 480 GB SATA SSDs. Each node also has 22 × 64 GB RAM modules, 1.375 TiB (1408 GiB) in total, and each GPU has 32 GB of memory. The mesh is third-order and consists of 3 million hexahedra and about 82 million solution points; the mesh file is approximately 7 GB.
After importing and partitioning the mesh, the simulation starts and then fails almost immediately with the following error:


           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/home/user/pyfr-venv/lib/python3.12/site-packages/pyfr/backends/cuda/base.py", line 109, in _malloc_impl
    data = self.cuda.mem_alloc(nbytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/home/user/pyfr-venv/lib/python3.12/site-packages/pyfr/backends/cuda/driver.py", line 485, in mem_alloc
    return CUDADevAlloc(self, nbytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/home/user/pyfr-venv/lib/python3.12/site-packages/pyfr/backends/cuda/driver.py", line 223, in __init__
    cuda.lib.cuMemAlloc(ptr, nbytes)
  File "/lustre/home/user/pyfr-venv/lib/python3.12/site-packages/pyfr/ctypesutil.py", line 37, in _errcheck
    raise self._statuses[status]
pyfr.backends.cuda.driver.CUDAOutofMemory
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 26 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 746141.0 ON hpc-wn10 CANCELLED AT 2025-05-20T15:35:53 ***
srun: error: hpc-wn31: tasks 25,27: Killed
srun: error: hpc-wn31: task 26: Exited with exit code 1
srun: error: hpc-wn32: tasks 28-29,31: Killed
srun: error: hpc-wn31: task 24: Killed
srun: error: hpc-wn27: tasks 12-13,15: Killed
srun: error: hpc-wn10: tasks 0,2-3: Killed
srun: error: hpc-wn28: tasks 17-19: Killed
srun: error: hpc-wn32: task 30: Killed
srun: error: hpc-wn26: tasks 8-9,11: Killed
srun: error: hpc-wn29: tasks 20-23: Killed
srun: error: hpc-wn28: task 16: Killed
srun: error: hpc-wn26: task 10: Killed
srun: error: hpc-wn23: tasks 5-7: Killed
srun: error: hpc-wn23: task 4: Killed
slurmstepd: error:  mpi/pmix_v3: _errhandler: hpc-wn27 [3]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.746141.0:12]
slurmstepd: error:  mpi/pmix_v3: _errhandler: hpc-wn10 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.746141.0:0]
srun: error: hpc-wn27: task 14: Killed
srun: error: hpc-wn10: task 1: Killed
                                               

This is the script I use to start the simulation:

#!/bin/bash
#SBATCH --nodes=8       # number of nodes
#SBATCH --partition=gpus  # partition name
#SBATCH --error=myJob.err
#SBATCH --gres=gpu:4
#SBATCH --output=myJob.out
#SBATCH --gpu-bind=closest
#SBATCH --ntasks-per-gpu=1

export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
#pyfr -p  import NLR.msh NLR.pyfrm

pyfr -p partition 32 -p scotch NLR.pyfrm .
srun  -n 32  stdbuf -oL -eL pyfr -p run -b cuda NLR.pyfrm NLR.ini


while in the .ini file I use:
[backend]
precision = single
[backend-cuda]
mpi-type = cuda-aware
device-id = local-rank

I also ran h5ls on a solution file and the partitioning seems to be well balanced, as shown below:

h5ls NLR_0.0001000000.pyfrs 
config                   Dataset {SCALAR}
config-0                 Dataset {SCALAR}
mesh_uuid                Dataset {SCALAR}
soln_hex_p0              Dataset {64, 5, 191726}
soln_hex_p1              Dataset {64, 5, 191776}
soln_hex_p10             Dataset {64, 5, 195291}
soln_hex_p11             Dataset {64, 5, 192212}
soln_hex_p12             Dataset {64, 5, 195398}
soln_hex_p13             Dataset {64, 5, 192109}
soln_hex_p14             Dataset {64, 5, 195327}
soln_hex_p15             Dataset {64, 5, 194558}
soln_hex_p2              Dataset {64, 5, 194928}
soln_hex_p3              Dataset {64, 5, 192004}
soln_hex_p4              Dataset {64, 5, 192012}
soln_hex_p5              Dataset {64, 5, 192668}
soln_hex_p6              Dataset {64, 5, 195094}
soln_hex_p7              Dataset {64, 5, 192714}
soln_hex_p8              Dataset {64, 5, 194386}
soln_hex_p9              Dataset {64, 5, 192557}
stats                    Dataset {SCALAR}

(This output is for 16 partitions, but the balance is similar with 32.)
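
As a quick sanity check on the totals, the per-partition element counts in that listing can be summed straight from the h5ls output. A minimal sketch, assuming the {nupts, nvars, neles} dataset layout shown above:

h5ls NLR_0.0001000000.pyfrs | awk '
  /^soln_hex_p/ { gsub(/}/, "", $NF); total += $NF }   # last field is the per-partition element count
  END { printf "hexahedra across all partitions: %d\n", total }'

If the total does not come out near the expected 3 million, the import or the assumed element count is worth re-checking.
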
Could it be related to the large number of elements per GPU?
Are there any recommended strategies for fixing this problem?

Thanks in advance for any help or insight.

What polynomial order are you attempting to run the simulation at?

Regards, Freddie.

I’m trying order = 3.

So 3 million hexahedra at p = 3 and single precision should, for the Navier-Stokes equations, require about 58 GiB of memory in total. Thus, with 16 or 32 V100s you should have more than enough.
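
For reference, a back-of-envelope version of that estimate (a sketch only, not PyFR's exact memory accounting):

awk 'BEGIN {
  neles = 3.0e6        # hexahedra
  nupts = 64           # (p + 1)^3 solution points per hex at p = 3
  nvars = 5            # conserved variables for 3D Navier-Stokes
  bytes = 4            # single precision
  state = neles * nupts * nvars * bytes
  printf "solution points    : %.0f\n", neles * nupts
  printf "one solution copy  : %.2f GiB\n", state / 2^30
  printf "58 GiB on 32 GPUs  : %.2f GiB per GPU\n", 58 / 32
}'

The full footprint is several such arrays (flux-point data, gradients, time-stepper registers), which is how one arrives at a total of the order quoted above; spread over 32 GPUs that is still well under 2 GiB per 32 GB V100.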

I would first check that no one else is using the GPUs and that they do have the expected amount of free memory. Then, I would verify the element count just to make sure that you’re not off by a factor somewhere.
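
One way to do the first check is to log in to one of the allocated nodes and query the per-GPU memory and any compute processes currently holding it; one possible nvidia-smi query (the plain nvidia-smi output works just as well):

# Per-GPU memory summary
nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv
# Compute processes currently holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv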

Regards, Freddie.

With this configuration, the problem seems to be solved:

[backend]
rank-allocator = linear
precision = single

[backend-cuda]
mpi-type = standard
device-id = 0

;[backend-openmp]
;cc = gcc

[constants]
gamma = 1.4
u_dim = 200.0
rho_dim = 1.225
p_dim = 48442.906574394474
T_dim = 137.78825734023883
mu = 0.0001056034482758620
alpha = 0
Pr = 0.72
M = 0.85
Re = 2.32e6
cpTs = 110726.64
cpTref = 88396.164

[solver-time-integrator]
scheme = euler
tstart = 0.0
dt   = 1e-8
tend = 0.2
controller = none
atol = 0.00001
rtol = 0.00001
errest-norm = l2
safety-fact = 0.9
min-fact = 0.3
max-fact = 2.5

[soln-plugin-writer]
dt-out = 0.5e-2
basedir = sol/.
basename = NLR_{t:.10f}
post-action = echo "Wrote file {soln} at time {t} for mesh {mesh}."

[solver]
system = navier-stokes
order  = 3
viscosity-correction = sutherland
shock-capturing = entropy-filter

[solver-entropy-filter]
d-min = 1e-6
p-min = 1e-6
e-tol = 1e-6
niters = 3
formulation = nonlinear
;e-func = physical

[solver-interfaces]
riemann-solver = exact
ldg-beta = 0.5
ldg-tau = 0.1

[solver-interfaces-line]
flux-pts = gauss-legendre-lobatto
quad-deg = 9
quad-pts = gauss-legendre-lobatto

[solver-interfaces-quad]
flux-pts = gauss-legendre-lobatto
quad-deg = 9
quad-pts = gauss-legendre-lobatto

[solver-elements-quad]
soln-pts = gauss-legendre-lobatto
quad-deg = 9
quad-pts = gauss-legendre-lobatto

[solver-elements-hex]
soln-pts = gauss-legendre-lobatto
quad-deg = 9
quad-pts = gauss-legendre-lobatto

[soln-bcs-inlet]
type = char-riem-inv
rho = rho_dim
u = u_dim*cos((alpha*3.14/180))
v = u_dim*sin(alpha*3.14/180)
p = p_dim
w = 0

[soln-bcs-outlet]
type = char-riem-inv
rho = rho_dim
u = u_dim*cos((alpha*3.14/180))
v = u_dim*sin(alpha*3.14/180)
p = p_dim
w = 0

[soln-bcs-wall]
type = no-slp-adia-wall

[soln-ics]
rho = rho_dim
u = u_dim*cos((alpha*3.14/180))
v = u_dim*sin(alpha*3.14/180)
p = p_dim
w = 0

[soln-plugin-fluidforce-wall]
nsteps = 1000
file = force_NLR.csv
header = true
quad-deg = 6
morigin = (0.0,0.0, 0.0)

[soln-plugin-residual]
nsteps = 1000
file = residual_NLR.csv
header = true
norm = inf

[soln-plugin-nancheck]
nsteps = 90

To run the simulation on the HPC, I'm using the script below:

#!/bin/bash
#SBATCH --nodes=8       # number of nodes
#SBATCH --partition=parallel  # partition name
#SBATCH --error=myJob.err
#SBATCH --output=myJob.out
#SBATCH --ntasks=32
#SBATCH --gres=gpu:4
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
#export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
#pyfr -p  import NLR.msh NLR.pyfrm

pyfr -p partition 32 -p scotch NLR.pyfrm .
export PATH=/lustre/home/user/gcc-13/bin:$PATH
export LD_LIBRARY_PATH=/lustre/home/user/gcc-13/lib64:$LD_LIBRARY_PATH
export OMP_NUM_THREADS=2
export PATH=/usr/mpi/gcc/openmpi-4.1.0rc5/bin:$PATH
export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.0rc5/lib64:$LD_LIBRARY_PATH
mpiexec  --bind-to core --map-by ppr:4:node -np 32 stdbuf -oL -eL pyfr -p run -b cuda  NLR.pyfrm NLR.ini     

However, with this setup the wall-time is around 350 hours, which is difficult to manage for my case. I've experimented with various [backend-cuda] settings, but the simulation still runs very slowly. I'd greatly appreciate any suggestions on how to improve the .ini file or the execution script to help reduce the simulation time. I'm also wondering whether there are any common issues or inefficiencies in my setup that could explain such long runtimes.
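
For context, a quick back-of-envelope check of what that wall-time implies per step, using only the tend, dt and wall-time figures quoted above (a sketch, not a measurement):

awk 'BEGIN {
  steps = 0.2 / 1e-8          # tend / dt with the fixed-step euler scheme
  secs  = 350 * 3600          # quoted wall-time in seconds
  printf "time steps         : %.0f\n", steps
  printf "wall-time per step : %.0f ms\n", 1000 * secs / steps
}'

That is roughly 2e7 explicit euler steps at about 63 ms each, so the step count imposed by dt = 1e-8 accounts for a large share of the total on its own.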

Regards,
Luca_P

I would check that all four GPUs on each node are actually being used. You can check this by logging into one of the nodes and running nvidia-smi while PyFR is running.
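
For example, something like the following refreshes a per-GPU utilization and memory summary every few seconds (one possible query; the default nvidia-smi output additionally lists the processes on each GPU):

# All four GPUs on the node should show activity while the job is running
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5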

Regards, Freddie.