I was trying to simulate the flow around an airfoil at a Reynolds number of 2.32 million using 32 GPUs (8 nodes) on an HPC cluster. The compute nodes are 32 Dell C4140s, each equipped with 4 NVIDIA V100 GPUs, 2 second-generation Intel Xeon Gold CPUs, and 2 480 GB SATA SSDs. Each node also has 22 64 GB RAM modules, for a total of 1.375 TiB (1408 GB), and each GPU has 32 GB of memory. The mesh is third order and consists of 3 million hexahedra and about 82 million solution points; the mesh file is approximately 7 GB.
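For reference, here is a rough back-of-envelope estimate of what one solution-sized array should occupy per GPU (assuming 5 conservative variables and single precision; the true footprint is some multiple of this, since I don't know how many such arrays the solver keeps resident):

n_pts  = 82_000_000   # solution points quoted above
n_vars = 5            # conservative variables for 3D Navier-Stokes (assumed)
n_gpus = 32
nbytes = 4            # single precision

one_array_gb = n_pts * n_vars * nbytes / 1e9
print(f'one solution-sized array: {one_array_gb:.2f} GB in total, '
      f'{one_array_gb / n_gpus * 1e3:.0f} MB per GPU')
# ~1.64 GB in total, ~51 MB per GPU; the full working set is some unknown
# multiple of this (time-stepper registers, flux and gradient storage, ...)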
After importing and partitioning the mesh, the simulation starts and then crashes almost immediately with the following error:
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/home/user/pyfr-venv/lib/python3.12/site-packages/pyfr/backends/cuda/base.py", line 109, in _malloc_impl
data = self.cuda.mem_alloc(nbytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/home/user/pyfr-venv/lib/python3.12/site-packages/pyfr/backends/cuda/driver.py", line 485, in mem_alloc
return CUDADevAlloc(self, nbytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/home/user/pyfr-venv/lib/python3.12/site-packages/pyfr/backends/cuda/driver.py", line 223, in __init__
cuda.lib.cuMemAlloc(ptr, nbytes)
File "/lustre/home/user/pyfr-venv/lib/python3.12/site-packages/pyfr/ctypesutil.py", line 37, in _errcheck
raise self._statuses[status]
pyfr.backends.cuda.driver.CUDAOutofMemory
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 26 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 746141.0 ON hpc-wn10 CANCELLED AT 2025-05-20T15:35:53 ***
srun: error: hpc-wn31: tasks 25,27: Killed
srun: error: hpc-wn31: task 26: Exited with exit code 1
srun: error: hpc-wn32: tasks 28-29,31: Killed
srun: error: hpc-wn31: task 24: Killed
srun: error: hpc-wn27: tasks 12-13,15: Killed
srun: error: hpc-wn10: tasks 0,2-3: Killed
srun: error: hpc-wn28: tasks 17-19: Killed
srun: error: hpc-wn32: task 30: Killed
srun: error: hpc-wn26: tasks 8-9,11: Killed
srun: error: hpc-wn29: tasks 20-23: Killed
srun: error: hpc-wn28: task 16: Killed
srun: error: hpc-wn26: task 10: Killed
srun: error: hpc-wn23: tasks 5-7: Killed
srun: error: hpc-wn23: task 4: Killed
slurmstepd: error: mpi/pmix_v3: _errhandler: hpc-wn27 [3]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.746141.0:12]
slurmstepd: error: mpi/pmix_v3: _errhandler: hpc-wn10 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.746141.0:0]
srun: error: hpc-wn27: task 14: Killed
srun: error: hpc-wn10: task 1: Killed
This is the script I use to start the simulation:
#!/bin/bash
#SBATCH --nodes=8 # number of nodes
#SBATCH --partition=gpus # partition name
#SBATCH --error=myJob.err
#SBATCH --gres=gpu:4
#SBATCH --output=myJob.out
#SBATCH --gpu-bind=closest
#SBATCH --ntasks-per-gpu=1
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
#pyfr -p import NLR.msh NLR.pyfrm
pyfr -p partition 32 -p scotch NLR.pyfrm .
srun -n 32 stdbuf -oL -eL pyfr -p run -b cuda NLR.pyfrm NLR.ini
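To rule out a bad rank-to-GPU mapping, something like the following sketch could be launched with the same srun line (assuming mpi4py and pynvml are installed in the same venv; the file name is just a placeholder):

# rank_gpu_check.py -- quick diagnostic, separate from the PyFR run itself
import os
from mpi4py import MPI
import pynvml

rank = MPI.COMM_WORLD.Get_rank()
host = MPI.Get_processor_name()
visible = os.environ.get('CUDA_VISIBLE_DEVICES', '<unset>')

# NVML enumerates every GPU on the node, independent of CUDA_VISIBLE_DEVICES
pynvml.nvmlInit()
free_gib = []
for i in range(pynvml.nvmlDeviceGetCount()):
    mem = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i))
    free_gib.append(round(mem.free / 2**30, 1))
pynvml.nvmlShutdown()

print(f'rank {rank:2d} on {host}: CUDA_VISIBLE_DEVICES={visible}, '
      f'free GiB per GPU = {free_gib}')

If every rank on a node reported the same CUDA_VISIBLE_DEVICES value, or one GPU per node showed far less free memory than the others, the out-of-memory error would point to a binding problem rather than a mesh-size problem.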
In the .ini file I use:
[backend]
precision = single
[backend-cuda]
mpi-type = cuda-aware
device-id = local-rank
I also inspected a solution file with h5ls, and the partitioning seems to be well balanced, as shown below:
h5ls NLR_0.0001000000.pyfrs
config Dataset {SCALAR}
config-0 Dataset {SCALAR}
mesh_uuid Dataset {SCALAR}
soln_hex_p0 Dataset {64, 5, 191726}
soln_hex_p1 Dataset {64, 5, 191776}
soln_hex_p10 Dataset {64, 5, 195291}
soln_hex_p11 Dataset {64, 5, 192212}
soln_hex_p12 Dataset {64, 5, 195398}
soln_hex_p13 Dataset {64, 5, 192109}
soln_hex_p14 Dataset {64, 5, 195327}
soln_hex_p15 Dataset {64, 5, 194558}
soln_hex_p2 Dataset {64, 5, 194928}
soln_hex_p3 Dataset {64, 5, 192004}
soln_hex_p4 Dataset {64, 5, 192012}
soln_hex_p5 Dataset {64, 5, 192668}
soln_hex_p6 Dataset {64, 5, 195094}
soln_hex_p7 Dataset {64, 5, 192714}
soln_hex_p8 Dataset {64, 5, 194386}
soln_hex_p9 Dataset {64, 5, 192557}
stats Dataset {SCALAR}
(This output is for 16 partitions, but the balance is similar with 32.)
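The same balance check can also be done programmatically; a minimal h5py sketch (using the solution file name from above) would be:

import h5py

with h5py.File('NLR_0.0001000000.pyfrs', 'r') as f:
    # each soln_hex_pN dataset has shape (points per element, variables, elements)
    counts = {name: ds.shape[2] for name, ds in f.items()
              if name.startswith('soln_hex_p')}

total = sum(counts.values())
print(f'{len(counts)} partitions, {total} elements in total, '
      f'min {min(counts.values())}, max {max(counts.values())} per partition')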
Could it be related to the large number of elements per GPU?
Are there any recommended strategies for fixing this problem?
Thanks in advance for any help or insight.