Hello everyone,
I’m following up on a previous thread I opened about a similar issue (https://pyfr.discourse.group/t/error-cudaoutofmemory/1264), but I’m starting this new topic because I’m now running a different simulation setup. The new case uses similar (but not identical) flow parameters (Mach number, Reynolds number, etc.) on a different airfoil, so the meshes of the two cases are comparable in size and structure, but not identical.
Here’s what I’ve done so far:
1. Initial Attempt (CUDA-aware, local-rank)
I first attempted to run the simulation with the following .sbatch file:
#!/bin/bash
#SBATCH --job-name=oat_t2
#SBATCH --nodes=5
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --output=out.out
#SBATCH --error=err.err
#SBATCH --partition=parallel
And the following backend configuration:
[backend]
precision = single
rank-allocator = linear
[backend-cuda]
mpi-type = cuda-aware
device-id = local-rank
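For completeness, the solver itself is launched from this job script with something along the following lines (the mesh and config file names are placeholders for my actual files):
# one MPI rank per Slurm task defined above; -p just enables the progress bar
srun pyfr run -b cuda -p oat.pyfrm oat.ini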
However, I encountered a CUDA out of memory error — even when using up to 36 GPUs, unless the mesh was extremely coarse (e.g., order 1 with 1–2 layers in the spanwise direction).
2. More Complex Mesh (Order 2, 10 Spanwise Layers)
Since I’m interested in running more complex simulations, I then tested a mesh of order 2 with 10 layers in the spanwise direction. Based on suggestions from my previous thread, I also tried changing the backend settings to:
device-id = 0
mpi-type = standard
In this configuration, I requested 9 nodes with 4 GPUs each. The simulation started and ran, but only 1 GPU per node was utilized, which made the wall time very long.
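(My understanding is that device-id = 0 pins every rank on a node to the same physical GPU, whereas local-rank gives each rank the GPU matching its node-local rank. A roughly equivalent effect can also be obtained with a small per-rank wrapper script, sketched below using Slurm’s SLURM_LOCALID variable; I have not verified this wrapper on my cluster, so treat it as an illustration only.)
#!/bin/bash
# hypothetical wrapper (e.g. saved as run_rank.sh and started with "srun ./run_rank.sh"):
# expose to each rank only the GPU whose index equals its node-local rank, so that
# device-id = 0 inside PyFR then maps to a different physical GPU on every rank
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec pyfr run -b cuda -p oat.pyfrm oat.ini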
I also tested running on 20 nodes with 1 GPU per node, and although the simulation launched, running nvidia-smi revealed that not all GPUs on all nodes were being used.
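To check this I looked at per-GPU utilization and memory on the compute nodes with something like:
# refreshes every 5 s; idle GPUs show 0 % utilization and only a few MiB of memory in use
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5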
For this 20-partition case, the .pyfrm file appears to be correctly partitioned. Here is the output from h5ls oat.pyfrm:
bcon_inlet_p12 Dataset {180}
bcon_outlet_p12 Dataset {60}
bcon_wall_p0 Dataset {3440}
bcon_wall_p1 Dataset {3520}
bcon_wall_p10 Dataset {2860}
bcon_wall_p11 Dataset {2500}
bcon_wall_p13 Dataset {2040}
bcon_wall_p14 Dataset {2500}
bcon_wall_p15 Dataset {3240}
bcon_wall_p16 Dataset {3400}
bcon_wall_p17 Dataset {3540}
bcon_wall_p18 Dataset {3620}
bcon_wall_p19 Dataset {3320}
bcon_wall_p2 Dataset {3680}
bcon_wall_p4 Dataset {720}
bcon_wall_p5 Dataset {3140}
bcon_wall_p6 Dataset {2840}
bcon_wall_p7 Dataset {3430}
bcon_wall_p8 Dataset {3490}
bcon_wall_p9 Dataset {1020}
con_p0 Dataset {2, 661221}
con_p0p1 Dataset {944}
con_p0p4 Dataset {680}
con_p0p7 Dataset {900}
con_p1 Dataset {2, 661083}
con_p10 Dataset {2, 661405}
con_p10p11 Dataset {1322}
con_p10p12 Dataset {400}
con_p10p15 Dataset {1020}
con_p11 Dataset {2, 661582}
con_p11p10 Dataset {1322}
con_p11p12 Dataset {20}
con_p11p14 Dataset {1406}
con_p12 Dataset {2, 662134}
con_p12p10 Dataset {400}
con_p12p11 Dataset {20}
con_p12p13 Dataset {922}
con_p12p14 Dataset {382}
con_p12p15 Dataset {280}
con_p12p16 Dataset {220}
con_p12p17 Dataset {240}
con_p12p18 Dataset {180}
con_p12p19 Dataset {220}
con_p12p3 Dataset {330}
con_p12p4 Dataset {200}
con_p12p6 Dataset {120}
con_p12p9 Dataset {390}
con_p13 Dataset {2, 661451}
con_p13p12 Dataset {922}
con_p13p14 Dataset {1280}
con_p13p6 Dataset {1280}
con_p14 Dataset {2, 661416}
con_p14p11 Dataset {1406}
con_p14p12 Dataset {382}
con_p14p13 Dataset {1280}
con_p15 Dataset {2, 661432}
con_p15p10 Dataset {1020}
con_p15p12 Dataset {280}
con_p15p16 Dataset {1002}
con_p16 Dataset {2, 661424}
con_p16p12 Dataset {220}
con_p16p15 Dataset {1002}
con_p16p17 Dataset {942}
con_p17 Dataset {2, 661378}
con_p17p12 Dataset {240}
con_p17p16 Dataset {942}
con_p17p18 Dataset {934}
con_p18 Dataset {2, 661317}
con_p18p12 Dataset {180}
con_p18p17 Dataset {934}
con_p18p19 Dataset {1044}
con_p19 Dataset {2, 661333}
con_p19p12 Dataset {220}
con_p19p18 Dataset {1044}
con_p19p2 Dataset {1030}
con_p19p4 Dataset {132}
con_p1p0 Dataset {944}
con_p1p4 Dataset {1782}
con_p2 Dataset {2, 661331}
con_p2p19 Dataset {1030}
con_p2p4 Dataset {1034}
con_p3 Dataset {2, 662153}
con_p3p12 Dataset {330}
con_p3p4 Dataset {1962}
con_p3p9 Dataset {1814}
con_p4 Dataset {2, 660521}
con_p4p0 Dataset {680}
con_p4p1 Dataset {1782}
con_p4p12 Dataset {200}
con_p4p19 Dataset {132}
con_p4p2 Dataset {1034}
con_p4p3 Dataset {1962}
con_p4p7 Dataset {280}
con_p4p9 Dataset {580}
con_p5 Dataset {2, 661451}
con_p5p6 Dataset {1122}
con_p5p9 Dataset {1242}
con_p6 Dataset {2, 661424}
con_p6p12 Dataset {120}
con_p6p13 Dataset {1280}
con_p6p5 Dataset {1122}
con_p6p9 Dataset {160}
con_p7 Dataset {2, 661105}
con_p7p0 Dataset {900}
con_p7p4 Dataset {280}
con_p7p8 Dataset {1032}
con_p7p9 Dataset {560}
con_p8 Dataset {2, 661074}
con_p8p7 Dataset {1032}
con_p8p9 Dataset {1742}
con_p9 Dataset {2, 660455}
con_p9p12 Dataset {390}
con_p9p3 Dataset {1814}
con_p9p4 Dataset {580}
con_p9p5 Dataset {1242}
con_p9p6 Dataset {160}
con_p9p7 Dataset {560}
con_p9p8 Dataset {1742}
mesh_uuid Dataset {SCALAR}
spt_hex_p0 Dataset {27, 221401, 3}
spt_hex_p1 Dataset {27, 221402, 3}
spt_hex_p10 Dataset {27, 221402, 3}
spt_hex_p11 Dataset {27, 221402, 3}
spt_hex_p12 Dataset {27, 221402, 3}
spt_hex_p13 Dataset {27, 221404, 3}
spt_hex_p14 Dataset {27, 221400, 3}
spt_hex_p15 Dataset {27, 221401, 3}
spt_hex_p16 Dataset {27, 221402, 3}
spt_hex_p17 Dataset {27, 221402, 3}
spt_hex_p18 Dataset {27, 221402, 3}
spt_hex_p19 Dataset {27, 221402, 3}
spt_hex_p2 Dataset {27, 221401, 3}
spt_hex_p3 Dataset {27, 221402, 3}
spt_hex_p4 Dataset {27, 221402, 3}
spt_hex_p5 Dataset {27, 221401, 3}
spt_hex_p6 Dataset {27, 221395, 3}
spt_hex_p7 Dataset {27, 221402, 3}
spt_hex_p8 Dataset {27, 221402, 3}
spt_hex_p9 Dataset {27, 221403, 3}
So all 20 partitions (p0 to p19) appear to be present and connected as expected; the spt_hex_p* datasets also show that each partition holds roughly 221,400 hexahedra, so the element counts look well balanced.
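For scale, here is a back-of-envelope estimate of just the raw solution storage for this order-2 mesh (20 partitions of roughly 221,400 hexes, 27 solution points per hex, 5 conservative variables, 4 bytes each in single precision); PyFR also needs flux-point data, time-stepper registers and MPI buffers on top of this, so the real per-GPU footprint is a multiple of this figure:
# ~4.43M hexes x 27 points x 5 variables x 4 bytes ≈ 2.2 GiB for one copy of the solution
python3 -c "print(20 * 221400 * 27 * 5 * 4 / 2**30, 'GiB')"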
3. Back to CUDA-aware + local-rank + More CPUs
I then returned to the previous backend setup:
device-id = local-rank
mpi-type = cuda-aware
and increased --cpus-per-task to 6. With this setup, the simulation finally launched correctly (with a wall time of 95 h) and utilized all GPUs across the requested nodes.
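Concretely, the per-node part of the job header now reads as before, plus the extra CPU line (I omit the node count here, since it varies between cases):
# the only change relative to the original header is the --cpus-per-task line
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6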
However, the simulation time is still long and hard to manage, especially considering I’m trying to run high-fidelity cases. While requesting more computational resources is possible, it often involves significant queueing delays on my HPC system.
Questions
Given all this, I’d appreciate advice on the following:
- Are there any optimizations (either PyFR-specific or general CUDA/MPI/HPC strategies) that could help reduce simulation time or memory usage in such cases?
- Would increasing the number of CPUs per task improve GPU utilization and performance?
- Or is the CUDA out of memory error likely caused by something else, such as mesh partitioning, load imbalance, or backend configuration?
Additionally, I plan to run even heavier meshes in future simulations, so any suggestions that would scale well with increasing mesh complexity would be especially helpful.
Any insight or guidance would be highly appreciated.
Best regards,
Luca_P