Error: CUDAOutofMemory on A30

Hello everybody.
I’m new to PyFR and I’ve run into a CUDA out-of-memory error; hopefully I can get some help from the community.
When I run the case on one cluster (A100, 80 GB) there is no problem and the reported memory usage is around 60 GB, but when I move the case to another cluster (6× A30, 24 GB each) I get the error pyfr.backends.cuda.driver.CUDAOutofMemory. I’m guessing it’s related to how the CUDA device is specified, but I have no solution for this.
Best regards

[backend-cuda]
device-id = local-rank ;round-robin

Are you partitioning your grid and running with mpirun -np 6? Otherwise PyFR will only use a single GPU.

Regards, Freddie.

Hello, thank you very much for your reply.
I partitioned the mesh first and then ran mpiexec -n 6 pyfr run -b cuda -p x.pyfrm x.ini, but it raises an error:

RuntimeError(f'Mesh has {nparts} partitions but running '
RuntimeError: Mesh has 6 partitions but running with 1 MPI ranks

I tried mpirun -n 6 and got the same error. I think it is an mpi4py problem, because I initially installed mpi4py via conda, and I have no clue about installing it via pip (cf. this article). Finally, I have a couple of questions I would like to ask you; thank you for your advice.
Question 1: I thought that for GPU computing one CPU core would be enough and that more cores just add extra communication overhead. Since mpiexec -n 6 or mpirun -n 6 launches 6 processes, does that mean using 6 GPUs requires at least 6 CPU cores?
Question 2: I tested mpi4py with the code below and didn’t get the expected results, so I concluded it was an installation problem with mpi4py.
from mpi4py import MPI

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    print(f"rank {comm.rank} of {comm.size}")
    comm.barrier()

rank 0 of 1
rank 0 of 1
rank 0 of 1

I also tried disabling the MPI environment in the CUDA toolchain and installing mpi4py directly against the default MPI:

  nvc-Error-Unknown switch: -Wno-unused-result
  nvc-Error-Unknown switch: -fwrapv
  warning: build_ext: command '/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/nvc' failed with exit code 1
  warning: build_ext: building optional extension "mpi4py.dl" failed
  checking for MPI compile and link ...

Unfortunately, I couldn’t find the mpi4py.dl file:

find / -name mpi4py.dl
find: ‘/run/user/1001/doc’: Permission denied
find: ‘/run/user/1001/gvfs’: Permission denied

It’s worth noting that a smaller example that fits in memory runs without any problem on a single A30, just as it did on the A100.
Regards, wgbb

I would try and get your MPI issues resolved outside of PyFR. It looks as though there is something not quite right about your mpi4py.

I’m not that familiar with conda, but what could’ve happened is that when you installed mpi4py conda also installed a different version of MPI that it depends on. I’m not sure how to fix this in conda.

But what I would do is use a Python virtual environment. To do this, go to a fresh terminal (i.e. not in a conda environment):

$ cd /path/to/where/you/want/the/venv
$ python3.1x -m venv test_venv
$ cd test_venv
$ . ./bin/activate
(test_venv) $ pip install mpi4py

With this venv active, you can then try and run an mpi4py test script:

from mpi4py import MPI

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    print(f"rank {comm.rank} of {comm.size}")
    comm.barrier()

using

(test_venv) $ mpiexec -n 3 python mpi_test.py

(I don’t think mpirun is actually in the MPI standard, but it is implemented by many of the standard libraries. As a result, the behaviour of mpirun can differ from library to library.)

Thank you for your answer! Avoiding conda turned out to be a good way to solve the mpi4py problem:

rank 0 of 3
rank 1 of 3
rank 2 of 3

But I’m still getting out-of-memory errors, even though each rank’s share of the problem clearly shouldn’t need more than a single GPU’s memory:

pyfr.backends.cuda.driver.CUDAOutofMemory
[a30n3:1115216] 5 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[a30n3:1115216] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[a30n3:1115216] 2 more processes have sent help message help-mpi-api.txt / mpi-abort

In [backend-cuda] I tried both round-robin and local-rank, but the problem still exists. I also tried specifying a GPU number directly, e.g. device-id = 1, and that runs without any problem.

[backend-cuda]
device-id = round-robin ; local-rank
mpi-type = cuda-aware ; standard

Also, could you help me clear up the following confusion?

Question 1: I thought that for GPU computing one CPU core would be enough and that more cores just add extra communication overhead. Since mpiexec -n 6 or mpirun -n 6 launches 6 processes, does that mean using 6 GPUs requires at least 6 CPU cores?

Regards, wgbb.

To answer your question: you want one rank per GPU, and one core per rank is normally fine. There are some optimisations you can do with core-GPU binding, but that isn’t worth thinking about at this stage.

Can you have a look at the output of h5ls your_mesh_file.pyfrm? The sizes of the various spt_{etype}_p{int} datasets will tell you how many elements there are per rank. It might be that your partitioner wasn’t very balanced in the way it shared out the work, and so one or more ranks are getting too much work to fit in memory.
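
If it is easier, here is a minimal Python sketch (using h5py; the file name is a placeholder) that tallies the per-partition element counts from the spt_* datasets:

import re
from collections import defaultdict

import h5py

counts = defaultdict(dict)
with h5py.File('your_mesh_file.pyfrm', 'r') as f:
    for name, obj in f.items():
        m = re.match(r'spt_(\w+)_p(\d+)$', name)
        if m and isinstance(obj, h5py.Dataset):
            # Dataset shape is (nupts, nelems, ndims); index 1 is the element count
            counts[int(m.group(2))][m.group(1)] = obj.shape[1]

for part in sorted(counts):
    print(f'partition {part}: {counts[part]}')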

Please use local-rank unless your GPUs are in compute exclusive mode.
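
For reference, here is a rough sketch (not PyFR’s actual code) of what local-rank style device selection amounts to with mpi4py: ranks sharing a node are numbered 0, 1, 2, … and each takes the matching GPU.

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Ranks on the same node get their own communicator; the rank within it is
# the "local rank", which can be used directly as a CUDA device index.
local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)

print(f'world rank {comm.rank} -> local rank {local_comm.rank} '
      f'-> GPU {local_comm.rank}')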

Regards, Freddie.

Following the approach you suggested, I looked at how the elements are split across the partitions:

spt_hex_p0               Dataset {8, 98850, 3}
spt_hex_p1               Dataset {8, 27780, 3}
spt_hex_p2               Dataset {8, 63080, 3}
spt_hex_p3               Dataset {8, 99050, 3}
spt_hex_p4               Dataset {8, 35690, 3}
spt_hex_p5               Dataset {8, 58600, 3}
spt_pri_p0               Dataset {6, 163679, 3}
spt_pri_p1               Dataset {6, 288533, 3}
spt_pri_p2               Dataset {6, 227997, 3}
spt_pri_p3               Dataset {6, 160784, 3}
spt_pri_p4               Dataset {6, 277378, 3}
spt_pri_p5               Dataset {6, 235529, 3}

I’m guessing this is the problem you’re talking about, because a much coarser mesh split into 6 partitions runs on 6 GPUs without any problem:

|===================================
|    0   N/A  N/A   1212100      C   
|    1   N/A  N/A   1212101      C   
|    2   N/A  N/A   1212102      C   
|    3   N/A  N/A   1212103      C   
|    4   N/A  N/A   1212104      C   
|    5   N/A  N/A   1212105      C   
|====================================

I am using Pointwise to generate the .msh mesh, and then I partition it with

pyfr partition 6 x.pyfrm .

Is there any way to get a more balanced partitioning?
Maybe I could split the mesh manually to get evenly sized partitions?

Regards, wgbb.

Yep, if you do:

$ pyfr partition --help

You’ll see that there’s a way to pass weighting options for the different element types. If you’re using METIS as the partitioner, maybe try -e balanced.
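
For instance, here is a small illustrative sketch of how the per-rank load can be compared once element types are weighted; the weights below are made up for illustration and the counts are the h5ls figures you posted above.

# Illustrative weights: suppose a hex costs 3 units of work and a prism 2.
weights = {'hex': 3, 'pri': 2}

# Element counts per partition, taken from the h5ls listing above.
hexes = [98850, 27780, 63080, 99050, 35690, 58600]
pris = [163679, 288533, 227997, 160784, 277378, 235529]

loads = [weights['hex']*h + weights['pri']*p for h, p in zip(hexes, pris)]
print('per-rank weighted load:', loads)
print('imbalance (max/mean): %.3f' % (max(loads) / (sum(loads) / len(loads))))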

Can you confirm that everything is now working with 6 partitions and local-rank?

Regards, Freddie.

Thanks to WillT for the suggestion! This approach was very helpful. I used

pyfr partition -e hex:3 -e pri:2 6 test_3d.pyfrm .

followed by

mpiexec -n 6 pyfr run -b cuda -p test_3d.pyfrm 3d.ini

and now it runs fine, using all 6 GPUs without any problems.

But I’m confused. Using the default partitioning,

pyfr partition 6 test_3d.pyfrm .

the partitions look like this:

spt_hex_p0               Dataset {8, 98850, 3}
spt_hex_p1               Dataset {8, 27780, 3}
spt_hex_p2               Dataset {8, 63080, 3}
spt_hex_p3               Dataset {8, 99050, 3}
spt_hex_p4               Dataset {8, 35690, 3}
spt_hex_p5               Dataset {8, 58600, 3}
spt_pri_p0               Dataset {6, 163679, 3}
spt_pri_p1               Dataset {6, 288533, 3}
spt_pri_p2               Dataset {6, 227997, 3}
spt_pri_p3               Dataset {6, 160784, 3}
spt_pri_p4               Dataset {6, 277378, 3}
spt_pri_p5               Dataset {6, 235529, 3}

while using

pyfr partition -e hex:3 -e pri:2 6 test_3d.pyfrm .

the partitions look like this:

/spt_hex_p0              Dataset {8, 63260, 3}
/spt_hex_p1              Dataset {8, 99050, 3}
/spt_hex_p2              Dataset {8, 31456, 3}
/spt_hex_p3              Dataset {8, 98850, 3}
/spt_hex_p4              Dataset {8, 25984, 3}
/spt_hex_p5              Dataset {8, 64450, 3}
/spt_pri_p0              Dataset {6, 227313, 3}
/spt_pri_p1              Dataset {6, 171636, 3}
/spt_pri_p2              Dataset {6, 275600, 3}
/spt_pri_p3              Dataset {6, 173773, 3}
/spt_pri_p4              Dataset {6, 280741, 3}
/spt_pri_p5              Dataset {6, 224837, 3}

They don’t look much different, and the maximum number of elements per partition is similar, so I don’t quite understand why it now works properly. Incidentally, -e balanced gives an error:

ewts = {e: int(w) for e, w in (ew.split(':') for ew in args.elewts)}
ValueError: not enough values to unpack (expected 2, got 1)
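
Presumably this is because each -e argument is parsed as etype:weight, as in the line from the traceback above, so the word balanced has no colon to split on. A minimal reproduction:

# Each -e argument is split on ':' into (element type, weight); 'balanced'
# has no ':' so the two-way unpack fails, matching the traceback above.
for arg in ['hex:3', 'balanced']:
    try:
        e, w = arg.split(':')
        print(arg, '->', {e: int(w)})
    except ValueError as err:
        print(arg, '->', err)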

Regards, wgbb.

Thanks to fdw’s suggestion,

I can confirm that it is now using local-rank and running normally on GPU0-GPU5 via mpiexec -n 6 pyfr run -b cuda -p test_3d.pyfrm 3d.ini.

Regards, wgbb.