GPU parallelization and scalability

I am running a simulation of the 2D flow around an airfoil on an HPC system. I am using a 3rd order C-type mesh with a total of 350964 nodes, order = 3 in the [solver] section, and dt = 1e-7 with the euler scheme.
I have tried different numbers of partitions and different backend-cuda options in my .ini file, but the wall time does not seem to be affected by those variations. I also tried changing cublas-nkerns, gimmik-nkerns and gimmik-nbench from their defaults, without results.
On this HPC every node has 4 NVIDIA V100 32 GB GPUs. Here are the different configurations and the results for the different partitions used:



I think this could be related to the relatively simple case I am running, but I was wondering whether it is possible to improve GPU utilization in order to reduce the resources needed for the simulation, and to understand more about how PyFR is affected by these factors.

GPU utilization isn’t the best metric for understanding how efficiently you are using resources, as it is derived from the number of SMs that are busy. Instead, we tend to use GDoF/s/GPU as the metric for PyFR. For a 3D compressible case, the number of DoF processed is (number of solution points) × (number of RK substeps) × 5.
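As a rough illustration of how you might turn those numbers into GDoF/s/GPU, here is a small Python sketch; the function and the sample figures at the bottom are purely hypothetical and just plug into the formula above.

```python
def gdof_per_sec_per_gpu(nsolnpts, nrhs_evals, nvars, wall_time, ngpus):
    """Throughput in giga degrees of freedom per second per GPU.

    nsolnpts   -- total number of solution points in the mesh
    nrhs_evals -- number of RHS evaluations over the timed interval
                  (= steps x RK substeps for an explicit scheme)
    nvars      -- conserved variables: 5 for 3D compressible flow, 4 in 2D
    wall_time  -- elapsed wall-clock time in seconds for that interval
    ngpus      -- number of GPUs used
    """
    dof = nsolnpts * nvars
    return dof * nrhs_evals / wall_time / ngpus / 1e9


# Hypothetical numbers purely for illustration
print(gdof_per_sec_per_gpu(nsolnpts=2_000_000, nrhs_evals=5000,
                           nvars=5, wall_time=600.0, ngpus=4))
```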

How many elements are in your mesh, and what type of elements are they? I’m assuming that you can just about fit the full case on a single GPU; can you try doing a strong scaling run, starting at 1 GPU and working your way up?
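For reference, once you have those runs a strong scaling study is just a speedup and parallel-efficiency calculation at a fixed problem size; a minimal Python sketch with made-up timings:

```python
# Hypothetical wall times (s) for the same case on 1, 2 and 4 GPUs
timings = {1: 1200.0, 2: 640.0, 4: 360.0}

base = timings[1]
for ngpus, t in sorted(timings.items()):
    speedup = base / t
    print(f'{ngpus} GPU(s): speedup {speedup:.2f}, efficiency {speedup / ngpus:.1%}')
```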

You can use the solution writer plugin to get the wall time more accurately. The .pyfrs files output by PyFR are just HDF5 files containing a dataset called stats. To get the run time, do a short run and set up the solution writer to output a few solution files. Then take, say, the second and third files and use the command $ h5dump -d stats <your solution file> to get the runtime; I say the second and third files as the first one will contain the compilation time. PyFR even reports the number of RHS evaluations, so you can calculate the DoF/s more accurately.
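If you would rather stay in Python than call h5dump, something along these lines should work with h5py; the file name is a placeholder and the exact keys inside stats can vary between PyFR versions, so print everything and pick out the wall time and the RHS evaluation count from there.

```python
import configparser

import h5py

# The stats dataset inside a .pyfrs file is an INI-formatted string
with h5py.File('soln-0002.pyfrs', 'r') as f:
    stats_text = f['stats'][()].decode()

stats = configparser.ConfigParser()
stats.read_string(stats_text)

# Dump every section/key so you can locate the wall time and the
# number of RHS evaluations for your PyFR version
for section in stats.sections():
    for key, val in stats.items(section):
        print(f'[{section}] {key} = {val}')
```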

As for the device ID, in most cases local-rank is what you want. This means that rank 0 on a node will use device 0, and so on. Also, if your MPI has been compiled to be CUDA aware, you probably want to enable CUDA-aware MPI, as on modern NVIDIA systems this is significantly faster. You may need to fiddle with the binding when launching PyFR with mpiexec to ensure that the CPU core used by rank 0 is closest to device 0. This is a handy tool for that: binding/binder.sh at 73ea6a2e7e37ec702cb6371e5c88022a1711523a · LStuber/binding · GitHub
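To make the local-rank idea concrete, here is a short mpi4py sketch of the mapping that device-id = local-rank amounts to, splitting the world communicator by node so that each rank picks the GPU with its node-local index; this is only an illustration of the concept, not PyFR’s actual code.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Ranks on the same node share this communicator, so the rank within
# it is the node-local rank (0..3 on a 4-GPU node)
local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)

# device-id = local-rank in the backend-cuda section amounts to this
device_id = local_comm.rank
print(f'Global rank {comm.rank} -> CUDA device {device_id}')
```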

You want to assign one rank to each GPU. This means having as many mesh partitions as there are GPUs and using the local-rank option (which I think is the default). I would not recommend changing any other settings.

Regards, Freddie.