Is using partitioning combined with CUDA the fastest computing method?

Hello everyone.

I am a newcomer to PyFR, but I want to get up to speed quickly.

I’m attempting to run a three-dimensional case based on the paper “On the utility of GPU accelerated high-order methods for unsteady flow simulations: A comparison with industry-standard tools”.

However, I’ve encountered some difficulties during the learning process, particularly regarding how to obtain computational results faster with PyFR.

My personal computer is equipped with an i7-12700 CPU with 12 cores and 20 threads, as well as an RTX 3090 GPU. PyFR is already installed and configured, and CUDA and mpiexec are available.

However, my understanding of mpiexec is limited, and I’m unsure how the following commands coordinate the simultaneous use of the CPU and GPU:

pyfr import sd7003.msh sd7003.pyfrm

pyfr partition 20 sd7003.pyfrm .

mpiexec -n 20 pyfr run -b cuda -p sd7003.pyfrm sd7003.ini

Specifically, I’m unsure how large the speed difference is between running the commands above and the commands below:

pyfr import sd7003.msh sd7003.pyfrm

pyfr run -b cuda -p sd7003.pyfrm sd7003.ini

Based on my limited experimentation, there doesn’t appear to be a significant speed difference between the two approaches; both are slow. As a newcomer without much experience, should I partition the mesh into the maximum number of processes my CPU supports and combine that with CUDA for the computation? I’m unsure whether this approach is the most efficient, and I’d like to understand, in principle, what advantage it offers over using CUDA alone.

I would greatly appreciate it if you could help clarify my doubts.

Best regards.

When running PyFR with the cuda backend you are best off running one MPI rank per GPU. With the cuda backend all of the compute happens on the GPU, and the CPU is really only used to run plugins.

The advantage of one rank per GPU is that you reduce the need for inter-rank communication (which can have high latency), and with a single rank per GPU you are not paying the overhead of launching multiple copies of the same kernel.
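As a rough sketch of how each rank picks up its own device on a multi-GPU node, you can set the device selection policy in the backend section of your .ini file (the option below is taken from the PyFR user guide for the cuda backend, so do check it against your installed version):

[backend-cuda]
device-id = local-rank

With this, each MPI rank on a node is matched with a different GPU.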

Got it!
But what do you mean by “one MPI rank per GPU”?
Since my personal computer is not a server and I only have one RTX graphics card, does this mean that the partitioning command I mentioned earlier is not meaningful in this situation?

pyfr partition 20 sd7003.pyfrm .

I set the number of partitions to 20 because I intended to use the maximum number of processes my CPU supports, which is 20.
However, since I only have one GPU device rather than 20, the partitioning is not meaningful, and the results with or without it would be the same.

Is my understanding above correct? Thank you for bearing with me; as a newcomer, I may have relatively basic questions. I appreciate it greatly!

Best wishes!

Yep.

As an example, consider this: if I were to run on a machine with 32 compute nodes, each with 3 GPUs, I would partition my mesh into 96 ranks and then request 96 CPU cores, each bound to a single GPU (be careful when doing this, as you want to make sure each CPU core and GPU are directly connected so that data doesn’t have to go via a socket-to-socket bus). I would then do mpiexec -n 96 pyfr run -b cuda ...
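Spelled out with placeholder file names (and leaving the actual CPU–GPU binding to your MPI launcher or job scheduler, since the flags for that vary from system to system), the sequence would look something like:

pyfr import wing.msh wing.pyfrm

pyfr partition 96 wing.pyfrm .

mpiexec -n 96 pyfr run -b cuda wing.pyfrm wing.ini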

So what this means for you is that, when running on your single-GPU machine, you are best off with just a single rank. However, it can be worth running pyfr partition 1 ..., as during this process linear elements will be flagged, and this can save some compute time.
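Putting that together for your case, the sequence would simply be:

pyfr import sd7003.msh sd7003.pyfrm

pyfr partition 1 sd7003.pyfrm .

pyfr run -b cuda -p sd7003.pyfrm sd7003.ini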

I feel extremely frustrated because I encountered issues while computing the sd7003 case provided in “On the utility of GPU accelerated high-order methods for unsteady flow simulations: A comparison with industry-standard tools”.
I downloaded the relevant resources following Appendix A of the paper’s supplementary material.
Although there are some version discrepancies in the ini configuration file, I made some minor modifications and got it to work. Please note that I didn’t change any critical parameters.
However, to my dismay, after running the computation for nearly three days, I received the following error message:

(pyfr) lxy@LXY-LINUX:~/0_CODE/PyFR/3d-demo/sd7003$ pyfr import sd7003.msh sd7003.pyfrm
(pyfr) lxy@LXY-LINUX:~/0_CODE/PyFR/3d-demo/sd7003$ pyfr run -b cuda -p sd7003.pyfrm sd7003.ini 
  47.0% [==============================================>                                                   ] 23.52/50.00 ela: 65:29:45 rem: 73:43:25Traceback (most recent call last):
  File "/home/lxy/anaconda3/envs/pyfr/bin/pyfr", line 8, in <module>
    sys.exit(main())
  File "/home/lxy/anaconda3/envs/pyfr/lib/python3.10/site-packages/pyfr/__main__.py", line 118, in main
    args.process(args)
  File "/home/lxy/anaconda3/envs/pyfr/lib/python3.10/site-packages/pyfr/__main__.py", line 251, in process_run
    _process_common(
  File "/home/lxy/anaconda3/envs/pyfr/lib/python3.10/site-packages/pyfr/__main__.py", line 247, in _process_common
    solver.run()
  File "/home/lxy/anaconda3/envs/pyfr/lib/python3.10/site-packages/pyfr/integrators/base.py", line 115, in run
    self.advance_to(t)
  File "/home/lxy/anaconda3/envs/pyfr/lib/python3.10/site-packages/pyfr/integrators/std/controllers.py", line 198, in advance_to
    self._reject_step(dt, idxprev, err=err)
  File "/home/lxy/anaconda3/envs/pyfr/lib/python3.10/site-packages/pyfr/integrators/std/controllers.py", line 56, in _reject_step
    raise RuntimeError('Minimum sized time step rejected')
RuntimeError: Minimum sized time step rejected

Since this is a case I downloaded from the internet based on a research paper, I feel helpless.
Below are the error message, the ini configuration file, and the computed residual.csv file. I haven’t provided the mesh file, as it is large and can be downloaded from the paper’s appendix. I haven’t made any modifications to the grid.
Could anyone offer me any constructive advice? I would greatly appreciate it.

ini configuration file:

[backend]
precision = double

[constants]
gamma = 1.4
mu    = 3.94405318873308E-6
Pr    = 0.72
M     = 0.2

[solver-time-integrator]
scheme = rk45
controller = pi
tstart = 0.0
tend = 50.0
dt = 0.00001
atol = 0.000001
rtol = 0.000001
safety-fact = 0.5
min-fact = 0.3
max-fact = 1.2

[solver]
system = navier-stokes
order  = 4

[solver-interfaces]
riemann-solver = rusanov
ldg-beta = 0.5
ldg-tau = 0.1

[solver-interfaces-quad]
flux-pts = gauss-legendre

[solver-elements-hex]
soln-pts = gauss-legendre

[soln-bcs-outlet]
type = char-riem-inv
rho = 1.0
u   = 0.2366431913
v   = 0.0
w   = 0.0
p   = 1.0

[soln-bcs-inlet]
type = char-riem-inv
rho = 1.0
u   = 0.2366431913
v   = 0.0
w   = 0.0
p   = 1.0

[soln-bcs-wall]
type = no-slp-adia-wall
cpTw  = 3.5

[soln-ics]
rho  = 1.0
u    = 0.2366431913
v    = 0.001
w    = 0.001*cos(x)*cos(y)
p    = 1.0

[soln-plugin-residual]
nsteps = 100
file = residual.csv
header = true

[soln-plugin-writer]
dt-out = 0.2
basedir = .
basename = sd7003-{n:03d}

computed residual.csv file:

Have you looked at the dtstats? This might give a bit more insight into what is going on.
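If it is not already in your configuration, the time-step statistics can be written out by adding a block like the one below to the .ini file (option names as per the PyFR documentation for the dtstats plugin; check them against your installed version, and adjust flushsteps as you see fit):

[soln-plugin-dtstats]
flushsteps = 500
file = dtstats.csv
header = true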