I am running the euler_vortex_2d case. I used the original mesh and .ini file.

It seemed to work fine with cuda:

`mpiexec -n 2 pyfr run -b cuda -p euler_vortex_2d.pyfrm euler_vortex_2d.ini` took about 1 minute
`mpiexec -n 4 pyfr run -b cuda -p euler_vortex_2d.pyfrm euler_vortex_2d.ini` took about 2 minutes 30 seconds
`mpiexec -n 8 pyfr run -b cuda -p euler_vortex_2d.pyfrm euler_vortex_2d.ini` took about 5 minutes

But there seemed to be some problems with CPU parallelization:

`mpiexec -n 2 pyfr run -b openmp -p euler_vortex_2d.pyfrm euler_vortex_2d.ini` took about 10 seconds
`mpiexec -n 4 pyfr run -b openmp -p euler_vortex_2d.pyfrm euler_vortex_2d.ini` required more than 7.5 hours
`mpiexec -n 8 pyfr run -b openmp -p euler_vortex_2d.pyfrm euler_vortex_2d.ini` required more than 20 hours

Why did it take so much longer when I used more CPUs? My computer has an 8-core, 16-thread AMD CPU and an Nvidia RTX 3080.

On the GPU side of things, we would recommend one MPI rank per GPU, regardless of which GPU-supporting backend you use.

On the CPU side, that test case is quite small, and there is a non-zero overhead associated with each additional MPI rank. If there isn't sufficient work to do, it can be hard to hide the communication latency. This is somewhat related to Amdahl's law.
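To make the saturation concrete, here is a rough sketch of Amdahl's law. The 95% parallel fraction is a made-up illustrative number, not a measurement of this case:

```python
# Amdahl's law: speedup with n workers when a fraction p of the work
# is parallelisable and the remaining 1 - p is serial/communication.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# With a hypothetical 95% parallel fraction, speedup saturates quickly:
for n in (2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.95, n), 2))
# 2 -> 1.9, 4 -> 3.48, 8 -> 5.93, 16 -> 9.14
```

And this models only diminishing returns; it does not yet account for the resource contention discussed below, which can make more ranks actively slower.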

Thanks WillT. I understand the communication latency issue for small problems, but it's really surprising that the run time went up from 10 seconds to several hours.

Sorry, I didn't look too closely at the CPU run times; it is surprising that they went up quite that much, but it isn't that concerning. Given that you are running on a single CPU, you will want at most one MPI rank per NUMA node. So, taking a guess at the specifics of your CPU, either 1 or 2 MPI ranks will be optimal for you. When running on a CPU, PyFR uses a classic hybrid MPI/OpenMP model, so I suspect all the ranks end up fighting over the same threads; hence the poor performance.
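A back-of-the-envelope sketch of the oversubscription, assuming each rank's OpenMP runtime defaults to one thread per hardware thread (the usual OpenMP default when `OMP_NUM_THREADS` is unset):

```python
# If each MPI rank spawns the default number of OpenMP threads
# (one per hardware thread), the total thread count quickly
# exceeds what the hardware can schedule.
hw_threads = 16  # the 8-core / 16-thread CPU from the question

def total_threads(ranks, threads_per_rank=hw_threads):
    return ranks * threads_per_rank

for ranks in (1, 2, 4, 8):
    t = total_threads(ranks)
    print(f"{ranks} ranks -> {t} threads on {hw_threads} hardware threads")
# 8 ranks -> 128 threads on 16 hardware threads
```

If you do want to experiment with multiple ranks, one option is to cap the per-rank thread count explicitly, e.g. `OMP_NUM_THREADS=8 mpiexec -n 2 …`, so that ranks × threads does not exceed the core count.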

I will postfix this with the statement that there are people on this forum who know more about the CPU backend than I do, so if I've got something wrong here I'm sure they'll jump in.

Not really. You are launching a large number of MPI ranks, each of which will spawn some number of threads to work on a tiny (several KB) data set. As such, you'll get cache-line thrashing as the cores/threads fight over the same bit of memory.