CPU parallelization problem for tutorial euler case

verseau · 26 August 2021 12:31

Hi,

I am running the euler_vortex_2d case. I used the original mesh and .ini file.

It seemed to work fine with cuda:

mpiexec -n 2 pyfr run -b cuda -p euler_vortex_2d.pyfrm euler_vortex_2d.ini took about 1 minute
mpiexec -n 4 pyfr run -b cuda -p euler_vortex_2d.pyfrm euler_vortex_2d.ini took about 2 minutes 30 seconds
mpiexec -n 8 pyfr run -b cuda -p euler_vortex_2d.pyfrm euler_vortex_2d.ini took about 5 minutes

But there seemed to be some problems with CPU parallelization:

mpiexec -n 2 pyfr run -b openmp -p euler_vortex_2d.pyfrm euler_vortex_2d.ini took about 10 seconds
mpiexec -n 4 pyfr run -b openmp -p euler_vortex_2d.pyfrm euler_vortex_2d.ini required more than 7.5 hours
mpiexec -n 8 pyfr run -b openmp -p euler_vortex_2d.pyfrm euler_vortex_2d.ini required more than 20 hours

Why did it take so much longer when I used more CPU ranks? My computer has an 8-core, 16-thread AMD CPU and an Nvidia RTX 3080.

Thanks.

WillT · 26 August 2021 13:43

On the GPU side of things, we would recommend one MPI rank per GPU, regardless of which GPU-supporting backend you use.

On the CPU side, that test case is quite small and there is a non-zero overhead associated with more MPI ranks. If there isn’t sufficient work to do, it can be hard to hide the communication latency. This is somewhat related to Amdahl's law.
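As a rough illustration of the Amdahl's law point, here is a minimal Python sketch of the ideal speedup formula. The parallel fraction of 0.5 is a made-up number for illustration, not a measurement of this case, and the formula ignores communication cost entirely, so it only shows why speedup flattens out. It does not model a slowdown.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup under Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical parallel fraction, chosen only to show the trend.
    for workers in (1, 2, 4, 8):
        print(workers, round(amdahl_speedup(0.5, workers), 2))
```

With half of the run time serial, the speedup can never reach 2x no matter how many ranks are added, which is why a small case gains little from more ranks.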

verseau · 27 August 2021 01:00

Thanks, WillT. I understand the communication latency for small problems, but it’s really surprising that the run time went from 10 seconds up to many hours.

WillT · 27 August 2021 02:09

Sorry, I didn’t look too closely at the CPU run times. It is surprising that they went up quite that much, but it isn’t that concerning. Given that you are running on a single CPU, at most you’ll want one MPI rank per NUMA node. So, taking a guess at the specifics of your CPU, either 1 or 2 MPI ranks will be optimal for you. When running on CPU, PyFR uses a classic hybrid MPI/OpenMP model; the result, I think, is that all the ranks will be fighting over the same threads, hence the poor performance.
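To put a rough number on the contention described above, here is a small Python sketch that counts threads against hardware threads. It assumes each rank spawns one OpenMP thread per hardware thread, which is the usual OpenMP default when OMP_NUM_THREADS is unset; whether that is what actually happened in these runs is a guess. The function names are hypothetical and the figure of 16 hardware threads comes from the original post.

```python
def total_threads(mpi_ranks: int, threads_per_rank: int) -> int:
    """Total number of software threads launched across all ranks."""
    return mpi_ranks * threads_per_rank


def oversubscription(mpi_ranks: int, threads_per_rank: int,
                     hardware_threads: int) -> float:
    """Ratio of software threads to hardware threads (1.0 means fully used)."""
    return total_threads(mpi_ranks, threads_per_rank) / hardware_threads


if __name__ == "__main__":
    hardware_threads = 16  # 8 cores / 16 threads, from the original post
    # Assumption: every rank spawns one thread per hardware thread.
    threads_per_rank = hardware_threads
    for ranks in (2, 4, 8):
        ratio = oversubscription(ranks, threads_per_rank, hardware_threads)
        print(f"{ranks} ranks -> {total_threads(ranks, threads_per_rank)} "
              f"threads, {ratio:.1f}x oversubscribed")
```

Under that assumption, 8 ranks would put 128 threads on 16 hardware threads. Capping the thread count per rank with the standard OMP_NUM_THREADS environment variable, so that ranks times threads does not exceed the hardware thread count, is the usual way to avoid this, though I have not checked that it resolves this particular case.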

I will caveat this with the statement that there are people on this forum who know more about the CPU backend than I do, so if I’ve got something wrong here I’m sure they’ll jump in.

fdw · 27 August 2021 11:26

Not really. You are launching a large number of MPI ranks each of which will spawn some number of threads to work on a tiny (several KB) data set. As such you’ll get cache line thrashing as the cores/threads fight over the same bit of memory.

Regards, Freddie.
