I’m a new user of PyFR. I started by running the 2D benchmark cases provided with the source files and then moved up to the supplementary materials of journal publications.
I’m currently running the 3D sd7003 case, files are attached. At first, I use p1 to initialize my domain. Then, starting at t=25s, I switch to p3 with the multi-p method until t=32s. Lastly, statistics are gathered until t=45s. A colleague ran the same case using 2 V100 GPUs and it took him about 4 days for the full simulation.
I am using 9 nodes of 2x Intel Xeon 6540 (18 cores per CPU, 2 CPUs per node, 324 cores total) and am therefore running with the OpenMP backend. From previous posts, I’ve read that PyFR runs best with one MPI rank per CPU, so I am running with one rank per CPU and OMP_NUM_THREADS=18.
The p1 case (sd7003_1.ini) took about 24 hours to run. However, after 36 hours of running the first p3 case (sd7003_2.ini), the estimated time is about 10 days. Extrapolating, this would mean the entire run would take about 28 days.
I don’t have experience running with GPUs, but I was expecting each GPU to match the performance of 3-4 CPUs. The above indicates a ratio of about 60, which does not seem right.
- I know PyFR runs very well on GPUs but are these results to be expected?
- I’ve been running a case that was meant to run on GPUs. Perhaps other numerical methods are more adapted to CPUs / OpenMP backend?
sd7003_1.ini (2.6 KB)
sd7003_2.ini (2.68 KB)
sd7003_3.ini (2.67 KB)
It looks like you are using the incompressible solver.
Was your colleague also using this solver (just so we can be sure that we are comparing apples with apples)?
Yes, for comparison purposes, the only thing I changed was switching to the OpenMP backend.
So as a starting point it might be worth doing a comparison of expected theoretical peak FLOPs and bandwidth between the two platforms (2 x V100s vs 9 x Xeons). There is some information here that might serve as a starting point:
My rough guess is that you will have double the theoretical peak FLOPS from the 2 x V100s cf. the 9 x Xeons, but that’s just a guess.
Another thing to do would be to benchmark your BLAS install and check it is performing optimally.
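If NumPy is linked against the same BLAS, a quick sanity check of DGEMM throughput is possible from Python. This is only a rough sketch; the matrix size and repeat count are arbitrary choices, and the resulting figure should be compared against published numbers for your CPU:

```python
# Rough DGEMM benchmark via NumPy, which dispatches to whatever BLAS
# it is linked against (check with numpy.show_config()).
import time
import numpy as np

n = 2048
a = np.random.rand(n, n)
b = np.random.rand(n, n)

a @ b  # warm-up so one-time initialisation does not skew the timing

reps = 5
start = time.perf_counter()
for _ in range(reps):
    a @ b
elapsed = time.perf_counter() - start

# A dense n x n matrix multiply costs roughly 2*n^3 floating-point ops
gflops = 2 * n**3 * reps / elapsed / 1e9
print(f"{gflops:.1f} GFLOP/s")
```

If the number comes out far below the per-socket peak, the BLAS install (or its threading) is a likely suspect.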
Thanks for the links and pointers.
From the information provided by microway:
- 9x Intel 6540 = 11.25 TFlops (CPU taken at median flops)
- 2x V100 = 14-16 TFlops.
So theoretically, the 2 GPUs should offer better performance, but not as much as I’ve experienced. The issue lies somewhere else.
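For the record, the mismatch works out roughly as follows. The 625 GFLOP/s per CPU below is just the quoted 11.25 TFLOP/s total divided back out over 18 sockets, and 7.0 TFLOP/s is NVIDIA's quoted FP64 peak for the PCIe V100 (the SXM2 part is closer to 7.8):

```python
# Back-of-the-envelope FP64 peak comparison using the figures quoted above
n_cpus = 9 * 2                    # 9 nodes, 2 sockets each
cpu_total = n_cpus * 625 / 1000   # TFLOP/s, from the quoted 11.25 total
gpu_total = 2 * 7.0               # TFLOP/s, 2x V100 (up to ~15.6 for SXM2)

speedup_theoretical = gpu_total / cpu_total  # only ~1.2-1.4x
speedup_observed = 28 / 4                    # ~7x from the runtimes above
print(speedup_theoretical, speedup_observed)
```

So even on raw peak FLOPS the GPUs should only be a little faster, nowhere near the observed gap.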
I’ll start profiling and see if the MPI isn’t an issue (shouldn’t be with only 18 ranks). I’ll also benchmark my BLAS to see how it performs with respect to other measurements found online. From what I understand, as PyFR is written in Python, it heavily relies on BLAS for compute performance.
Thanks for the help,
So a few things to check. First is the compiler. Sometimes I've got better results with ICC than GCC (but always be sure to use the latest version). Secondly, I think that this case (where anti-aliasing is disabled) is limited not by FLOP/s but by memory bandwidth. Thus PyFR will probably be using GiMMiK rather than vendor BLAS on both platforms. On CPUs one thing you can do to improve performance is to make libxsmm available on the shared library path. If available, PyFR will call into this for sparse (and dense) BLAS and it tends to outperform everything else.

Another thing to check is that the OpenMP threads are not all getting pinned to the same core. This can happen with some combinations of OpenMP runtimes and MPI libraries. One thing you might want to try here is running one MPI rank per core (with OMP_NUM_THREADS=1) and seeing if this makes a difference.
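On Linux one quick way to spot broken pinning is to print the affinity mask seen by each thread in a process launched under your MPI runtime. The sketch below uses Python threads rather than the OpenMP runtime's threads, but an over-restrictive mask inherited from the MPI launcher shows up the same way (every thread reporting the same single core):

```python
# Print the CPU affinity mask seen by each of a handful of threads.
# If all threads report the same single core, pinning has gone wrong;
# a healthy launch shows each rank's threads spread over its cores.
import os
import threading

def report(results, idx):
    # On Linux, pid 0 means "the calling thread" for sched_getaffinity
    results[idx] = os.sched_getaffinity(0)

nthreads = 4  # stand-in for OMP_NUM_THREADS
results = [None] * nthreads
threads = [threading.Thread(target=report, args=(results, i))
           for i in range(nthreads)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for i, mask in enumerate(results):
    print(f"thread {i}: cores {sorted(mask)}")
```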
Thanks for the pointers.
Profiling showed that the MPI communication was actually the bottleneck. Apparently, the communication was going through the (very slow) Ethernet port by default, slowing down the simulation overall. After specifying the correct interface (InfiniBand), the simulation got about 5-6 times faster. MPI time fell back to a reasonable 15%, mostly waiting for Isend/Irecv to complete.
I’ve started simulating the same geometry but with the compressible formulation, as it runs much faster. I ran the case (input file attached) with a series of combinations of MPI ranks (from 1 per core to 1 per CPU) and OpenMP threads. I found that this had essentially no effect on the wall time, even though the MPI comm time increases with the number of MPI ranks (from 11% to 32%).
sd7003_2.ini (1.61 KB)