TGV Performance Numbers

mlaufer · 24 August 2021 19:58

Sorry for the late response here. We had a memory configuration issue with our AMD node that was killing performance.

I ran tests on a dual-socket AMD Epyc 7H12 (64c, 8 memory channels) node, which is the top-of-the-line Rome SKU. For reference, this SKU is claimed to approach the performance of the top-bin Milan CPUs (within about 15%). I also tested on an Intel Xeon Gold 6240 (18c, 6 memory channels) node.

Both installations used PyFR v1.12.2, GCC v10.2.0, libxsmm (at the commit suggested in the performance guide), Open MPI v4.1.1, and UCX v1.11.0. All packages were compiled from source using Spack and built to target the specific microarchitecture.

All tests were run up to t=1.0 with the NaN checker and Integration plugins disabled. One MPI rank was used per socket.

AMD EPYC 7H12

	Single	Double
Runtime [s]	943	2058
DoF	(40\times4)^3	(40\times4)^3
RHS	4\times10^4	4\times10^4
No. CPUs	2	2
s/DoF/RHS/CPU	2.88\times10^{-9}	6.28\times10^{-9}

Intel Xeon Gold 6240

	Single	Double
Runtime [s]	1740	3090
DoF	(40\times4)^3	(40\times4)^3
RHS	4\times10^4	4\times10^4
No. CPUs	2	2
s/DoF/RHS/CPU	5.31\times10^{-9}	9.43\times10^{-9}
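For anyone wanting to reproduce the last row of both tables: the numbers are consistent with the runtime divided by the DoF count, the number of RHS evaluations, and the number of CPUs. Below is a minimal Python sketch under that assumption. The function and variable names are my own for illustration and are not part of PyFR.

```python
def cost_per_dof(runtime_s, dof, rhs_evals, n_cpus):
    """Runtime normalised by DoF, RHS evaluations and CPU count.

    Assumed formula, inferred from the tabulated values.
    """
    return runtime_s / (dof * rhs_evals * n_cpus)


# Values taken directly from the tables.
DOF = (40 * 4) ** 3  # 4,096,000
RHS = 4e4
N_CPUS = 2

RUNTIMES = {
    ("AMD EPYC 7H12", "single"): 943,
    ("AMD EPYC 7H12", "double"): 2058,
    ("Intel Xeon Gold 6240", "single"): 1740,
    ("Intel Xeon Gold 6240", "double"): 3090,
}

COSTS = {
    key: cost_per_dof(runtime, DOF, RHS, N_CPUS)
    for key, runtime in RUNTIMES.items()
}

for (node, precision), cost in COSTS.items():
    print(f"{node:22s} {precision:7s} {cost:.2e} s/DoF/RHS/CPU")
```

Swapping in a different runtime or CPU count gives the same metric for another machine, provided the mesh and number of RHS evaluations are unchanged.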

Summarizing the results, the AMD Rome (Zen2) node outperforms the Intel Cascade Lake node by ~50% at DP (3090 s vs 2058 s) and ~85% at SP (1740 s vs 943 s) for this case. As AMD holds a 3.5X core count advantage as well as 33% more memory channels, I expected it to outperform the Intel node by a wider margin. It is possible that further optimizations, such as using AMD's own AOCC compiler or finely tuned OpenMP thread placement, could push AMD further ahead.
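As a quick sanity check on the comparison, here is a small Python sketch that derives the node-to-node ratios from the runtimes, core counts and memory channel counts listed in this post. The dictionary layout and names are illustrative only.

```python
# Per-node figures copied from this post; core and channel counts are per socket.
AMD = {"single": 943, "double": 2058, "cores": 64, "channels": 8}
INTEL = {"single": 1740, "double": 3090, "cores": 18, "channels": 6}

# Speedup of the AMD node over the Intel node (ratio of runtimes).
sp_speedup = INTEL["single"] / AMD["single"]
dp_speedup = INTEL["double"] / AMD["double"]

# Hardware advantage of the AMD node.
core_ratio = AMD["cores"] / INTEL["cores"]
channel_ratio = AMD["channels"] / INTEL["channels"]

print(f"SP speedup:           {sp_speedup:.2f}x")
print(f"DP speedup:           {dp_speedup:.2f}x")
print(f"Core count ratio:     {core_ratio:.2f}x")
print(f"Memory channel ratio: {channel_ratio:.2f}x")
```

The measured speedups sit closer to the memory channel ratio than to the core count ratio, which might point to the case being limited by memory bandwidth rather than core count, though I have not verified that.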

Compared to the A100 GPU, assuming ideal scaling, ~8 Zen2 CPUs (4 dual-socket nodes) would be needed to match the SP performance of a single A100 card.

@WillT, could you re-verify the conclusions from your M1 performance blog post? My quick calculations show that 21 M1 chips would be needed to match the SP performance of an A100, not 4.6. I could be mistaken, of course. Thanks.

Mike
