TGV Performance Numbers

Sorry for the late response here. We had a memory configuration issue with our AMD node that was killing performance.

I ran tests on a dual-socket AMD EPYC 7H12 (64c, 8 memory channels) node, which is the top-of-the-line Rome SKU. For reference, this SKU is claimed to approach the performance of the top-bin Milan CPUs (within about 15%). I also tested on an Intel Xeon Gold 6240 (18c, 6 memory channels) node.

Both installations used PyFR v1.12.2, GCC v10.2.0, libxsmm (at the commit suggested in the performance guide), Open MPI v4.1.1, and UCX v1.11.0. All packages were compiled from source using Spack and built to target the specific microarchitecture.

All tests were run up to t=1.0 with the NaN checker and Integration plugins disabled. One MPI rank per socket was used.

1. AMD EPYC 7H12

| | Single | Double |
|---|---|---|
| Runtime [s] | 943 | 2058 |
| DoF | $(40\times4)^3$ | $(40\times4)^3$ |
| RHS | $4\times10^4$ | $4\times10^4$ |
| No. CPUs | 2 | 2 |
| s/DoF/RHS/CPU | $2.88\times10^{-9}$ | $6.28\times10^{-9}$ |

2. Intel Xeon Gold 6240

| | Single | Double |
|---|---|---|
| Runtime [s] | 1740 | 3090 |
| DoF | $(40\times4)^3$ | $(40\times4)^3$ |
| RHS | $4\times10^4$ | $4\times10^4$ |
| No. CPUs | 2 | 2 |
| s/DoF/RHS/CPU | $5.31\times10^{-9}$ | $9.43\times10^{-9}$ |
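
For anyone who wants to check the last row of each table, here is a minimal sketch of the normalisation in Python. The function and variable names are my own and purely illustrative (nothing here comes from PyFR), and it assumes the metric is simply the runtime divided by the product of DoF, RHS evaluations and CPU count, which is what the tabulated values correspond to.

```python
# Illustrative only: names below are hypothetical, not part of PyFR.

def cost_per_dof_rhs_cpu(runtime_s, dof_per_dir, n_rhs, n_cpus):
    """Return runtime / (DoF * RHS * CPUs), with DoF = dof_per_dir**3."""
    dof = dof_per_dir ** 3
    return runtime_s / (dof * n_rhs * n_cpus)


DOF_PER_DIR = 40 * 4   # (40 x 4) per direction, as in the tables
N_RHS = 4e4            # RHS evaluations, as in the tables
N_CPUS = 2             # one dual-socket node

# Wall-clock runtimes in seconds, taken from the tables
runtimes = {
    ("AMD EPYC 7H12", "single"): 943,
    ("AMD EPYC 7H12", "double"): 2058,
    ("Intel Xeon Gold 6240", "single"): 1740,
    ("Intel Xeon Gold 6240", "double"): 3090,
}

for (cpu, precision), runtime in runtimes.items():
    cost = cost_per_dof_rhs_cpu(runtime, DOF_PER_DIR, N_RHS, N_CPUS)
    print(f"{cpu:22s} {precision:6s} {cost:.2e} s/DoF/RHS/CPU")
```

Swapping in a different runtime, mesh size or CPU count gives the same figure of merit for another machine.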

In summary, the AMD Rome (Zen 2) node outperforms the Intel Cascade Lake node by ~50% in DP and ~85% in SP for this case. Since AMD has a ~3.5x core count advantage and 33% more memory channels, I expected it to outperform the Intel node by a wider margin. Further optimizations, such as using AMD's own AOCC compiler and finely tuned OpenMP thread placement, might push AMD further ahead.
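
The node-to-node comparison can be recomputed directly from the measured runtimes and the hardware specs quoted earlier. This is a rough sketch with hypothetical helper names, and it treats "outperforms by X%" as the ratio of runtimes minus one, which is my assumption about how the comparison is meant.

```python
# Illustrative only: helper names are hypothetical.

def speedup_percent(t_slow, t_fast):
    """Percentage by which the faster run beats the slower one."""
    return (t_slow / t_fast - 1.0) * 100.0


# Runtimes in seconds from the tables (sp = single, dp = double precision)
runtime_s = {
    "amd": {"sp": 943, "dp": 2058},
    "intel": {"sp": 1740, "dp": 3090},
}

for precision in ("sp", "dp"):
    gain = speedup_percent(runtime_s["intel"][precision],
                           runtime_s["amd"][precision])
    print(f"AMD vs Intel at {precision.upper()}: {gain:.0f}% faster")

# Hardware ratios per socket: 64 vs 18 cores, 8 vs 6 memory channels
core_ratio = 64 / 18
channel_ratio = 8 / 6
print(f"Core count ratio:     {core_ratio:.2f}x")
print(f"Memory channel ratio: {channel_ratio:.2f}x")
```

Comparing the measured gains against the core and memory channel ratios shows how far the observed speedup falls short of what the hardware specs alone might suggest.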

Compared to the A100 GPU, and assuming ideal scaling, ~8 Zen 2 CPUs (4 dual-socket nodes) would be needed to match the SP performance of a single A100 card.

@WillT, could you re-verify the conclusions from your M1 performance blog post? My quick calculation shows that 21 M1 chips would be needed to match the SP performance of an A100, not 4.6. I could be mistaken, of course. Thanks.

Mike
