TGV Performance Numbers

I was recently asked how Ampere GPUs compare to CPUs. I have previously given some performance numbers, but decided to put together more of a benchmark case for comparison.

I ran a 40^3 hex mesh with p=3, which gives about 4.1 million solution points. The case was, of course, the TGV, and the ini file is below. Given that we aren’t interested in the physics here, I only ran it to t=5.

These results were obtained with PyFR v1.10 on a single A100, using CUDA 11.0.2 and GCC 9.2.0.

                 Single              Double
Runtime [s]      2069.735            3183.357
DoF              (40\times4)^3       (40\times4)^3
RHS              2\times10^5         2\times10^5
No. GPUs         1                   1
s/DoF/RHS/GPU    2.53\times10^{-9}   3.89\times10^{-9}
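For anyone reproducing these numbers, the metric in the last row is just runtime divided by solution points, RHS evaluations, and GPU count; a quick Python sketch using the values from the table:

```python
# Reproduce the s/DoF/RHS/GPU metric from the table above.
p = 3
n_elem = 40**3                # 40^3 hex elements
dof = n_elem * (p + 1)**3     # (40*4)^3 solution points (per variable)
rhs = 2 * 10**5               # number of RHS evaluations
n_gpu = 1

def cost(runtime):
    """Seconds per DoF per RHS evaluation per GPU."""
    return runtime / (dof * rhs * n_gpu)

print(cost(2069.735))  # single precision, ~2.53e-9
print(cost(3183.357))  # double precision, ~3.89e-9
```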

Below is the ini file for double precision and a link to the mesh.

If anyone else wants to run the case for comparison, I would be interested to see the performance numbers; I only really have access to GPU-oriented machines.

[backend]
precision = double
rank-allocator = linear

[backend-cuda]
device-id = local-rank
# gimmik-max-nnz is not required in later pyfr versions
gimmik-max-nnz = 512
mpi-type = cuda-aware
block-1d = 64
block-2d = 128

[constants]
gamma = 1.4
mu = 6.25e-4
Pr = 0.71
Ps = 111.607

[solver]
system = navier-stokes
order = 3
anti-alias = none
viscosity-correction = none
shock-capturing = none

[solver-time-integrator]
scheme = rk4
controller = none
tstart = 0
tend = 5.01
dt = 1e-4

[solver-interfaces]
riemann-solver = rusanov
# These ldg settings will hit the interface communication a little harder
ldg-beta = 0.0
ldg-tau = 0.0

[solver-interfaces-quad]
flux-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre

[solver-elements-hex]
soln-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre

[soln-plugin-nancheck]
nsteps = 10

[soln-plugin-integrate]
nsteps = 50
file = integral_fp64.csv
header = true
vor1 = (grad_w_y - grad_v_z)
vor2 = (grad_u_z - grad_w_x)
vor3 = (grad_v_x - grad_u_y)

int-E = rho*(u*u + v*v + w*w)
int-enst = rho*(%(vor1)s*%(vor1)s + %(vor2)s*%(vor2)s + %(vor3)s*%(vor3)s)

[soln-plugin-writer]
dt-out = 5
# depending on the environment you might want to change this
basedir = .
basename = nse_fp64_tgv_3d_p3-{t:.2f}

[soln-ics]
rho = 1
u = sin(x)*cos(y)*cos(z)
v = -cos(x)*sin(y)*cos(z)
w = 0
p = Ps + (1.0/16.0)*(cos(2*x) + cos(2*y))*(cos(2*z + 2))
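As a side note, the velocity field in [soln-ics] is divergence-free, which is easy to sanity-check numerically; a small pure-Python sketch (the sample point is arbitrary):

```python
import math

# TGV initial velocity field from [soln-ics]; w = 0, so it contributes
# nothing to the divergence.
u = lambda x, y, z: math.sin(x) * math.cos(y) * math.cos(z)
v = lambda x, y, z: -math.cos(x) * math.sin(y) * math.cos(z)

def divergence(x, y, z, h=1e-6):
    """Central-difference estimate of du/dx + dv/dy at a point."""
    dudx = (u(x + h, y, z) - u(x - h, y, z)) / (2 * h)
    dvdy = (v(x, y + h, z) - v(x, y - h, z)) / (2 * h)
    return dudx + dvdy

print(abs(divergence(0.3, 0.7, 1.1)))  # ~0, up to rounding error
```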

Google drive link to gmsh mesh file


FYI the runtime and RHS numbers come from the solution file, using a command similar to: $ h5dump -d stats nse_tgv_3d_p3-5.00.pyfrs
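If you prefer to stay in Python, the stats record is an ini-formatted string that can be read with h5py and parsed with configparser; a sketch, assuming a stats layout like the example below (the field names here are placeholders and may vary between PyFR versions):

```python
import configparser

# In practice you would pull the string out with h5py, e.g.
#   stats = h5py.File('nse_tgv_3d_p3-5.00.pyfrs')['stats'][()].decode()
# Here we parse a made-up example of what h5dump prints (field names
# are placeholders):
stats = """
[solver-time-integrator]
nsteps = 50100
nfevals = 200400
wall-time = 3183.357
"""

cfg = configparser.ConfigParser()
cfg.read_string(stats)

sect = cfg['solver-time-integrator']
rhs = sect.getint('nfevals')
runtime = sect.getfloat('wall-time')

# Per-GPU cost for the (40*4)^3 solution-point case on one GPU
metric = runtime / (160**3 * rhs)
print(metric)
```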


I am unsure about these results. For reference, running with a hex mesh on a V100, Semih and I have found:

[chart: V100 throughput in GDoF/s]

where the axis is in GDoF/s, which corresponds to the reciprocal of your numbers in ns. I am also unsure about the factor of 4 in the DoF calculation (should it not be 5?)

Regards, Freddie.

The 4 is from p+1. When you do your calculation, do you count each variable as a DoF? If so, this should be 5\times(40\times4)^3.

Ah, you’re running p = 3, not p = 4. However, I am still puzzled. My V100 p = 3 results are at 1.61 GDoF/s, including the factor of 5 from the variables. Converting to your units we have 1/(1.61e9 / 5) = 3.1e-9. Thus a V100 is comfortably outperforming an A100 here.

Regards, Freddie.

Hmm, did you get 1.61 GDoF/s for this case?

Yes, which is in line with what ZEFR gets (see Table 6 of https://www.sciencedirect.com/science/article/pii/S0010465520300229).

Regards, Freddie.

Interesting, on a V100 with single precision I get a runtime of 3123.431 s, giving 3.81 ns/DoF/RHS.

So in your units I get:

  • A100 FP32: 1.98 GDoF/s
  • V100 FP32: 1.31 GDoF/s

On our V100 machine (single GPU) we obtain the following for the TGV test case with RK45 time integration and double precision (data measured in ns/(Nvar·DoF)/RHS/GPU):

[image: measured ns/(Nvar·DoF)/RHS/GPU per polynomial order]

which is in line with the ZEFR data Freddie cited (they obtain 0.6 at p=3 and 0.65 at p=4). If I transform my data units to yours (i.e. by multiplying by 5) I obtain 3.35 ns/DoF/RHS/GPU. This indicates that V100 GPUs might be more performant than A100 GPUs in this case.

PS: Are 2e5 RHS evaluations needed to measure the performance? Averaging over a few hundred RHS evaluations after an arbitrary warm-up period was sufficient in my tests. BTW, does your runtime measurement include the initialization process? (It might be negligible since you performed 2e5 RHS evaluations.) A while ago I wrote a small performance plugin which measures the same performance parameter after a given number of time steps; I might open a small PR with it in the future.


I’d like to know what the RHS represents in this case. Is it the initial vorticity? How is it calculated?

RHS is what we use to refer to the spatial scheme. In this case, as we are using Navier–Stokes, this comes from the baseadvecdiff system class. See here.

We use this notation as you typically see a PDE written as:

\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = s

which, to apply the method of lines, we rearrange to

\frac{\partial u}{\partial t} = -\frac{\partial f(u)}{\partial x} + s

The right-hand side in PyFR is handled using FR.
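To make the method-of-lines picture concrete, here is a toy sketch (not PyFR code): the spatial scheme is wrapped up in a single rhs(u) function, and the time integrator only ever calls that:

```python
import math

# Toy ODE du/dt = rhs(u) with rhs(u) = -u, exact solution u0*exp(-t).
# In PyFR, rhs would instead be the full FR evaluation of -df(u)/dx + s.
def rhs(u):
    return -u

def rk4_step(u, dt):
    """One classical RK4 step; note it costs four rhs evaluations."""
    k1 = rhs(u)
    k2 = rhs(u + 0.5 * dt * k1)
    k3 = rhs(u + 0.5 * dt * k2)
    k4 = rhs(u + dt * k3)
    return u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

u, dt = 1.0, 0.01
for _ in range(100):       # integrate to t = 1
    u = rk4_step(u, dt)
print(u, math.exp(-1))     # both ~0.3679
```

This is also where the RHS counts in the tables come from: rk4 makes four RHS evaluations per step, so roughly 5\times10^4 steps at dt = 1e-4 gives the quoted 2\times10^5 evaluations.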

In other news, Freddie and I got to the bottom of the different performance numbers. In the ini file I provided, I was both using the integrate plugin and frequently running the NaN checker. As the NaN check transfers the solution from the GPU to the CPU via PCIe to then do a summation, this was really hurting performance. When I have some time in the next few days I will re-run the A100 tests.

I reran the case for slightly less time (tend = 1), and also turned off the integrate and NaN checker plugins.

Here are the results for an A100 at p=3, for a 40^3 element mesh.

                 Single              Double
Runtime [s]      216.256             355.794
DoF              (40\times4)^3       (40\times4)^3
RHS              4\times10^4         4\times10^4
No. GPUs         1                   1
s/DoF/RHS/GPU    1.320\times10^{-9}  2.172\times10^{-9}

Converting these to Freddie’s units gives:

  • FP32: 3.788 GDoF/s
  • FP64: 2.302 GDoF/s
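The two conventions used in this thread are reciprocals of one another, up to the factor of five flow variables; a small conversion sketch using the numbers above:

```python
NVARS = 5  # rho, rho*u, rho*v, rho*w, E for 3D Navier-Stokes

def to_gdofs(s_per_dof_per_rhs):
    """Convert s/DoF/RHS (per solution point) to GDoF/s (all variables)."""
    return NVARS / s_per_dof_per_rhs / 1e9

print(to_gdofs(1.320e-9))  # A100 FP32, ~3.79 GDoF/s
print(to_gdofs(2.172e-9))  # A100 FP64, ~2.30 GDoF/s
```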

So the A100, as expected, compares favourably with the V100; but given that the memory bandwidth is ~70% higher on the A100, I would’ve expected more.

Now either the case is too small to keep the GPU busy, or some kernel isn’t performing well on the new hardware. I have some more data on this, but it is getting a little out of scope for this topic, so I will start a new thread in due course to discuss it further, if people are interested.

The moral of the story here is that the plugins can have significant overhead.
(We might want to think about moving the NaN checker onto the device for GPU backends to avoid the PCIe bottleneck.)


Regarding CPU backends for PyFR: given the current cost of an A100, I got to wondering how many 32-core Zen CPUs it would take to match a single A100. For unsteady compressible NS, if we also factor in the larger time steps available to implicit steppers, benchmarking gets tricky, but interesting!

Yeah, I’d like to know how their performance stacks up. Sadly, I don’t have access to any AMD CPU hardware, but shortly I should be able to share some MI100 performance numbers. If you’re interested, in the meantime you might like to take a look at this, where I show some performance numbers for the Mac M1 SoC.

At the end of the day, though, PyFR is most often memory-bandwidth bound, so a simple ratio of bandwidths will give you a reasonable first-order approximation of peak performance.
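As a rough illustration of that, taking nominal spec-sheet bandwidths (these figures are my assumptions, and achieved bandwidth will be lower on both devices):

```python
# Nominal peak memory bandwidths in GB/s (spec-sheet figures; achieved
# bandwidth, e.g. from a STREAM-like benchmark, will be somewhat lower).
bw = {'V100': 900, 'A100': 1555}

# Measured V100 FP64 throughput from earlier in the thread, in GDoF/s.
v100_fp64 = 1.61

# First-order estimate for the A100 by bandwidth ratio alone.
est_a100 = v100_fp64 * bw['A100'] / bw['V100']
print(est_a100)  # ~2.78 GDoF/s predicted
```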


Will,
The M1 numbers are seriously impressive for a sub 15W chip.
A couple of questions:

  1. What is the measured power usage for the A100 during the run (from nvidia-smi)?
  2. What Apple model are you running? MacBook Air? Pro? Mac Mini? iMac?

In regards to CPU performance, I will try to set up the cases on the hardware I have access to. We have a dual-socket 32-core AMD Rome CPU node, as well as dual-socket 18-core Intel Cascade Lake CPU nodes.
Intel will be able to utilize libxsmm and AVX-512, while AMD will have the core-count advantage. It will be an interesting comparison.


From nvidia-smi the power draw was in the range of 230-240 W (out of a max of 250 W), but that doesn’t account for cooling etc.; it is purely the device. The temperature of the GPU before I started was approx. 30°C and stabilised at 64-65°C. So a significant amount of heat has to be dissipated, and the ambient temperature of Texas, where the machine is located, won’t be helping.

The M1 results were obtained on a MacBook Pro 16GB, with the only real difference for this application being that this model has active cooling.

If you could run this benchmark, I would be interested to know what the performance is on the hardware you have access to.

Some things to note: Freddie recently made a PR to libxsmm that might help tensor-product element performance here. Also, Semih has been doing some great cache-blocking work for CPUs, which will help things, but that isn’t quite mainline-ready yet.


Sorry for the late response here. We had a memory configuration issue with our AMD node that was killing performance.

I ran tests on a dual-socket AMD EPYC 7H12 (64c, 8 memory channels) node, which is the top-of-the-line Rome SKU. For reference, this SKU is claimed to come within about 15% of the performance of the top-bin Milan CPUs. I also tested on an Intel Xeon Gold 6240 (18c, 6 memory channels) node.

Both installations used PyFR v1.12.2, GCC v10.2.0, libxsmm (at the commit suggested in the performance guide), Open MPI v4.1.1, and UCX v1.11.0. All packages were compiled from source using Spack and built to target the specific microarchitecture.

All tests were run up to t=1.0 with the NaN checker and integrate plugins disabled, using 1 MPI rank per socket.

  1. AMD EPYC 7H12

                 Single              Double
Runtime [s]      943                 2058
DoF              (40\times4)^3       (40\times4)^3
RHS              4\times10^4         4\times10^4
No. CPUs         2                   2
s/DoF/RHS/CPU    2.88\times10^{-9}   6.28\times10^{-9}

  2. Intel Xeon Gold 6240

                 Single              Double
Runtime [s]      1740                3090
DoF              (40\times4)^3       (40\times4)^3
RHS              4\times10^4         4\times10^4
No. CPUs         2                   2
s/DoF/RHS/CPU    5.31\times10^{-9}   9.43\times10^{-9}

Summarizing the results, the AMD Rome (Zen 2) node outperforms the Intel Cascade Lake node by ~50% at DP and ~100% at SP for this case. As AMD holds a 3.5X core-count advantage as well as 33% more memory channels, I expected it to outperform the Intel node by more. It is possible that further optimizations, such as using AMD’s own AOCC compiler and finely tuned OpenMP thread placement, could push AMD further ahead.

Compared to the A100 GPU, and assuming ideal scaling, ~8 Zen 2 CPUs (4 dual-socket nodes) would be needed to match the SP performance of a single A100 card.
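The ~8 figure can be checked from the single-precision runtimes above, assuming ideal scaling:

```python
# Single-precision wall times for the 40^3, p=3 case run to t=1.
a100_time = 216.256       # one A100
epyc_node_time = 943.0    # dual-socket EPYC 7H12 node (2 CPUs)

# Assuming ideal scaling, one CPU alone would take twice the node time,
# so the number of CPUs needed to match one A100 is:
n_cpus = 2 * epyc_node_time / a100_time
print(n_cpus)  # ~8.7, i.e. roughly 8-9 CPUs (4-5 dual-socket nodes)
```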

@WillT, could you re-verify the conclusions from your M1 performance blog post? My quick calculation shows that 21 M1 chips would be needed to match the SP performance of an A100, not 4.6. I could be mistaken of course. Thanks.

Mike


@mlaufer, yep, you are quite right on the M1 chip numbers. I previously made an error but clearly forgot to propagate the correction to that number. It should be ~21 M1s to one A100.

The AMD performance is held back by the lack of support for AVX-512. Specifically, GCC is unable (or unwilling) to vectorise the interface flux kernel on AVX2 CPUs. This represents a big hit, especially at single precision where the vectorisation is even more important. Additionally, libxsmm only has limited support for sparse kernels on AVX2 systems (and the only reason it exists at all is because I backported the AVX-512 code). For the TGV case at a reasonable order this might not be an issue, however certainly at higher orders libxsmm is forced to fall back to dense kernels. It may therefore be worth trying GiMMiK on the AMD system.

Some care is also needed around how threads are placed. For the AMD CPUs you probably want one MPI rank per CCX (4 cores) with appropriate pinning.

Regards, Freddie.
