TGV Performance Numbers

I was recently asked how Ampere GPUs compare to CPUs. I have previously given some performance numbers, but decided to put together more of a benchmark case for comparison.

I ran a 40^3 hex mesh with p=3, which gives about 4.1 million solution points (40^3 elements with (p+1)^3 = 64 points each). The case I used was, of course, the TGV. Given that we aren’t interested in the physics here, I only ran it to t=5.

These results were obtained with PyFR v1.10 on a single A100, using CUDA 11.0.2 and GCC 9.2.0.

                Single             Double
Runtime [s]     2069.735           3183.357
DoF             (40\times4)^3      (40\times4)^3
RHS             2\times10^5        2\times10^5
No. GPUs        1                  1
s/DoF/RHS/GPU   2.53\times10^{-9}  3.89\times10^{-9}
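
For reference, the figure in the last row is just runtime divided by (DoF × RHS × number of GPUs). A quick sketch of the arithmetic in Python, using the double-precision numbers above:

# s/DoF/RHS/GPU for the double-precision run
runtime = 3183.357            # wall time [s]
ndof = 40**3 * (3 + 1)**3     # 40^3 hexes, (p+1)^3 points each = 4,096,000
nrhs = 2 * 10**5              # RHS evaluations
ngpus = 1

print(f'{runtime / (ndof * nrhs * ngpus):.2e}')  # -> 3.89e-09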

Below is the ini file for double precision and a link to the mesh.

If anyone else wants to run the case for comparison, I would be interested to see the performance numbers; I only really have access to GPU-oriented machines.

[backend]
precision = double
rank-allocator = linear

[backend-cuda]
device-id = local-rank
# gimmik-max-nnz is not required in later pyfr versions
gimmik-max-nnz = 512
mpi-type = cuda-aware
block-1d = 64
block-2d = 128

[constants]
gamma = 1.4
mu = 6.25e-4
Pr = 0.71
Ps = 111.607

[solver]
system = navier-stokes
order = 3
anti-alias = none
viscosity-correction = none
shock-capturing = none

[solver-time-integrator]
scheme = rk4
controller = none
tstart = 0
tend = 5.01
dt = 1e-4

[solver-interfaces]
riemann-solver = rusanov
# These ldg settings will hit the interface communication a little harder
ldg-beta = 0.0
ldg-tau = 0.0

[solver-interfaces-quad]
flux-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre

[solver-elements-hex]
soln-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre

[soln-plugin-nancheck]
nsteps = 10

[soln-plugin-integrate]
nsteps = 50
file = integral_fp64.csv
header = true
vor1 = (grad_w_y - grad_v_z)
vor2 = (grad_u_z - grad_w_x)
vor3 = (grad_v_x - grad_u_y)

int-E = rho*(u*u + v*v + w*w)
int-enst = rho*(%(vor1)s*%(vor1)s + %(vor2)s*%(vor2)s + %(vor3)s*%(vor3)s)

[soln-plugin-writer]
dt-out = 5
# depending on the environment you might want to change this
basedir = .
basename = nse_fp64_tgv_3d_p3-{t:.2f}

[soln-ics]
rho = 1
u = sin(x)*cos(y)*cos(z)
v = -cos(x)*sin(y)*cos(z)
w = 0
p = Ps + (1.0/16.0)*(cos(2*x) + cos(2*y))*(cos(2*z + 2))

Google drive link to gmsh mesh file
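
As an aside, the velocity field in [soln-ics] is divergence-free by construction; if you want to sanity-check it, here is a minimal sketch with sympy:

import sympy as sp

# TGV initial velocity field from [soln-ics] above
x, y, z = sp.symbols('x y z')
u = sp.sin(x)*sp.cos(y)*sp.cos(z)
v = -sp.cos(x)*sp.sin(y)*sp.cos(z)
w = sp.Integer(0)

# The divergence du/dx + dv/dy + dw/dz should simplify to zero
print(sp.simplify(sp.diff(u, x) + sp.diff(v, y) + sp.diff(w, z)))  # -> 0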


FYI, the runtime and RHS numbers come from the stats stored in the solution file, using a command similar to: $ h5dump -d stats nse_tgv_3d_p3-5.00.pyfrs
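
If you would rather stay in Python, the same stats record can be read with h5py; a sketch, assuming the filename above (the stats entry is an ini-formatted string whose exact keys may vary between PyFR versions):

import h5py

# A .pyfrs file is HDF5; the 'stats' dataset holds ini-formatted text
# with the step and RHS evaluation counts
with h5py.File('nse_tgv_3d_p3-5.00.pyfrs', 'r') as f:
    print(f['stats'][()].decode())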


I am unsure about these results. For reference, running with a hex mesh on a V100, Semih and I have found:

[figure: measured V100 hex-mesh performance]

where the axis is in GDoF/s, which corresponds to the reciprocal of your numbers in ns. I am also unsure about the factor of 4 in the DoF calculation (should it not be 5?)

Regards, Freddie.

The 4 is from p+1. When you do your calculation, do you count each variable as a DoF? If so, this should be 5\times(40\times4)^3.

Ah, you’re running p = 3, not p = 4. However, I am still puzzled. My V100 p = 3 results are at 1.61 GDoF/s, including the factor of 5 from the variables. Converting to your units we have 1/(1.61e9/5) ≈ 3.1e-9. Thus a V100 is comfortably outperforming an A100 here.

Regards, Freddie.

Hmm, did you get 1.61 GDoF/s for this case?

Yes, which is in line with what ZEFR gets (see Table 6 of https://www.sciencedirect.com/science/article/pii/S0010465520300229).

Regards, Freddie.

Interesting: on a V100 with single precision I get a runtime of 3123.431 s, giving 3.81 ns/DoF/RHS.

So in your units I get:

  • A100 FP32: 1.98 GDoF/s
  • V100 FP32: 1.31 GDoF/s
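
For anyone following along, the conversion between the two sets of units is straightforward; a sketch, where the factor of 5 is the number of field variables for 3D Navier–Stokes:

NVARS = 5  # rho, rho*u, rho*v, rho*w, E

def to_gdofs(s_per_dof_rhs):
    # s/DoF/RHS (points only) -> GDoF/s (points x variables)
    return NVARS / s_per_dof_rhs / 1e9

print(f'{to_gdofs(2.53e-9):.2f}')  # A100 FP32 -> 1.98
print(f'{to_gdofs(3.81e-9):.2f}')  # V100 FP32 -> 1.31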

On our V100 machine (single GPU) we obtain the following for the TGV test case with RK45 time integration and double precision (data measured in ns/(Nvar * DoF)/RHS/GPU):

[figure: measured performance versus polynomial order, in ns/(Nvar * DoF)/RHS/GPU]

which is in line with the ZEFR data, as Freddie stated (they obtain 0.6 with p=3 and 0.65 with p=4). If I convert my data to your units (i.e. by multiplying by 5), I obtain 3.35 ns/DoF/RHS/GPU. This indicates that V100 GPUs might be more performant than A100 GPUs in this case.

PS: Are 2e5 RHS evaluations needed to measure the performance? Averaging over a few hundred RHS evaluations after an arbitrary warm-up period was sufficient in my tests. BTW, does your runtime measurement include the initialisation process? (It might be negligible since you have performed 2e5 RHS evaluations.) A while ago I wrote a small performance plugin which measures the same performance parameter after a given number of time step evaluations. I might write a small PR with it in the future.
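
For illustration, the measurement described above, a warm-up followed by averaging, might look something like this (a sketch only, not the actual plugin; rhs stands for whatever callable performs one right-hand-side evaluation):

import time

def time_rhs(rhs, n_warmup=50, n_sample=200):
    # Let clocks, caches, JIT compilation, etc. settle before measuring
    for _ in range(n_warmup):
        rhs()

    # Average the cost of a single RHS evaluation
    start = time.perf_counter()
    for _ in range(n_sample):
        rhs()
    return (time.perf_counter() - start) / n_sample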


I’d like to know what the RHS represents in this case. Is it the initial vorticity? How is it calculated?

RHS is what we use to refer to the spatial scheme. In this case, as we are using Navier–Stokes, this comes from the baseadvecdiff system class. See here.

We use this notation as you typically see a PDE written as:

\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = s

which, to apply the method of lines, we rearrange to

\frac{\partial u}{\partial t} = -\frac{\partial f(u)}{\partial x} + s

The right-hand side in PyFR is handled using flux reconstruction (FR).
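
To make this concrete, here is a minimal method-of-lines sketch in Python, with rhs(u) standing in for the FR spatial discretisation; note that classical RK4 calls it four times per step, which is where the RHS counts come from (e.g. ~5\times10^4 steps at dt = 1e-4 gives the ~2\times10^5 evaluations quoted above):

import numpy as np

def rk4_step(u, rhs, dt):
    # Classical RK4: four RHS evaluations per time step
    k1 = rhs(u)
    k2 = rhs(u + 0.5*dt*k1)
    k3 = rhs(u + 0.5*dt*k2)
    k4 = rhs(u + dt*k3)
    return u + (dt/6)*(k1 + 2*k2 + 2*k3 + k4)

# Toy example: du/dt = -u, whose exact solution is exp(-t)
u = np.array([1.0])
for _ in range(1000):
    u = rk4_step(u, lambda v: -v, 1e-3)
print(u)  # ~0.3679 = exp(-1)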

In other news, Freddie and I got to the bottom of the differing performance numbers. In the ini file I provided, I was both using the integrate plugin and frequently running the NaN checker. As the NaN check transfers the solution from the GPU to the CPU via PCIe in order to do a summation, this was really hurting performance. When I have some time in the next few days I will re-run the A100 tests.

I reran the case for slightly less time (tend = 1), and also turned off the integrate and NaN checker plugins.

Here are the results for an A100 at p=3, for a 40^3 element mesh.

                Single              Double
Runtime [s]     216.256             355.794
DoF             (40\times4)^3       (40\times4)^3
RHS             4\times10^4         4\times10^4
No. GPUs        1                   1
s/DoF/RHS/GPU   1.320\times10^{-9}  2.172\times10^{-9}

Converting these to Freddie’s units gives:

  • FP32: 3.788 GDoF/s
  • FP64: 2.302 GDoF/s

So the A100, as expected, compares favourably with the V100; but given that the bandwidth is ~50% higher on the A100, I would’ve expected more.

Now, either the case is too small to keep the GPU busy, or some kernel isn’t performing well on the new hardware. I have some more data on this, but it is getting a little out of scope for this topic, so I will start a new thread in due course to discuss it further if people are interested.

The moral of the story here is that the plugins can have significant overhead.
(We might want to think about moving the NaN checker onto the device for GPU backends to avoid the PCIe bottleneck.)
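
To illustrate the device-side idea (a sketch with CuPy, not how PyFR actually implements its NaN check): keep the reduction on the GPU so that only a single flag, rather than the whole solution, crosses the PCIe bus:

import cupy as cp

def device_nan_check(soln):
    # Reduce on the GPU; bool() transfers one scalar, not the array
    return bool(cp.isnan(soln).any())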


Regarding CPU backends for PyFR: given the current cost of an A100, I got to wondering how many 32-core Zen CPUs it would take to match a single A100. For unsteady compressible Navier–Stokes, if we also factor in larger time steps for implicit steppers, benchmarking gets tricky, but interesting!

Yeah, I’d like to know how their performance stacks up. Sadly, I don’t have access to any AMD CPU hardware, but shortly I should be able to let you know some MI100 performance numbers. If you’re interested, you might like to take a look at this in the meantime, where I show some performance numbers for the Mac M1 SoC.

At the end of the day, though, PyFR is most often memory-bandwidth bound, so a simple ratio of bandwidths will give you a reasonable first-order approximation of relative peak performance.
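
For example, taking the headline memory bandwidths from the vendor spec sheets (~1555 GB/s for the 40 GB A100 and ~900 GB/s for the V100):

# First-order performance estimate from memory bandwidth alone
bw_a100 = 1555  # GB/s, A100 40GB
bw_v100 = 900   # GB/s, V100
print(f'{bw_a100 / bw_v100:.2f}x')  # ~1.73x expected speedup

In practice the FP64 numbers above come out at roughly 1.4x (2.302 vs 1.61 GDoF/s), hence my earlier comment about expecting more from the A100.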


Will,
The M1 numbers are seriously impressive for a sub-15 W chip.
A couple of questions:

  1. What is the measured power usage for the A100 during the run (from nvidia-smi)?
  2. What Apple model are you running? MacBook Air? MacBook Pro? Mac mini? iMac?

Regarding CPU performance, I will try to set up the cases on the hardware I have access to. We have a dual-socket 32-core AMD Rome CPU node, as well as dual-socket 18-core Intel Cascade Lake CPU nodes.
Intel will be able to utilize libxsmm and AVX-512, while AMD will have the core-count advantage. It will be an interesting comparison.


From nvidia-smi, the power draw was in the range of 230-240 W (out of a max of 250 W). But that doesn’t account for cooling etc.; this is purely the device. The temperature of the GPU before I started was approximately 30°C and stabilised at 64-65°C. So a significant amount of heat has to be removed, and the ambient temperature of Texas, where the machine is located, won’t be helping.

The M1 results were obtained on a MacBook Pro 16GB, with the only real difference for this application being that this model has active cooling.

If you could run this benchmark, I would be interested to know what the performance is on the hardware you have access to.

Some things to note: Freddie recently made a PR to libxsmm that might help the tensor-product element performance here. Also, Semih has been doing some great cache-blocking work for the CPU backend, which will help things, but that isn’t quite mainline-ready yet.


Sorry for the late response here. We had a memory configuration issue with our AMD node that was killing performance.

I ran tests on a dual-socket AMD EPYC 7H12 (64c, 8 memory channels) node, which is the top-of-the-line Rome SKU. For reference, this SKU is claimed to approach the performance of the top-bin Milan CPUs (within about 15%). I also tested on an Intel Xeon Gold 6240 (18c, 6 memory channels) node.

Both installations used PyFR v1.12.2, GCC v10.2.0, libxsmm (at the commit suggested in the performance guide), Open MPI v4.1.1, and UCX v1.11.0. All packages were compiled from source using Spack and built to target the specific microarchitecture.

All tests were run up to t=1.0 with the NaN checker and integrate plugins disabled. One MPI rank per socket was used.

  1. AMD EPYC 7H12

                Single             Double
Runtime [s]     943                2058
DoF             (40\times4)^3      (40\times4)^3
RHS             4\times10^4        4\times10^4
No. CPUs        2                  2
s/DoF/RHS/CPU   2.88\times10^{-9}  6.28\times10^{-9}

  2. Intel Xeon Gold 6240

                Single             Double
Runtime [s]     1740               3090
DoF             (40\times4)^3      (40\times4)^3
RHS             4\times10^4        4\times10^4
No. CPUs        2                  2
s/DoF/RHS/CPU   5.31\times10^{-9}  9.43\times10^{-9}

Summarizing the results, the AMD Rome (Zen 2) based node outperforms the Intel Cascade Lake node by ~50% at DP and ~85% at SP for this case. As AMD holds a 3.5× core-count advantage as well as 33% more memory channels, I expected it to outperform the Intel node by more. It is possible that further optimizations could push AMD further ahead, such as using their own AOCC compiler and finely tuned OpenMP thread placement.

Compared to the A100 GPU, assuming ideal scaling, ~8 Zen 2 CPUs (4 dual-socket nodes) would be needed to match the SP performance of a single A100 card.
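
The arithmetic behind that estimate, as a quick sketch:

# How many Rome CPUs match one A100 at single precision?
ndof = 40**3 * 4**3            # solution points
nrhs = 4e4                     # RHS evaluations

rate_a100 = ndof * nrhs / 216.256   # DoF*RHS per second on one A100
rate_cpu = ndof * nrhs / 943 / 2    # per CPU (two sockets share the node)
print(f'{rate_a100 / rate_cpu:.1f}')  # -> 8.7, i.e. ~8 CPUs / 4 nodes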

@WillT, could you re-verify the conclusions from your M1 performance blog post? My quick calculation shows that 21 M1 chips would be needed to match the SP performance of an A100, not 4.6. Could be mistaken, of course. Thanks.

Mike


@mlaufer, yep, you are quite right on the M1 chip numbers. I previously made an error and clearly forgot to propagate the correction to that number. It should be ~21 M1s to one A100.

The AMD performance is held back by the lack of support for AVX-512. Specifically, GCC is unable (or unwilling) to vectorise the interface flux kernel on AVX2 CPUs. This represents a big hit, especially at single precision, where vectorisation is even more important. Additionally, libxsmm only has limited support for sparse kernels on AVX2 systems (and the only reason it exists at all is because I backported the AVX-512 code). For the TGV case at a reasonable order this might not be an issue; however, at higher orders libxsmm is certainly forced to fall back to dense kernels. It may therefore be worth trying GiMMiK on the AMD system.

Some care is also needed around how threads are placed. For the AMD CPUs you probably want one MPI rank per CCX (4 cores) with appropriate pinning.

Regards, Freddie.

