# TGV Performance Numbers

Yes, which is in-line with what ZEFR gets (see Table 6 of https://www.sciencedirect.com/science/article/pii/S0010465520300229).

Regards, Freddie.

Interesting, on a V100 with single precision I get a runtime of 3123.431 s, giving 3.81 ns/DoF/RHS.

So in your units I get:

• A100 FP32: 1.98 GDoF/s
• V100 FP32: 1.31 GDoF/s

On our V100 machine (single GPU) we obtain the following for the TGV test case with RK45 time integration and double precision (data measured in ns/(Nvar * DoF)/RHS/GPU),

which is in line with the ZEFR data, as Freddie stated (they obtain 0.6 with p=3 and 0.65 with p=4). If I convert my data to your units (i.e. by multiplying by 5) I obtain 3.35 ns/DoF/RHS/GPU. This suggests that V100 GPUs might be more performant than A100 GPUs in this case.

PS: Are 2e5 RHS evaluations really needed to measure the performance? Some averaging over hundreds of RHS evaluations after an arbitrary warm-up period was sufficient in my tests. BTW, does your runtime measurement include the initialisation process? (It might be negligible since you performed 2e5 RHS evaluations.) A while ago I wrote a small performance plugin which measures the same performance parameter after a given number of time-step evaluations. I might write a small PR with it in the future.
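A minimal sketch of that warm-up-then-average timing approach (generic Python, not PyFR's plugin API; `rhs` and `u` are placeholders):

```python
import time

def time_rhs(rhs, u, nwarm=100, nmeas=500):
    """Average wall time per RHS evaluation after a warm-up period."""
    for _ in range(nwarm):      # warm up: JIT compilation, caches, clocks
        rhs(u)
    start = time.perf_counter()
    for _ in range(nmeas):      # average over a few hundred evaluations
        rhs(u)
    return (time.perf_counter() - start) / nmeas
```

The ns/DoF/RHS metric then follows as `time_rhs(...) / ndof * 1e9`.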

1 Like

I’d like to know what the RHS represents in this case. Is it the initial vorticity? How is it calculated?

RHS is what we use to refer to the spatial scheme. In this case, as we are using Navier–Stokes, this comes from the baseadvecdiff system class. See here.

We use this notation as you typically see a PDE written as:

\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = s

which then to apply the method of lines we rearrange to

\frac{\partial u}{\partial t} = -\frac{\partial f(u)}{\partial x} + s

The right-hand side in PyFR is handled using FR.
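As a toy illustration of the method of lines (central differences for linear advection on a periodic grid — not PyFR's FR discretisation), the rearranged equation is advanced by feeding the spatial RHS to a Runge-Kutta stepper:

```python
import numpy as np

# Toy problem: u_t + a u_x = 0, periodic, smooth initial condition.
a, n = 1.0, 100
x = np.linspace(0.0, 1.0, n, endpoint=False)
dx = x[1] - x[0]
u = np.sin(2 * np.pi * x)

def rhs(u):
    # Spatial scheme: -a du/dx via periodic central differences
    return -a * (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)

dt = 0.2 * dx
for _ in range(100):
    # Classic RK4 stage loop applied to du/dt = rhs(u)
    k1 = rhs(u)
    k2 = rhs(u + 0.5 * dt * k1)
    k3 = rhs(u + 0.5 * dt * k2)
    k4 = rhs(u + dt * k3)
    u = u + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
```

Each RHS evaluation here plays the same role the "RHS" counts in the performance tables do: one application of the spatial scheme.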

In other news, Freddie and I got to the bottom of the differing performance numbers. In the ini file I provided, I was both using the integrate plugin and frequently running the NaN checker. As the NaN check transfers the solution from the GPU to the CPU via PCIe in order to do a summation, this was really hurting performance. When I have some time in the next few days I will re-run the A100 tests.
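The summation trick the NaN checker relies on can be seen in a few lines of NumPy: a NaN anywhere in the array propagates through the reduction, so only one scalar has to be inspected — but the whole solution still has to cross PCIe first:

```python
import numpy as np

def has_nan(u):
    # A single reduction: any NaN in u propagates to the scalar sum,
    # so only one value needs checking after the device-to-host copy.
    return bool(np.isnan(np.sum(u)))

u = np.zeros(1000)
print(has_nan(u))   # False
u[123] = np.nan
print(has_nan(u))   # True
```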

I reran the case, for slightly less time (tend=1), and also turned off the integrate and NaN checker plugins.

Here are the results for an A100 at p=3, for a 40^3 element mesh.

| | Single | Double |
| --- | --- | --- |
| Runtime [s] | 216.256 | 355.794 |
| DoF | (40\times4)^3 | (40\times4)^3 |
| RHS | 4\times10^4 | 4\times10^4 |
| No. GPUs | 1 | 1 |
| s/DoF/RHS/GPU | 1.320\times10^{-9} | 2.172\times10^{-9} |

Converting these to Freddie’s units this gives:

• FP32: 3.788 GDoF/s
• FP64: 2.302 GDoF/s
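For reference, the arithmetic behind the table and the GDoF/s conversion (assuming Nvar = 5 conserved variables, as in the earlier posts):

```python
ndof = (40 * 4) ** 3   # 40^3 elements at p=3 -> (40*(p+1))^3 solution points
nrhs = 4e4             # number of RHS evaluations
nvar = 5               # conserved variables for 3D Navier-Stokes

for label, runtime in [("FP32", 216.256), ("FP64", 355.794)]:
    s_per = runtime / (ndof * nrhs)              # s/DoF/RHS/GPU
    gdofs = nvar * ndof * nrhs / runtime / 1e9   # GDoF/s, counting Nvar * DoF
    print(f"{label}: {s_per:.3e} s/DoF/RHS, {gdofs:.3f} GDoF/s")
```

This reproduces both rows of the table and the bullet points above.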

So the A100, as expected, compares favourably with the V100, but given that the bandwidth is ~50% higher on the A100 I would have expected more.

Now either the case is too small to keep the GPU busy or some kernel isn’t performing well on the new hardware. I have some more data on this but this is getting a little out of scope for this topic, so will start a new thread in due course to discuss this further, if people are interested.

The moral of the story here is that the plugins can have significant overhead.
(We might want to think about moving the NaN checker onto the device for GPU backends to avoid the PCIe bottleneck.)

2 Likes

Regarding CPU backends for PyFR: given the current cost of an A100, I got to wondering how many 32-core Zen CPUs it would take to match a single A100. For unsteady compressible NS, if we also factor in larger time steps for implicit steppers, benchmarking gets tricky, but interesting!

Yeah, I’d like to know how their performance stacks up. Sadly, I don’t have access to any AMD CPU hardware, but shortly I should be able to share some MI100 performance numbers. If you’re interested, you might like to take a look at this in the meantime, where I show some performance numbers for the Mac M1 SoC.

At the end of the day though PyFR is most often memory bandwidth bound, so a simple ratio of bandwidth will give you a reasonable first order approximation of peak performance.

1 Like

Will,
The M1 numbers are seriously impressive for a sub 15W chip.
A couple of questions:

1. What is the measured power usage for the A100 during the run (from nvidia-smi)?
2. What Apple model are you running? MacBook Air? Pro? Mac mini? iMac?

In regards to CPU performance, I will try to set up the cases on the hardware I have access to. We have a dual-socket 32-core AMD Rome CPU node, as well as dual-socket 18-core Intel Cascade Lake CPU nodes.
Intel will be able to utilize libxsmm and AVX-512, while AMD will have the core-count advantage. It will be an interesting comparison.

1 Like

From nvidia-smi the power draw was in the range of 230-240 W (out of a max of 250 W). But that doesn’t account for cooling etc.; this is purely the device. The temperature of the GPU before I started was approx 30°C and it stabilised at 64-65°C. So a significant amount of heat has to be dissipated, and the ambient temperature of Texas, where the machine is located, won’t be helping.

The M1 results were obtained on a MacBook Pro 16GB, with the only real difference for this application being that this model has active cooling.

If you could run this benchmark, I would be interested to know what the performance is on the hardware you have access to.

Some things to note: Freddie recently made a PR to libxsmm that might help tensor-product element performance here. Also, Semih has been doing some great cache-blocking work for the CPU backend, which will help things, but that isn’t mainline-ready quite yet.

1 Like

Sorry for the late response here. We had a memory configuration issue with our AMD Node that was killing performance.

I ran tests on a dual-socket AMD EPYC 7H12 (64c, 8 memory channels) node, which is the top-of-the-line Rome SKU; for reference, this SKU is claimed to come within about 15% of the top-bin Milan CPUs. I also tested on an Intel Xeon Gold 6240 (18c, 6 memory channels) node.

Both installations used PyFR v1.12.2, GCC v10.2.0, libxsmm (at the commit suggested in the performance guide), Open MPI v4.1.1, and UCX v1.11.0. All packages were compiled from source using Spack and built to target the specific microarchitecture.

All tests were run up to t=1.0 with the NaN checker and integration plugins disabled. One MPI rank per socket was used.

1. AMD EPYC 7H12

| | Single | Double |
| --- | --- | --- |
| Runtime [s] | 943 | 2058 |
| DoF | (40\times4)^3 | (40\times4)^3 |
| RHS | 4\times10^4 | 4\times10^4 |
| No. CPUs | 2 | 2 |
| s/DoF/RHS/CPU | 2.88\times10^{-9} | 6.28\times10^{-9} |

2. Intel Xeon Gold 6240

| | Single | Double |
| --- | --- | --- |
| Runtime [s] | 1740 | 3090 |
| DoF | (40\times4)^3 | (40\times4)^3 |
| RHS | 4\times10^4 | 4\times10^4 |
| No. CPUs | 2 | 2 |
| s/DoF/RHS/CPU | 5.31\times10^{-9} | 9.43\times10^{-9} |

Summarizing the results, the AMD Rome (Zen 2) based node outperforms the Intel Cascade Lake node by ~50% at DP and ~100% at SP for this case. As AMD holds a ~3.5x core-count advantage as well as 33% more memory channels, I expected it to outperform the Intel node by more. It is possible that further optimizations could push AMD further ahead, such as using their own AOCC compiler and finely tuned OpenMP thread placement.

Compared to the A100 GPU, assuming ideal scaling, ~8 Zen 2 CPUs (4 dual-socket nodes) would be needed to match the SP performance of a single A100 card.
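A quick sanity check of that ratio, with per-CPU throughput derived from the dual-socket EPYC run above (Nvar = 5, as before):

```python
nvar, ndof, nrhs = 5, (40 * 4) ** 3, 4e4

a100_sp = 3.788                                  # GDoF/s, from the A100 table
epyc_sp = nvar * ndof * nrhs / 943 / 1e9 / 2     # per CPU; 943 s on 2 sockets
print(a100_sp / epyc_sp)                         # ~8.7 Zen 2 CPUs per A100 at SP
```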

@WillT, could you re-verify the conclusions from your M1 performance blog post? My quick calculations show that ~21 M1 chips would be needed to match the SP performance of an A100, not 4.6. I could be mistaken of course. Thanks.

Mike

1 Like

@mlaufer yep, you are quite right on the M1 chip numbers. I previously made an error and clearly forgot to propagate the correction to that number. It should be ~21 M1s to one A100.

The AMD performance is held back by the lack of support for AVX-512. Specifically, GCC is unable (or unwilling) to vectorise the interface flux kernel on AVX2 CPUs. This represents a big hit, especially at single precision where the vectorisation is even more important. Additionally, libxsmm only has limited support for sparse kernels on AVX2 systems (and the only reason it exists at all is because I backported the AVX-512 code). For the TGV case at a reasonable order this might not be an issue, however certainly at higher orders libxsmm is forced to fall back to dense kernels. It may therefore be worth trying GiMMiK on the AMD system.

Some care is also needed around how threads are placed. For the AMD CPUs you probably want one MPI rank per CCX (4 cores) with appropriate pinning.

Regards, Freddie.

2 Likes

2.302 / 1.61 = 1.43, which seems close to 50%, so why do you say:

I also ran the case on an A100 and an MI100 (an AMD GPU). Their bandwidths are 1.55 TB/s and 1.2 TB/s respectively, but the performance numbers are 2.379 GDoF/s and 2.020 GDoF/s; the MI100 with HIP is really slow. Do you have any tests with AMD GPUs?

Sure, I eyeballed it and it didn’t quite seem as much as 43% higher, but that still means we’re ~15% off from where we should be.

So are those numbers single and double precision for the MI100? I have tested PyFR on the MI100 previously and don’t remember it being that far off. In other topics you were talking about forcing cuBLAS rather than GiMMiK; are you doing something similar with rocBLAS?

How long are you running the cases for and how are you getting the execution time?

All tests were run up to t=1.0 with the NaN checker and integration plugins disabled, in double precision, using PyFR 1.12.3 in the cuda:11.4.0-devel-ubuntu20.04 environment without any changes. The execution time was obtained via `h5dump -d stats nse_tgv_3d_p3-1.00.pyfrs`, i.e. just the wall time. Another interesting thing is that when I force cuBLAS or rocBLAS, the performance with rocBLAS is rather low:

| | A100 | cuBLAS | MI100 | rocBLAS |
| --- | --- | --- | --- | --- |
| Runtime [s] | 344.45 | 405.56 | 508.11 | 695.61 |
| DoF | 4.096e6 | 4.096e6 | 4.096e6 | 4.096e6 |
| RHS | 4.0e4 | 4.0e4 | 4.0e4 | 4.0e4 |
| No. GPUs | 1 | 1 | 1 | 1 |
| s/DoF/RHS/GPU | 2.102e-9 | 2.475e-9 | 3.101e-9 | 4.246e-9 |
| GDoF/s | 2.379 | 2.020 | 1.612 | 1.178 |


I wrote the wrong number before; in fact, the performance number with the MI100 is only 1.612 GDoF/s.

1 Like

Can you use the output time at t=0 to get the compile time? You can then subtract this from the runtime of the t=1 solution file to get the true runtime, just in case it is significant.