TGV Performance Numbers

I was recently asked how Ampere GPUs compare to CPUs. I have previously given some performance numbers, but decided to put together more of a benchmark case for comparison.

I ran a 40^3 hex mesh with p=3, which gives about 4.1 million solution points. The case was, of course, the TGV, and the ini file is below. Given that we aren’t interested in the physics here, I only ran it to t=5.

These results were obtained with PyFR v1.10 on a single A100, using CUDA 11.0.2 and GCC 9.2.0.

                 Single              Double
Runtime [s]      2069.735            3183.357
DoF              (40\times4)^3       (40\times4)^3
RHS              2\times10^5         2\times10^5
No. GPUs         1                   1
s/DoF/RHS/GPU    2.53\times10^{-9}   3.89\times10^{-9}
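For anyone reproducing these numbers, the metric in the last row is just runtime divided by solution points, RHS evaluations, and GPU count; a quick Python sketch using the values from the table:

```python
# Reproduce the s/DoF/RHS/GPU metric from the table above.
p = 3
n_elem = 40**3                # 40^3 hex elements
dof = n_elem * (p + 1)**3     # (40*4)^3 solution points (per variable)
rhs = 2 * 10**5               # number of RHS evaluations
n_gpu = 1

def cost(runtime):
    """Seconds per DoF per RHS evaluation per GPU."""
    return runtime / (dof * rhs * n_gpu)

print(cost(2069.735))  # single precision, ~2.53e-9
print(cost(3183.357))  # double precision, ~3.89e-9
```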

Below is the ini file for double precision and a link to the mesh.

If anyone else wants to run the case for comparison, I would be interested to see the performance numbers; I only really have access to GPU-oriented machines.

[backend]
precision = double
rank-allocator = linear

[backend-cuda]
device-id = local-rank
# gimmik-max-nnz is not required in later pyfr versions
gimmik-max-nnz = 512
mpi-type = cuda-aware
block-1d = 64
block-2d = 128

[constants]
gamma = 1.4
mu = 6.25e-4
Pr = 0.71
Ps = 111.607

[solver]
system = navier-stokes
order = 3
anti-alias = none
viscosity-correction = none
shock-capturing = none

[solver-time-integrator]
scheme = rk4
controller = none
tstart = 0
tend = 5.01
dt = 1e-4

[solver-interfaces]
riemann-solver = rusanov
# These ldg settings will hit the interface communication a little harder
ldg-beta = 0.0
ldg-tau = 0.0

[solver-interfaces-quad]
flux-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre

[solver-elements-hex]
soln-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre

[soln-plugin-nancheck]
nsteps = 10

[soln-plugin-integrate]
nsteps = 50
file = integral_fp64.csv
header = true
vor1 = (grad_w_y - grad_v_z)
vor2 = (grad_u_z - grad_w_x)
vor3 = (grad_v_x - grad_u_y)

int-E = rho*(u*u + v*v + w*w)
int-enst = rho*(%(vor1)s*%(vor1)s + %(vor2)s*%(vor2)s + %(vor3)s*%(vor3)s)

[soln-plugin-writer]
dt-out = 5
# depending on the environment you might want to change this
basedir = .
basename = nse_fp64_tgv_3d_p3-{t:.2f}

[soln-ics]
rho = 1
u = sin(x)*cos(y)*cos(z)
v = -cos(x)*sin(y)*cos(z)
w = 0
p = Ps + (1.0/16.0)*(cos(2*x) + cos(2*y))*(cos(2*z + 2))
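As a side note, the velocity field in [soln-ics] is divergence-free, which is easy to sanity-check numerically; a small pure-Python sketch (the sample point is arbitrary):

```python
import math

# TGV initial velocity field from [soln-ics]; w = 0, so it contributes
# nothing to the divergence.
u = lambda x, y, z: math.sin(x) * math.cos(y) * math.cos(z)
v = lambda x, y, z: -math.cos(x) * math.sin(y) * math.cos(z)

def divergence(x, y, z, h=1e-6):
    """Central-difference estimate of du/dx + dv/dy at a point."""
    dudx = (u(x + h, y, z) - u(x - h, y, z)) / (2 * h)
    dvdy = (v(x, y + h, z) - v(x, y - h, z)) / (2 * h)
    return dudx + dvdy

print(abs(divergence(0.3, 0.7, 1.1)))  # ~0, up to rounding error
```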

Google drive link to gmsh mesh file


FYI the runtime and RHS numbers come from the solution file, using a command similar to: $ h5dump -d stats nse_tgv_3d_p3-5.00.pyfrs
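If you prefer to stay in Python, the stats record is an ini-formatted string that can be read with h5py and parsed with configparser; a sketch, assuming a stats layout like the example below (the field names here are placeholders and may vary between PyFR versions):

```python
import configparser

# In practice you would pull the string out with h5py, e.g.
#   stats = h5py.File('nse_tgv_3d_p3-5.00.pyfrs')['stats'][()].decode()
# Here we parse a made-up example of what h5dump prints (field names
# are placeholders):
stats = """
[solver-time-integrator]
nsteps = 50100
nfevals = 200400
wall-time = 3183.357
"""

cfg = configparser.ConfigParser()
cfg.read_string(stats)

sect = cfg['solver-time-integrator']
rhs = sect.getint('nfevals')
runtime = sect.getfloat('wall-time')

# Per-GPU cost for the (40*4)^3 solution-point case on one GPU
metric = runtime / (160**3 * rhs)
print(metric)
```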


I am unsure about these results. For reference, running with a hex mesh on a V100, Semih and I have found:

[chart: V100 throughput in GDoF/s]

where the axis is in GDoF/s, which corresponds to the reciprocal of your numbers in ns. I am also unsure about the factor of 4 in the DoF calculation (should it not be 5?)

Regards, Freddie.

The 4 is from p+1. When you do your calculation, do you count each variable as a DoF? If so, this should be 5\times(40\times4)^3.

Ah, you’re running p = 3, not p = 4. However, I am still puzzled. My V100 p = 3 results are at 1.61 GDoF/s, including the factor of 5 from the variables. Converting to your units we have 1/(1.61e9 / 5) = 3.1e-9. Thus a V100 is comfortably outperforming an A100 here.

Regards, Freddie.

Hmm, did you get 1.61 GDoF/s for this case?

Yes, which is in line with what ZEFR gets (see Table 6 of https://www.sciencedirect.com/science/article/pii/S0010465520300229).

Regards, Freddie.

Interesting, on a V100 with single precision I get a runtime of 3123.431 s, giving 3.81 ns/DoF/RHS.

So in your units I get:

  • A100 FP32: 1.98 GDoF/s
  • V100 FP32: 1.31 GDoF/s

On our V100 machine (single GPU) we obtain the following for the TGV test case with RK45 time integration and double precision (data measured in ns/(Nvar·DoF)/RHS/GPU):

[image: measured ns/(Nvar·DoF)/RHS/GPU per polynomial order]

which is in line with the ZEFR data Freddie cited (they obtain 0.6 at p=3 and 0.65 at p=4). If I transform my data units to yours (i.e. by multiplying by 5) I obtain 3.35 ns/DoF/RHS/GPU. This indicates that V100 GPUs might be more performant than A100 GPUs in this case.

PS: Are 2e5 RHS evaluations needed to measure the performance? Averaging over a few hundred RHS evaluations after an arbitrary warm-up period was sufficient in my tests. BTW, does your runtime measurement include the initialization process? (It might be negligible since you performed 2e5 RHS evaluations.) A while ago I wrote a small performance plugin which measures the same performance parameter after a given number of time steps; I might open a small PR with it in the future.


I’d like to know what the RHS represents in this case. Is it the initial vorticity? How is it calculated?

RHS is what we use to refer to the spatial scheme. In this case, as we are using Navier–Stokes, this comes from the baseadvecdiff system class. See here.

We use this notation as you typically see a PDE written as:

\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = s

which, to apply the method of lines, we rearrange to

\frac{\partial u}{\partial t} = -\frac{\partial f(u)}{\partial x} + s

The right-hand side in PyFR is handled using FR.
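To make the method-of-lines picture concrete, here is a toy sketch (not PyFR code): the spatial scheme is wrapped up in a single rhs(u) function, and the time integrator only ever calls that:

```python
import math

# Toy ODE du/dt = rhs(u) with rhs(u) = -u, exact solution u0*exp(-t).
# In PyFR, rhs would instead be the full FR evaluation of -df(u)/dx + s.
def rhs(u):
    return -u

def rk4_step(u, dt):
    """One classical RK4 step; note it costs four rhs evaluations."""
    k1 = rhs(u)
    k2 = rhs(u + 0.5 * dt * k1)
    k3 = rhs(u + 0.5 * dt * k2)
    k4 = rhs(u + dt * k3)
    return u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

u, dt = 1.0, 0.01
for _ in range(100):       # integrate to t = 1
    u = rk4_step(u, dt)
print(u, math.exp(-1))     # both ~0.3679
```

This is also where the RHS counts in the tables come from: rk4 makes four RHS evaluations per step, so roughly 5\times10^4 steps at dt = 1e-4 gives the quoted 2\times10^5 evaluations.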

In other news, Freddie and I got to the bottom of the different performance numbers. In the ini file I provided, I was both using the integrate plugin and frequently running the NaN checker. As the NaN check transfers the solution from the GPU to the CPU via PCIe to then do a summation, this was really hurting performance. When I have some time in the next few days I will re-run the A100 tests.

I reran the case for slightly less time (tend = 1), and also turned off the integrate and NaN checker plugins.

Here are the results for an A100 at p=3, for a 40^3 element mesh.

                 Single              Double
Runtime [s]      216.256             355.794
DoF              (40\times4)^3       (40\times4)^3
RHS              4\times10^4         4\times10^4
No. GPUs         1                   1
s/DoF/RHS/GPU    1.320\times10^{-9}  2.172\times10^{-9}

Converting these to Freddie’s units gives:

  • FP32: 3.788 GDoF/s
  • FP64: 2.302 GDoF/s
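The two conventions used in this thread are reciprocals of one another, up to the factor of five flow variables; a small conversion sketch using the numbers above:

```python
NVARS = 5  # rho, rho*u, rho*v, rho*w, E for 3D Navier-Stokes

def to_gdofs(s_per_dof_per_rhs):
    """Convert s/DoF/RHS (per solution point) to GDoF/s (all variables)."""
    return NVARS / s_per_dof_per_rhs / 1e9

print(to_gdofs(1.320e-9))  # A100 FP32, ~3.79 GDoF/s
print(to_gdofs(2.172e-9))  # A100 FP64, ~2.30 GDoF/s
```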

So the A100, as expected, compares favourably with the V100; but given that the memory bandwidth is ~70% higher on the A100, I would’ve expected more.

Now either the case is too small to keep the GPU busy, or some kernel isn’t performing well on the new hardware. I have some more data on this, but it is getting a little out of scope for this topic, so I will start a new thread in due course to discuss it further, if people are interested.

The moral of the story here is that the plugins can have significant overhead.
(We might want to think about moving the NaN checker onto the device for GPU backends to avoid the PCIe bottleneck.)


Regarding CPU backends for PyFR: given the current cost of an A100, I got to wondering how many 32-core Zen CPUs it would take to match a single A100. For unsteady compressible NS, if we also factor in the larger time steps available to implicit steppers, benchmarking gets tricky, but interesting!

Yeah, I’d like to know how their performance stacks up. Sadly, I don’t have access to any AMD CPU hardware, but shortly I should be able to share some MI100 performance numbers. If you’re interested, in the meantime you might like to take a look at this, where I show some performance numbers for the Mac M1 SoC.

At the end of the day, though, PyFR is most often memory-bandwidth bound, so a simple ratio of bandwidths will give you a reasonable first-order approximation of peak performance.
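As a rough illustration of that, taking nominal spec-sheet bandwidths (these figures are my assumptions, and achieved bandwidth will be lower on both devices):

```python
# Nominal peak memory bandwidths in GB/s (spec-sheet figures; achieved
# bandwidth, e.g. from a STREAM-like benchmark, will be somewhat lower).
bw = {'V100': 900, 'A100': 1555}

# Measured V100 FP64 throughput from earlier in the thread, in GDoF/s.
v100_fp64 = 1.61

# First-order estimate for the A100 by bandwidth ratio alone.
est_a100 = v100_fp64 * bw['A100'] / bw['V100']
print(est_a100)  # ~2.78 GDoF/s predicted
```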


Will,
The M1 numbers are seriously impressive for a sub 15W chip.
A couple of questions:

  1. What is the measured power usage for the A100 during the run (from nvidia-smi)?
  2. What Apple model are you running? MacBook Air? Pro? Mac Mini? iMac?

In regards to CPU performance, I will try to set up the cases on the hardware I have access to. We have a dual-socket 32-core AMD Rome CPU node, as well as dual-socket 18-core Intel Cascade Lake CPU nodes.
Intel will be able to utilize libxsmm and AVX-512, while AMD will have the core-count advantage. It will be an interesting comparison.


From nvidia-smi the power draw was in the range of 230-240 W (out of a max of 250 W), but that doesn’t account for cooling etc.; it is purely the device. The temperature of the GPU before I started was approx. 30°C and stabilised at 64-65°C. So a significant amount of heat has to be dissipated, and the ambient temperature of Texas, where the machine is located, won’t be helping.

The M1 results were obtained on a MacBook Pro 16GB, with the only real difference for this application being that this model has active cooling.

If you could run this benchmark, I would be interested to know what the performance is on the hardware you have access to.

Some things to note: Freddie recently made a PR to libxsmm that might help tensor-product element performance here. Also, Semih has been doing some great cache-blocking work for CPUs, which will help things, but that isn’t quite mainline-ready yet.


Sorry for the late response here. We had a memory configuration issue with our AMD node that was killing performance.

I ran tests on a dual-socket AMD EPYC 7H12 (64c, 8 memory channels) node, which is the top-of-the-line Rome SKU. For reference, this SKU is claimed to come within about 15% of the performance of the top-bin Milan CPUs. I also tested on an Intel Xeon Gold 6240 (18c, 6 memory channels) node.

Both installations used PyFR v1.12.2, GCC v10.2.0, libxsmm (at the commit suggested in the performance guide), Open MPI v4.1.1, and UCX v1.11.0. All packages were compiled from source using Spack and built to target the specific microarchitecture.

All tests were run up to t=1.0 with the NaN checker and integrate plugins disabled, using 1 MPI rank per socket.

  1. AMD EPYC 7H12

                 Single              Double
Runtime [s]      943                 2058
DoF              (40\times4)^3       (40\times4)^3
RHS              4\times10^4         4\times10^4
No. CPUs         2                   2
s/DoF/RHS/CPU    2.88\times10^{-9}   6.28\times10^{-9}

  2. Intel Xeon Gold 6240

                 Single              Double
Runtime [s]      1740                3090
DoF              (40\times4)^3       (40\times4)^3
RHS              4\times10^4         4\times10^4
No. CPUs         2                   2
s/DoF/RHS/CPU    5.31\times10^{-9}   9.43\times10^{-9}

Summarizing the results, the AMD Rome (Zen 2) node outperforms the Intel Cascade Lake node by ~50% at DP and ~100% at SP for this case. As AMD holds a 3.5X core-count advantage as well as 33% more memory channels, I expected it to outperform the Intel node by more. It is possible that further optimizations, such as using AMD’s own AOCC compiler and finely tuned OpenMP thread placement, could push AMD further ahead.

Compared to the A100 GPU, and assuming ideal scaling, ~8 Zen 2 CPUs (4 dual-socket nodes) would be needed to match the SP performance of a single A100 card.
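The ~8 figure can be checked from the single-precision runtimes above, assuming ideal scaling:

```python
# Single-precision wall times for the 40^3, p=3 case run to t=1.
a100_time = 216.256       # one A100
epyc_node_time = 943.0    # dual-socket EPYC 7H12 node (2 CPUs)

# Assuming ideal scaling, one CPU alone would take twice the node time,
# so the number of CPUs needed to match one A100 is:
n_cpus = 2 * epyc_node_time / a100_time
print(n_cpus)  # ~8.7, i.e. roughly 8-9 CPUs (4-5 dual-socket nodes)
```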

@WillT, could you re-verify the conclusions from your M1 performance blog post? My quick calculation shows that 21 M1 chips would be needed to match the SP performance of an A100, not 4.6. I could be mistaken of course. Thanks.

Mike


@mlaufer, yep, you are quite right on the M1 chip numbers. I previously made an error but clearly forgot to propagate the correction to that number. It should be ~21 M1s to one A100.

The AMD performance is held back by the lack of support for AVX-512. Specifically, GCC is unable (or unwilling) to vectorise the interface flux kernel on AVX2 CPUs. This represents a big hit, especially at single precision where the vectorisation is even more important. Additionally, libxsmm only has limited support for sparse kernels on AVX2 systems (and the only reason it exists at all is because I backported the AVX-512 code). For the TGV case at a reasonable order this might not be an issue, however certainly at higher orders libxsmm is forced to fall back to dense kernels. It may therefore be worth trying GiMMiK on the AMD system.

Some care is also needed around how threads are placed. For the AMD CPUs you probably want one MPI rank per CCX (4 cores) with appropriate pinning.

Regards, Freddie.
