TGV Performance Numbers

WillT · 17 February 2022 07:54

3 posts were split to a new topic: Questions on the TGV case

WillT · 18 February 2022 10:11

A post was merged into an existing topic: Questions on the TGV case

luli · 8 March 2022 06:26

2.302 / 1.61 = 1.43, which seems close to 50%. Why do you say:

I also ran the case on an A100 and an MI100 (an AMD GPU). Their peak memory bandwidths are 1.55 TB/s and 1.2 TB/s respectively, but the performance numbers are 2.379 GDoF/s and 2.020 GDoF/s. The MI100 with HIP is really slow. Do you have any test results for AMD GPUs?

WillT · 8 March 2022 08:49

Sure, I eyeballed it and it didn’t quite seem as high as 43% higher, but that still means we are ~15% off from where we should be.

So are those numbers single and double precision for the Mi100? I have tested PyFR on the Mi100 previously and don’t remember the numbers being that far off. In other topics you were talking about forcing cublas rather than gimmik. Are you doing something similar with rocblas?

How long are you running the cases for and how are you getting the execution time?

luli · 8 March 2022 10:26

All tests were run up to t=1.0 with the NaN checker and integration plugins disabled, in double precision, using pyfr=1.12.3 in the cuda:11.4.0-devel-ubuntu20.04 environment without any changes. The execution time is just the wall time, obtained with "h5dump -d stats nse_tgv_3d_p3-1.00.pyfrs". Another interesting thing is that when I force cublas or rocblas, the performance with rocblas is rather low.

                    A100               cublas          MI100            rocblas
Runtime(s)          344.45             405.56          508.11           695.61
DoF                 4.096e6            4.096e6         4.096e6          4.096e6
RHS                 4.0e4              4.0e4           4.0e4            4.0e4
No.GPUs             1                  1               1                1
s/DoF/RHS/GPU       2.102e-9           2.475e-9        3.101e-9         4.246e-9
GDoF/s              2.379              2.020           1.612            1.178

I wrote the wrong number before; in fact, the performance number for the MI100 is only 1.612 GDoF/s.
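The rows of the table above can be reproduced with a few lines of Python. This is only a sketch: the function name `tgv_throughput` is hypothetical, it covers the single-GPU case only, and the default `nvars=5` (the number of Navier–Stokes variables folded into the GDoF/s figure) is an assumption inferred from the reported numbers.

```python
def tgv_throughput(runtime_s, ndof=4.096e6, nrhs=4.0e4, nvars=5):
    """Return (s/DoF/RHS/GPU, GDoF/s) for a single-GPU run.

    nvars=5 is an assumption: it is the factor needed to reproduce
    the GDoF/s row from the s/DoF/RHS/GPU row in the table.
    """
    s_per_dof_rhs = runtime_s / (ndof * nrhs)
    gdof_per_s = nvars / s_per_dof_rhs / 1e9
    return s_per_dof_rhs, gdof_per_s


# Runtimes (s) as reported in the table above.
runs = {"A100": 344.45, "cublas": 405.56, "MI100": 508.11, "rocblas": 695.61}

for name, runtime in runs.items():
    t, g = tgv_throughput(runtime)
    print(f"{name:8s} {t:.3e} s/DoF/RHS/GPU  {g:.3f} GDoF/s")
```

The printed values should agree with the table to within rounding of the last digit.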

WillT · 8 March 2022 11:02

Can you use the timing data at t=0 to get the compile time? You can then subtract this from the runtime in the t=1 solution file to get the true run time, just in case it is significant.
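A minimal sketch of that subtraction, assuming the stats record dumped by h5dump is INI-style text with a `wall-time` key under a `[solver-time-integrator]` section. Both names are assumptions about the file layout, so check them against your own dump. The helper names and the sample strings below are hypothetical, with illustrative values only.

```python
from configparser import ConfigParser


def wall_time(stats_text, section="solver-time-integrator", key="wall-time"):
    """Extract the wall time (s) from an INI-style stats record.

    The section and key names are assumptions about the stats layout.
    """
    cfg = ConfigParser()
    cfg.read_string(stats_text)
    return cfg.getfloat(section, key)


def true_runtime(stats_t0, stats_t1):
    """Wall time at t=1 minus wall time at t=0 (setup and compile cost)."""
    return wall_time(stats_t1) - wall_time(stats_t0)


# Hypothetical stats records with illustrative values.
stats_t0 = "[solver-time-integrator]\ntcurr = 0.0\nwall-time = 23.31\n"
stats_t1 = "[solver-time-integrator]\ntcurr = 1.0\nwall-time = 341.16\n"

print(f"true run time: {true_runtime(stats_t0, stats_t1):.2f} s")
```

Parsing the text rather than opening the file directly keeps the sketch independent of how the record is read out of the solution file.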

luli · 9 March 2022 03:43

                gimmik-cuda     cublas      gimmik-hip      rocblas
Setuptime       23.31           20.82       23.68           29.54
Walltime        341.16          405.25      503.62          612.61
Runtime         317.85          384.43      479.94          583.07
DoF             4.10E+06        4.10E+06    4.10E+06        4.10E+06
RHS             4.00E+04        4.00E+04    4.00E+04        4.00E+04
No.GPUs         1               1           1               1
s/DoF/RHS/GPU   1.94E-09        2.35E-09    2.93E-09        3.56E-09
Gdof/s          2.577           2.131       1.707           1.405

The A100 is about 50% faster than the AMD MI100 (2.577 vs 1.707 GDoF/s), i.e. the MI100 is about 34% slower.

WillT · 9 March 2022 16:19

@rishit2307 ran this test for me on an AMD Mi100 system and we got the following:

                       single                 double
GDoF/s                 3.43                   1.76
s/DoF(grid)/RHS/GPU    $1.46\times10^{-9}$    $2.84\times10^{-9}$

The first row includes a factor of 5 in the DoF for the number of variables in Navier–Stokes; the second row doesn’t include this. So given the A100 results I reported above, the Mi100 results we get are about where I would expect.
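As a quick check, and assuming the two rows differ only by the stated factor of 5, the relation between them can be written as follows, where $t$ is the second-row value:

```latex
\[
\text{GDoF/s} = \frac{5}{t \times 10^{9}}, \qquad
t = \text{s/DoF(grid)/RHS/GPU}
\]
\[
\text{double: } \frac{5}{2.84\times10^{-9} \times 10^{9}} \approx 1.76, \qquad
\text{single: } \frac{5}{1.46\times10^{-9} \times 10^{9}} \approx 3.42
\]
```

The single-precision value matches the tabulated 3.43 to within rounding of $t$.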

Are you running with the NaN checker? If so, read the full topic above.

luli · 10 March 2022 02:34

Well, that seems rather close to my results: @rishit2307 gets 1.76 GDoF/s and I get 1.707 GDoF/s, both in double precision.

luli · 10 March 2022 03:39

I reran the case and got this:

                A100 single   A100 double   Mi100 single   Mi100 double
Setuptime       23.85         23.31         22.9           24.68
Walltime        194.74        341.16        269.6          501.28
Runtime         170.89        317.85        246.7          476.6
s/DoF/RHS/GPU   1.04E-09      1.94E-09      1.51E-09       2.91E-09
Gdof/s          4.794         2.577         3.321          1.719

The results with the A100 are better than yours?

WillT · 10 March 2022 08:42

Ok, I could decipher your previous numbers, but these are much clearer. It does seem your performance is about where we would expect.

There have been improvements to PyFR and the compilers since I ran this case on an A100, so it isn’t surprising that your newer performance numbers are better.

luli · 11 March 2022 07:53

However, the results show that the performance is not getting better on AMD hardware. Considering that the A100’s peak bandwidth is only about 30% higher than the Mi100’s (Mi100: 1.2 TB/s, A100: 1.55 TB/s), the performance is not what I expected. Could it be related to the compiler? Or does the HIP backend need to be further optimized?
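One way to make this comparison concrete is to normalise the measured throughput by peak memory bandwidth. The sketch below uses the double-precision GDoF/s figures from the rerun table and the bandwidth figures quoted in this thread; the helper name `per_bandwidth` is hypothetical, and peak bandwidth is only a rough proxy for achievable performance.

```python
def per_bandwidth(gdof_per_s, peak_tb_per_s):
    """Throughput per unit of peak memory bandwidth, GDoF/s per TB/s."""
    return gdof_per_s / peak_tb_per_s


# Double-precision results from the rerun table; bandwidths as quoted above.
a100 = per_bandwidth(2.577, 1.55)
mi100 = per_bandwidth(1.719, 1.20)

print(f"A100  : {a100:.3f} GDoF/s per TB/s")
print(f"Mi100 : {mi100:.3f} GDoF/s per TB/s")
print(f"Mi100 relative to A100: {mi100 / a100:.2f}")
```

On these numbers the Mi100 delivers roughly 86% of the A100's throughput per unit of peak bandwidth, so part of the raw gap is explained by bandwidth and part is not.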

WillT · 11 March 2022 09:22

Without profiling the kernels running on the Mi100 it is hard to say exactly what is going on.

It could be that the compiler isn’t optimising as well as nvrtc, it could be that the occupancy needs tweaks, it could be that there is some architecture quirk that we don’t know about. It could also be that there are some optimisation flags that we should pass to hiprtc, but aren’t. Sadly these options seem to be completely undocumented and the hiprtc source code isn’t very helpful.

luli · 5 July 2022 02:34

Does this .ini configuration correspond to the results in a published article?

WillT · 5 July 2022 10:18

The case is similar to some used in the following:

https://doi.org/10.1016/j.cpc.2020.107169
https://doi.org/10.1016/j.cpc.2021.108235
https://arxiv.org/abs/2111.07915
