TGV Performance Numbers

2.302 / 1.61 = 1.43, which seems close to 50%, so why do you say:

I also ran the case on an A100 and an MI100 (an AMD GPU). Their peak bandwidths are 1.55 TB/s and 1.2 TB/s respectively, but the performance numbers are 2.379 GDoF/s and 2.020 GDoF/s, so the MI100 with HIP is really slow. Do you have any tests with AMD GPUs?

Sure, I eyeballed it and it didn't quite seem as high as 43% higher, but that still means we are ~15% off from where we should be.
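
As a rough sanity check, assuming the code is bandwidth bound:

\frac{\text{A100 bandwidth}}{\text{MI100 bandwidth}} = \frac{1.55}{1.2} \approx 1.29 \quad\text{vs.}\quad \frac{\text{A100 GDoF/s}}{\text{MI100 GDoF/s}} \approx 1.43,

so the MI100 is delivering about 1.29 / 1.43 ≈ 0.9 of what pure bandwidth scaling would suggest, i.e. of the order of 10–15% short.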

So are those numbers single and double precision for the MI100? I have tested PyFR on an MI100 previously and don't remember them being that far off. In other topics you were talking about forcing cuBLAS rather than GiMMiK. Are you doing something similar with rocBLAS?

How long are you running the cases for and how are you getting the execution time?

All tests were run up to t = 1.0 in double precision with the NaN checker and integration plugins disabled, using PyFR 1.12.3 in the cuda:11.4.0-devel-ubuntu20.04 container without any changes. The execution time is the wall time obtained with "h5dump -d stats nse_tgv_3d_p3-1.00.pyfrs". Another interesting thing is that when I force cuBLAS or rocBLAS, the performance with rocBLAS is rather low:

                 A100 (GiMMiK)   A100 (cuBLAS)   MI100 (GiMMiK)   MI100 (rocBLAS)
Runtime (s)      344.45          405.56          508.11           695.61
DoF              4.096e6         4.096e6         4.096e6          4.096e6
RHS              4.0e4           4.0e4           4.0e4            4.0e4
No. GPUs         1               1               1                1
s/DoF/RHS/GPU    2.102e-9        2.475e-9        3.101e-9         4.246e-9
GDoF/s           2.379           2.020           1.612            1.178

I wrote the wrong number before; in fact, the performance number with the MI100 is only 1.612 GDoF/s.

Can you use the wall time in the t = 0 data file to get the compile time? You can then subtract this from the wall time in the t = 1 solution file to get the true run time, just in case that is significant.
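
For example, something along these lines should do it (a minimal sketch with h5py; the name of the t = 0 snapshot and the section/key names inside the stats record are assumptions here, so check them against the h5dump output first):

# Minimal sketch: read the wall time stored in the 'stats' record of two PyFR
# solution files and subtract to separate setup/compile time from run time.
# File names and the 'solver-time-integrator'/'wall-time' keys are assumptions;
# confirm them with `h5dump -d stats <file>.pyfrs` for your PyFR version.
from configparser import ConfigParser

import h5py


def wall_time(fname):
    with h5py.File(fname, 'r') as f:
        raw = f['stats'][()]

    # The stats dataset is an INI-formatted string
    stats = ConfigParser()
    stats.read_string(raw.decode() if isinstance(raw, bytes) else raw)

    return stats.getfloat('solver-time-integrator', 'wall-time')


setup = wall_time('nse_tgv_3d_p3-0.00.pyfrs')   # written at t = 0
total = wall_time('nse_tgv_3d_p3-1.00.pyfrs')   # written at t = 1
print(f'setup: {setup:.2f} s, run: {total - setup:.2f} s')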

                 gimmik-cuda    cublas      gimmik-hip    rocblas
Setup time (s)   23.31          20.82       23.68         29.54
Wall time (s)    341.16         405.25      503.62        612.61
Run time (s)     317.85         384.43      479.94        583.07
DoF              4.10e6         4.10e6      4.10e6        4.10e6
RHS              4.00e4         4.00e4      4.00e4        4.00e4
No. GPUs         1              1           1             1
s/DoF/RHS/GPU    1.94e-9        2.35e-9     2.93e-9       3.56e-9
GDoF/s           2.577          2.131       1.707         1.405
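
For reference, the derived rows above follow from the measured times like this (a small sketch; the factor of 5 is the number of field variables in 3D Navier–Stokes, as discussed further down):

# Sketch of how the derived rows in the table are computed
ndof = 4.096e6   # grid degrees of freedom
nrhs = 4.0e4     # number of RHS evaluations over the run
nvar = 5         # field variables in 3D Navier-Stokes

def metrics(walltime, setuptime, ngpus=1):
    runtime = walltime - setuptime
    s_per_dof_rhs_gpu = runtime / (ndof*nrhs*ngpus)
    gdof_per_s = nvar*ndof*nrhs / runtime / 1e9
    return runtime, s_per_dof_rhs_gpu, gdof_per_s

# e.g. the gimmik-cuda column
print(metrics(341.16, 23.31))   # -> (317.85, 1.94e-09, 2.577...)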

The A100 is about 50% faster than the AMD MI100 here (2.577 / 1.707 ≈ 1.51).

@rishit2307 ran this test for me on an AMD Mi100 system and we got the following:

                       single     double
GDoF/s                 3.43       1.76
s/DoF(grid)/RHS/GPU    1.46e-9    2.84e-9

The first row includes a factor of 5 in the DoF for the number of variables in Navier–Stokes; the second row doesn't include this. So given the A100 results I reported above, the MI100 results we get are about where I would expect.
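
To make the relation between the two rows explicit, for the double precision column:

\frac{5}{2.84\times10^{-9}\,\mathrm{s/DoF(grid)/RHS/GPU}} \approx 1.76\times10^{9}\,\mathrm{DoF/s} = 1.76\,\mathrm{GDoF/s}.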

Are you running with the NaN checker? If so, read the full topic above.

That seems rather close to my results: @rishit2307 gets 1.76 GDoF/s and I get 1.707 GDoF/s, both in double precision.

I reran the case and got this:

                 A100 single    A100 double    MI100 single    MI100 double
Setup time (s)   23.85          23.31          22.9            24.68
Wall time (s)    194.74         341.16         269.6           501.28
Run time (s)     170.89         317.85         246.7           476.6
s/DoF/RHS/GPU    1.04e-9        1.94e-9        1.51e-9         2.91e-9
GDoF/s           4.794          2.577          3.321           1.719

The results with the A100 are better than yours?

Ok, I could decipher your previous numbers, but these are much clearer. It does seem your performance is about where we would expect.

There have been improvements to PyFR and the compilers since I ran this case on an A100, so it isnā€™t surprising that your newer performance numbers are better.

However, the results show that the performance on AMD hardware is still lagging. Considering that the A100's peak bandwidth (1.55 TB/s) is only about 30% higher than the MI100's (1.2 TB/s), the performance is not what I expected. Could it be related to the compiler? Or does the HIP code path need to be further optimized?

Without profiling the kernels running on the Mi100 it is hard to say exactly what is going on.

It could be that the compiler isn't optimising as well as nvrtc, it could be that the occupancy needs tweaks, it could be that there is some architecture quirk that we don't know about. It could also be that there are some optimisation flags that we should pass to hiprtc, but aren't. Sadly these options seem to be completely undocumented and the hiprtc source code isn't very helpful.

Does this .ini configuration correspond to the results in a published article?

The case is similar to some used in the following:

https://doi.org/10.1016/j.cpc.2020.107169
https://doi.org/10.1016/j.cpc.2021.108235
https://arxiv.org/abs/2111.07915
