TGV Performance Numbers

2.302 / 1.61 = 1.43, which seems close to 50%. Why do you say:

I also ran the case on an A100 and an MI100 (an AMD GPU). Their peak bandwidths are 1.55 TB/s and 1.2 TB/s respectively, but the performance numbers are 2.379 GDoF/s and 2.020 GDoF/s. The MI100 with HIP is really slow. Do you have any tests on AMD GPUs?

Sure, I eyeballed it and it didn’t quite seem as high as 43% higher, but that still means we are ~15% off from where we should be.

So are those numbers single and double precision for the Mi100? I have tested PyFR on the Mi100 previously and don’t remember the results being that far off. In other topics you were talking about forcing cuBLAS rather than GiMMiK. Are you doing something similar with rocBLAS?

How long are you running the cases for and how are you getting the execution time?

All tests were run up to t=1.0 with the NaN checker and integration plugins disabled, in double precision, using pyfr=1.12.3 in the cuda:11.4.0-devel-ubuntu20.04 environment without any changes. The execution time is simply the wall time obtained with "h5dump -d stats nse_tgv_3d_p3-1.00.pyfrs". Another interesting thing is that when I force cuBLAS or rocBLAS, the performance with rocBLAS is rather low.

                    A100          cublas        MI100         rocblas
Runtime (s)         344.45        405.56        508.11        695.61
DoF                 4.096e6       4.096e6       4.096e6       4.096e6
RHS                 4.0e4         4.0e4         4.0e4         4.0e4
No. GPUs            1             1             1             1
s/DoF/RHS/GPU       2.102e-9      2.475e-9      3.101e-9      4.246e-9
GDoF/s              2.379         2.020         1.612         1.178
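The arithmetic behind the last two rows can be sketched as follows. This is a minimal illustration for a single-GPU run, not PyFR code: the function name `perf_metrics` is made up here, and the default `n_vars=5` (the number of field variables in 3D Navier–Stokes) is an assumption inferred from the fact that the GDoF/s row equals 5 divided by the s/DoF/RHS/GPU row.

```python
def perf_metrics(runtime_s, n_dof, n_rhs, n_vars=5):
    """Return (s/DoF/RHS/GPU, GDoF/s) for a single-GPU run.

    n_vars=5 is an assumption: the GDoF/s figures in this thread appear
    to count each of the five field variables as a degree of freedom.
    """
    # Seconds per grid degree of freedom per right-hand-side evaluation
    s_per_dof = runtime_s / (n_dof * n_rhs)
    # Throughput in units of 1e9 degrees of freedom per second
    gdof_per_s = n_vars / s_per_dof / 1e9
    return s_per_dof, gdof_per_s


# A100 column of the table above
s_per_dof, gdof_per_s = perf_metrics(344.45, 4.096e6, 4.0e4)
print(f"{s_per_dof:.3e} s/DoF/RHS/GPU, {gdof_per_s:.3f} GDoF/s")
```

For the A100 column this gives roughly 2.10e-9 s/DoF/RHS/GPU and 2.38 GDoF/s, matching the table to rounding.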

I wrote the wrong number before. In fact, the performance with the MI100 is only 1.612 GDoF/s.

Can you use the wall time from the t=0 solution file to get the compile time? You can then subtract this from the wall time in the t=1 solution file to get the true run time, just in case it is significant.

                    gimmik-cuda   cublas        gimmik-hip    rocblas
Setup time (s)      23.31         20.82         23.68         29.54
Wall time (s)       341.16        405.25        503.62        612.61
Runtime (s)         317.85        384.43        479.94        583.07
DoF                 4.10E+06      4.10E+06      4.10E+06      4.10E+06
RHS                 4.00E+04      4.00E+04      4.00E+04      4.00E+04
No. GPUs            1             1             1             1
s/DoF/RHS/GPU       1.94E-09      2.35E-09      2.93E-09      3.56E-09
GDoF/s              2.577         2.131         1.707         1.405

The A100 is about 50% faster than the AMD MI100 (2.577 vs 1.707 GDoF/s), i.e. the MI100 is about 34% slower.
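The setup-time correction used in the table above can be sketched like this. It assumes the text of the `stats` dataset (as printed by h5dump) is INI-formatted with a `[solver-time-integrator]` section holding a `wall-time` key; check that against your own files. The helper names and the two stand-in strings are illustrative only, filled with the gimmik-cuda setup and wall times from the table.

```python
import configparser


def wall_time(stats_text):
    """Extract the wall time from the INI-style text of a stats dataset.

    Assumption: the text has a [solver-time-integrator] section with a
    'wall-time' key. Verify against your own h5dump output.
    """
    cfg = configparser.ConfigParser()
    cfg.read_string(stats_text)
    return cfg.getfloat("solver-time-integrator", "wall-time")


def true_runtime(stats_t0, stats_t1):
    """Wall time at the final output minus the wall time at t=0 (setup)."""
    return wall_time(stats_t1) - wall_time(stats_t0)


# Hand-written stand-ins for the stats text of the t=0 and t=1 files
stats_t0 = "[solver-time-integrator]\nwall-time = 23.31\n"
stats_t1 = "[solver-time-integrator]\nwall-time = 341.16\n"

print(f"{true_runtime(stats_t0, stats_t1):.2f} s")
```

With these values the result is 317.85 s, the Runtime entry for gimmik-cuda.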

@rishit2307 ran this test for me on an AMD Mi100 system and we got the following:

                        single        double
GDoF/s                  3.43          1.76
s/DoF(grid)/RHS/GPU     1.46e-9       2.84e-9

The first row includes a factor of 5 in the DoF for the number of variables in Navier–Stokes; the second row doesn’t include this. So given the A100 results I reported above, the Mi100 results we get are about where I would expect.

Are you running with the NaN checker? If so, read the full topic above.

That seems rather close to my results: @rishit2307 gets 1.76 GDoF/s and I get 1.707 GDoF/s, both in double precision.

I reran the case and got this:

                    A100 single   A100 double   Mi100 single  Mi100 double
Setup time (s)      23.85         23.31         22.9          24.68
Wall time (s)       194.74        341.16        269.6         501.28
Runtime (s)         170.89        317.85        246.7         476.6
s/DoF/RHS/GPU       1.04E-09      1.94E-09      1.51E-09      2.91E-09
GDoF/s              4.794         2.577         3.321         1.719

So my A100 results are better than yours?

Ok, I could decipher your previous numbers, but these are much clearer. It does seem your performance is about where we would expect.

There have been improvements to PyFR and the compilers since I ran this case on an A100, so it isn’t surprising that your newer performance numbers are better.

However, the results show that the performance on AMD hardware is not improving. The peak bandwidth of the Mi100 is only about 23% lower than that of the A100 (Mi100: 1.2 TB/s, A100: 1.55 TB/s; equivalently, the A100’s is about 29% higher), so the performance is not what I expected. Could it be related to the compiler? Or does the HIP backend need to be further optimized?
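One way to put numbers on this is to normalise throughput by peak bandwidth. The sketch below assumes the solver is memory-bandwidth bound, so ideal performance would scale with peak bandwidth; that is an assumption of this comparison, not something measured here. The inputs are the double-precision GDoF/s and peak bandwidth figures quoted in this thread.

```python
# Peak memory bandwidth (TB/s) and double-precision throughput (GDoF/s)
# as quoted in this thread
runs = {
    "A100": {"bw_tb_s": 1.55, "gdof_s": 2.577},
    "MI100": {"bw_tb_s": 1.2, "gdof_s": 1.719},
}


def per_bandwidth(run):
    """GDoF/s achieved per TB/s of peak bandwidth."""
    return run["gdof_s"] / run["bw_tb_s"]


# MI100 relative to A100
bw_ratio = runs["MI100"]["bw_tb_s"] / runs["A100"]["bw_tb_s"]
perf_ratio = runs["MI100"]["gdof_s"] / runs["A100"]["gdof_s"]
# Below 1 means the MI100 uses its peak bandwidth less effectively
efficiency_ratio = per_bandwidth(runs["MI100"]) / per_bandwidth(runs["A100"])

print(f"bandwidth ratio:     {bw_ratio:.2f}")
print(f"performance ratio:   {perf_ratio:.2f}")
print(f"relative efficiency: {efficiency_ratio:.2f}")
```

Under that assumption the Mi100 has about 0.77 of the A100's bandwidth but delivers about 0.67 of its throughput, so it reaches roughly 86% of the A100's bandwidth-normalised performance.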

Without profiling the kernels running on the Mi100 it is hard to say exactly what is going on.

It could be that the compiler isn’t optimising as well as nvrtc, it could be that the occupancy needs tweaks, it could be that there is some architecture quirk that we don’t know about. It could also be that there are some optimisation flags that we should pass to hiprtc, but aren’t. Sadly these options seem to be completely undocumented and the hiprtc source code isn’t very helpful.