Uncoalescing data format has a bad impact on the GPU kernel performance

Guangz · 15 September 2021 08:36

Dear developers, I am testing some cases with PyFR, and I found that uncoalesced data layouts can hurt the performance of PyFR's GPU kernels.

For example, a roofline analysis shows that the intcflux kernel is memory bound, so its performance is mainly limited by the memory bandwidth of the GPU. When I analysed it with profiling tools, I found that its bandwidth efficiency is only about 40% and its run time is about 7 ms.

I think the problem is due to uncoalesced memory access. The intcflux kernel has 11 input parameters, of which *ul_vix, *ur_vix and *gradul_vix are index arrays into *ul_v, *ur_v and *gradul_v. These arrays are stored in AoSoA format. But when I dumped the ul_vix data, I found it is not completely contiguous. In the image, some numbers seem to be missing at the position of the red arrow. The same happens with *ur_vix and *gradul_vix. This could waste bandwidth when the threads read data in *ul_v, *ur_v and *gradul_v from cache.

I then generated a contiguous index array to use as a unit test for the kernel. The bandwidth efficiency rose to about 80% and the run time dropped to about 2 ms, though the result is obviously wrong.
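To illustrate why gaps in an index array can cost bandwidth, here is a minimal sketch in plain Python. It is not PyFR code. It assumes a simplified model in which a warp of 32 threads reads 8-byte values and memory is fetched in 128-byte segments; the function name, the example index arrays and these sizes are all assumptions made for illustration.

```python
def segments_per_warp(indices, warp_size=32, elem_bytes=8, segment_bytes=128):
    """Average number of distinct memory segments touched per warp.

    Simplified model (an assumption, not a hardware specification):
    thread t of a warp reads the element at indices[t], and every
    distinct segment touched costs one memory transaction.
    """
    total = 0
    n_warps = 0
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        segments = {(i * elem_bytes) // segment_bytes for i in warp}
        total += len(segments)
        n_warps += 1
    return total / n_warps


# Contiguous table: thread k reads element k.
contiguous = list(range(64))

# Hypothetical table with gaps: runs of 8 consecutive indices,
# each run starting 32 elements after the previous one.
gappy = [32 * (k // 8) + (k % 8) for k in range(64)]

print(segments_per_warp(contiguous))
print(segments_per_warp(gappy))
```

In this toy model both tables deliver the same number of useful bytes per warp, but the table with gaps touches twice as many segments, so half of the fetched data is unused.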

The same thing happens in the intconu kernel. Do the indices have to be non-contiguous? If they could be made contiguous for the AoSoA data format, performance would improve considerably.

fdw · 15 September 2021 12:39

It is not possible to make the accesses all coalesced. This is a simple consequence of the fact that PyFR supports unstructured grids and thus there is no simple relation between the left- and right-hand side indices at an interface between two elements. Hence a table is needed (the _vix arrays) with the entries being determined by the mesh itself. It follows that the indices cannot be changed arbitrarily without causing the code to give an incorrect answer.
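As a rough illustration of this point, the sketch below gathers left and right values through a pair of index tables and computes a jump at each interface. It is not PyFR code: the array contents, the table entries and the function name are hypothetical, chosen only to show that swapping the tables for contiguous indices changes the answer.

```python
# Hypothetical solution values, one per flux point.
u = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

# Hypothetical interface tables: interface k couples
# u[left_ix[k]] with u[right_ix[k]]. In a real solver these
# entries would come from the mesh connectivity.
left_ix = [0, 1, 4]
right_ix = [5, 3, 2]


def interface_jumps(u, left_ix, right_ix):
    """Jump (right minus left) across each interface."""
    return [u[r] - u[l] for l, r in zip(left_ix, right_ix)]


correct = interface_jumps(u, left_ix, right_ix)

# Contiguous tables give coalesced reads, but they pair up the wrong points.
contiguous = interface_jumps(u, [0, 1, 2], [3, 4, 5])

print(correct)
print(contiguous)
```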

Regards, Freddie.

WillT · 15 September 2021 13:36

Just to add my two cents and reiterate what Freddie said: this is a known difficulty with these sorts of methods. Furthermore, the performance of the interface kernels is quite case dependent. For example, I just profiled a case and saw 86% bandwidth utilisation on intcflux.

If you are looking to make an improvement to PyFR, however, something Freddie and I have discussed is changing the ordering of the faces. It can have a noticeable impact on performance, but we currently aren’t completely sure of the best algorithm to use for the ordering.

Guangz · 16 September 2021 02:03

Thank you very much. May I ask whether the case you tested used a structured mesh?

WillT · 16 September 2021 13:30

The mesh was simply a cube for the TGV test case, so yes it would’ve been structured, but PyFR is solely unstructured. As a result, the simple algorithm we currently use to order the interfaces was probably ok for this case, but in general, it isn’t optimal. You can easily come up with a pathological case to show that it isn’t optimal.

I used this case to show that there can be quite a lot of variation in the bandwidth of the interface kernels and that, according to the profiler, the memory accesses were coalesced here. But as you highlight, this isn’t always the case.

Guangz · 17 September 2021 09:42

Thank you very much! You are right, I tested with the TGV case and found that the bandwidth efficiency has been greatly improved.

Guangz · 8 November 2021 09:49

Hi Will, I’d like to know more about the reordering of the faces. Can you tell me why reordering can benefit performance? I’d also like to find out whether there is anything I can do. Thanks very much!

WillT · 8 November 2021 10:35

The interface calculation kernel draws values from the left and right faces. This data is unstructured, with values potentially separated by large strides in memory. However, when a value is fetched, other values from the same memory line will be cached. In a pathological case you could order the point pairs such that (nearly) every interface calculation causes a cache miss. However, there will be optimal arrangements of points that maximise the cache hits. The tricky thing is working out how to find this arrangement algorithmically. And of course you can’t simply rearrange the left and right sides independently, as you have to preserve the correct communication at the interfaces.
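A toy model can make the effect of ordering concrete. The sketch below is an illustration only, not PyFR's algorithm and not a model of a real GPU cache. It assumes a tiny LRU cache of 4 lines, 16 values per line, and a hypothetical set of 256 left/right point pairs. It counts cache misses for two orderings of the same pairs, and each pair is kept together so the interface communication is preserved.

```python
from collections import OrderedDict


def count_misses(pairs, line_elems=16, cache_lines=4):
    """Count misses in a toy LRU cache while visiting (left, right) pairs."""
    cache = OrderedDict()  # keys are line numbers, least recently used first
    misses = 0
    for left, right in pairs:
        for idx in (left, right):
            line = idx // line_elems
            if line in cache:
                cache.move_to_end(line)  # mark as most recently used
            else:
                misses += 1
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return misses


n = 256
# Hypothetical pairing: left point k is coupled with right point n + k.
pairs = [(k, n + k) for k in range(n)]

# Same pairs, visited with a stride of 16 so that consecutive
# pairs always fall on different memory lines.
strided = [pairs[(j % 16) * 16 + j // 16] for j in range(n)]

print(count_misses(pairs))
print(count_misses(strided))
```

In this model the sequential ordering misses once per memory line, while the strided ordering misses on every access, even though both visit exactly the same pairs.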

Tools such as OP-DSL were designed with this problem in mind. OP-DSL is mainly aimed at unstructured FV solvers, but some of their technology might be a good place to start if you’re interested. That said, the OP-DSL project seems to be a bit dormant from what I’ve seen and heard.
