Dear developers, I have been testing some cases with PyFR and found that non-coalesced memory access can hurt the performance of some of its GPU kernels.
For example, a roofline analysis shows that the intcflux kernel is memory bound, so its performance is mainly limited by the GPU's memory bandwidth. When I profiled it, its bandwidth efficiency was only about 40% and its duration was about 7 ms.
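For reference, this is roughly how I am using the terms "memory bound" and "bandwidth efficiency". It is only a sketch: the function names are my own, and the numbers in the example calls are illustrative placeholders, not measurements from PyFR or from any particular GPU.

```python
def is_memory_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline test: a kernel is memory bound when its arithmetic
    intensity (FLOP per byte) lies below the machine's ridge point."""
    intensity = flops / bytes_moved        # FLOP / byte of the kernel
    ridge = peak_flops / peak_bandwidth    # FLOP / byte of the hardware
    return intensity < ridge


def bandwidth_efficiency(bytes_moved, seconds, peak_bandwidth):
    """Achieved bandwidth as a fraction of the peak bandwidth."""
    return bytes_moved / seconds / peak_bandwidth


# Illustrative placeholder values only (not profiler output).
memory_bound = is_memory_bound(flops=1.0e9, bytes_moved=4.0e9,
                               peak_flops=7.0e12, peak_bandwidth=1.0e12)
efficiency = bandwidth_efficiency(bytes_moved=5.0e9, seconds=0.01,
                                  peak_bandwidth=1.0e12)
```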
I think the problem is caused by non-coalesced access. The intcflux kernel has 11 input parameters, of which *ul_vix, *ur_vix and *gradul_vix are index arrays into *ul_v, *ur_v and *gradul_v. These arrays are stored in AoSoA format. However, when I dumped the ul_vix data, I found that the indices are not completely contiguous. As the screenshot shows, some index values appear to be missing at the position marked by the red arrow. The same is true of *ur_vix and *gradul_vix. This could waste bandwidth when the threads read *ul_v, *ur_v and *gradul_v through the cache.
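To make the concern concrete, here is a rough sketch of how gaps in an index array can translate into extra memory traffic. It is not PyFR code: `find_gaps` and `sectors_per_warp` are hypothetical helpers, the index values are made up, and I am assuming 8-byte (double precision) elements, 32 threads per warp and 32-byte memory sectors.

```python
def find_gaps(indices):
    """Return (position, previous, current) for every entry that is not
    exactly one larger than the entry before it."""
    return [(k, a, b)
            for k, (a, b) in enumerate(zip(indices, indices[1:]), start=1)
            if b != a + 1]


def sectors_per_warp(indices, elem_bytes=8, warp_size=32, sector_bytes=32):
    """For each warp-sized chunk of the index array, count how many
    distinct memory sectors a gather through those indices touches."""
    counts = []
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        sectors = {(i * elem_bytes) // sector_bytes for i in warp}
        counts.append(len(sectors))
    return counts


# Made-up index arrays, one warp (32 entries) each.
contiguous = list(range(32))                       # no gaps
gapped = list(range(0, 16)) + list(range(41, 57))  # one misaligned jump
strided = list(range(0, 128, 4))                   # every 4th element

gaps = find_gaps(gapped)
traffic = {
    "contiguous": sectors_per_warp(contiguous),
    "gapped": sectors_per_warp(gapped),
    "strided": sectors_per_warp(strided),
}
```

Under these assumptions the contiguous warp touches the minimum of 8 sectors, a single misaligned jump adds one more, and a stride of 4 touches 32 sectors for the same amount of useful data. So whether the gaps matter depends on how large and how well aligned they are.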
I then generated a contiguous index array and used it to unit-test the kernel. With it, the bandwidth efficiency rose to about 80% and the run time dropped to about 2 ms, although the results are of course wrong.
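The idea behind the replacement index array can be sketched as follows. This is a simplified illustration using a plain Python list instead of the real index buffer; `contiguous_like` is a hypothetical helper and the index values are made up.

```python
def contiguous_like(indices):
    """Build a gap-free index list with the same length as `indices`,
    starting from its smallest value. For benchmarking only."""
    start = min(indices)
    return list(range(start, start + len(indices)))


# Made-up original indices with a jump after the first four entries.
original = [0, 1, 2, 3, 10, 11, 12, 13]
replacement = contiguous_like(original)

# The replacement addresses different elements than the original,
# which is why the kernel output is no longer valid.
same_elements = set(original) == set(replacement)
```

Because the replacement points at different elements, it only tells us how fast the kernel could run with perfectly coalesced reads; it is not a drop-in fix.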
The same thing happens in the intconu kernel. Do the indices have to be non-contiguous? If they could be made contiguous in the AoSoA layout, performance would improve considerably.