Hyperbolic diffusion and kernel fusion

A paper of ours was recently accepted to CPC on the topic of hyperbolic diffusion and the optimisation opportunities it presents. See here.

Hyperbolic diffusion is a method in which you transform the second-order diffusion terms into first-order terms by adding extra equations. The idea was first proposed by Hiro Nishikawa. If you haven't come across him before, I really recommend checking out his website, http://www.cfdbooks.com/, as well as his papers.
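As a minimal illustration of the idea, here is the 1D scalar diffusion equation rather than the ACM system from the paper. The symbols q (the gradient variable), T_r (a relaxation time) and \tau (pseudo-time) are notation assumed for this sketch:

```latex
% Second-order diffusion operator we want to avoid discretising directly
\nu \, \partial_{xx} u

% First-order (hyperbolic) system: q is an extra unknown that relaxes to u_x
\partial_\tau u = \nu \, \partial_x q
\partial_\tau q = \frac{1}{T_r} \left( \partial_x u - q \right)

% At pseudo-steady state q = \partial_x u, so \nu \partial_x q = \nu \partial_{xx} u.
% The system has real wave speeds \pm \sqrt{\nu / T_r}, i.e. it is advective.
```

One extra equation is needed per gradient component, which is why the equation count grows quickly in 3D.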

A main benefit of hyperbolic diffusion is that the stiffness of the resulting system scales with 1/h rather than 1/h^2, so the stable explicit step size scales with h rather than h^2. This is of course very helpful for high Reynolds number flows. The additional benefit from an FR point of view is that you no longer have to do all the steps for a diffusion system, as it's now purely advective. The downside is that you end up with quite a few equations: we focused on ACM, for which you get 13 equations in 3D (rather than 4).
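To make the scaling concrete, here is a rough sketch of how the explicit step size limits behave under mesh refinement. The function names, the unit CFL constant and the sample values are assumptions for illustration only, not anything taken from PyFR or the paper:

```python
def advective_dt(h, wave_speed, cfl=1.0):
    """Step size limit for a first-order (advective) system: dt ~ h / a."""
    return cfl * h / wave_speed


def diffusive_dt(h, nu, cfl=1.0):
    """Step size limit for a second-order (diffusive) system: dt ~ h^2 / nu."""
    return cfl * h * h / nu


if __name__ == "__main__":
    # Hypothetical values: unit wave speed and unit viscosity.
    for h in (0.1, 0.05, 0.025):
        print(f"h={h:<6} advective dt={advective_dt(h, 1.0):.6f} "
              f"diffusive dt={diffusive_dt(h, 1.0):.6f}")
```

Each halving of h halves the advective limit but quarters the diffusive one, so the gap widens as the mesh is refined.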

Once you have a purely advective system, however, there are some optimisations you can make to PyFR. The largest comes from the fact that you normally calculate the flux divergence by doing the following steps:

U \rightarrow F \rightarrow \nabla\cdot F

where the intermediate F is calculated, written to global memory, and then read back in for the final \nabla\cdot F step. For diffusion systems you have to do this, but for advection it is wasted bandwidth, as we don't use F again. So we fused these kernels together and, given the matrix used for the divergence, it made most sense to do this for tensor product elements only.
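The difference between the two approaches can be sketched in plain Python. This is only a toy model of the data flow, not the PyFR or GiMMiK kernels: the Burgers-like flux, the 3-point differentiation matrix and all function names are assumptions made for the example.

```python
def flux(u):
    # Hypothetical pointwise advective flux, chosen only for illustration.
    return 0.5 * u * u


def matvec(D, f):
    # Apply the (dense) divergence operator D to one element's flux values.
    return [sum(d * x for d, x in zip(row, f)) for row in D]


def div_flux_unfused(D, U):
    # "Kernel 1": compute F for every element and store the whole array,
    # standing in for the write to global memory.
    F = [[flux(u) for u in elem] for elem in U]
    # "Kernel 2": read F back and apply the divergence operator.
    return [matvec(D, f) for f in F]


def div_flux_fused(D, U):
    out = []
    for elem in U:
        # F only ever exists in a small per-element scratch buffer,
        # standing in for registers or shared memory on a GPU.
        f_local = [flux(u) for u in elem]
        out.append(matvec(D, f_local))
    return out


# Differentiation matrix for quadratic Lagrange polynomials on nodes -1, 0, 1.
D = [[-1.5, 2.0, -0.5],
     [-0.5, 0.0, 0.5],
     [0.5, -2.0, 1.5]]

# Two elements with three solution points each (made-up data).
U = [[0.0, 0.5, 1.0],
     [1.0, 2.0, 3.0]]

if __name__ == "__main__":
    print(div_flux_unfused(D, U))
    print(div_flux_fused(D, U))
```

Both versions produce the same result; the fused one simply never materialises the full F array, which is where the bandwidth saving comes from on real hardware.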

The majority of the hard work was done by adding functionality to GiMMiK. This included an interesting memory manager that automatically handles the use of shared memory and also allows for automatic use of the new asynchronous shared memory copy commands on NVIDIA hardware. As a side note, the GiMMiK framework is a really powerful tool: you can do things like this kernel fusion, but also things like generating PTX, the CUDA intermediate assembly, rather than CUDA C. See here for some features that might get mainlined one day.

The top line result was that, after a bit of work, optimised hyperbolic diffusion was 2.3\times-2.6\times faster for the TGV than regular ACM.

P.S. Keep your eyes out for some recent work comparing ACM, ACM-HD and the alternative EDAC method.
