Hyperbolic diffusion and kernel fusion

A paper of ours recently got accepted to CPC on the topic of hyperbolic diffusion and the optimisation opportunities it presents. See here.

Hyperbolic diffusion is a method where you transform the second-order diffusion terms into first-order terms by adding additional equations. The idea was first proposed by Hiro Nishikawa; if you haven't come across him before, I really recommend checking out his website, http://www.cfdbooks.com/, as well as his papers.
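As a minimal sketch of the idea (for the scalar diffusion equation \partial_t u = \nu\nabla^2 u, rather than the full ACM system in the paper), the second-order equation is replaced by the first-order system

\partial_t u = \nu\,\nabla\cdot\boldsymbol{q}, \qquad T_r\,\partial_t \boldsymbol{q} = \nabla u - \boldsymbol{q},

where T_r is a relaxation time: as the auxiliary variable relaxes, \boldsymbol{q} \to \nabla u and the original diffusion equation is recovered.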

A main benefit of hyperbolic diffusion is that the stiffness of the resulting system scales with 1/h rather than 1/h^2, so the explicit time-step limit scales with h rather than h^2. This is of course very helpful for high Reynolds number flows. The additional benefit from an FR point of view is that you no longer have to do all the steps of a diffusion scheme, as the system is now purely advective. The downside is that you end up with quite a few equations; we focused on ACM, for which you get 13 equations in 3D rather than 4 (the pressure and three velocity components, plus nine auxiliary variables for the components of the velocity gradient).

Once you have a purely advective system, however, there are some optimisations you can make in PyFR. The largest is that, normally, you calculate the flux divergence via the following steps

U \rightarrow F \rightarrow \nabla\cdot F

where the intermediate F is calculated, written to global memory, and then read back in for the final \nabla\cdot F step. For diffusion systems you have to do this, but for advection it is wasted bandwidth as F is never used again. So what we did was fuse these kernels together, and given the structure of the matrix used for the divergence it made most sense to do this for tensor-product elements only.
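To make the fusion concrete, here is a toy sketch in plain CUDA (hypothetical kernels for a 1D problem with a dense divergence operator, not the actual tensor-product GiMMiK kernels): the unfused path writes F to global memory and reads it back, whereas the fused kernel keeps F in registers.

```cuda
// Toy sketch only: hypothetical kernels for a 1D problem with N solution
// points per element and a dense divergence operator dmat; the real
// implementation works on tensor-product elements via GiMMiK.
#define N 4

// Unfused path, step 1: pointwise flux kernel writes F to global memory.
__global__ void eval_flux(const double *u, double *f, int nelem)
{
    int e = blockIdx.x*blockDim.x + threadIdx.x;
    if (e >= nelem) return;

    for (int j = 0; j < N; ++j)
        f[e*N + j] = 0.5*u[e*N + j]*u[e*N + j];   // e.g. a Burgers-type flux
}

// Unfused path, step 2: divergence kernel reads F back from global memory.
__global__ void div_flux(const double *dmat, const double *f,
                         double *divf, int nelem)
{
    int e = blockIdx.x*blockDim.x + threadIdx.x;
    if (e >= nelem) return;

    for (int i = 0; i < N; ++i) {
        double acc = 0;
        for (int j = 0; j < N; ++j)
            acc += dmat[i*N + j]*f[e*N + j];
        divf[e*N + i] = acc;
    }
}

// Fused version: the flux only ever lives in registers, so the intermediate
// global write and read of F disappear entirely.
__global__ void fused_div_flux(const double *dmat, const double *u,
                               double *divf, int nelem)
{
    int e = blockIdx.x*blockDim.x + threadIdx.x;
    if (e >= nelem) return;

    double f[N];
    for (int j = 0; j < N; ++j)
        f[j] = 0.5*u[e*N + j]*u[e*N + j];

    for (int i = 0; i < N; ++i) {
        double acc = 0;
        for (int j = 0; j < N; ++j)
            acc += dmat[i*N + j]*f[j];
        divf[e*N + i] = acc;
    }
}
```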

The majority of the hard work was done by adding functionality to GiMMiK. This included an interesting memory manager to automatically handle the use of shared memory, as well as to allow for automatic use of the new asynchronous shared-memory copy instructions on NVIDIA hardware. As a side note, the GiMMiK framework is a really powerful tool: you can do things like this kernel fusion, but also things like generating PTX, CUDA's intermediate assembly, rather than CUDA C. See here for some features that might get mainlined one day.
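To give a flavour of the asynchronous copy feature, here is a standalone CUDA sketch of the kind of shared-memory staging the memory manager automates (names and sizes are made up, and this is not GiMMiK's actual output), using the cooperative-groups memcpy_async API available from CUDA 11.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

#define NROW 8
#define NCOL 8

// Apply a small NROW x NCOL operator to n columns of data, staging the
// operator into shared memory with an asynchronous copy so that on
// Ampere-class hardware the load bypasses the register file.
__global__ void apply_op(const double *op, const double *x, double *y, int n)
{
    __shared__ double s_op[NROW*NCOL];

    auto block = cg::this_thread_block();

    // Launch the copy and wait for it; the wait also acts as a block-wide
    // barrier, so it must come before any thread bails out early.
    cg::memcpy_async(block, s_op, op, sizeof(double)*NROW*NCOL);
    cg::wait(block);

    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (col >= n) return;

    for (int i = 0; i < NROW; ++i) {
        double acc = 0;
        for (int j = 0; j < NCOL; ++j)
            acc += s_op[i*NCOL + j]*x[j*n + col];
        y[i*n + col] = acc;
    }
}
```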

The top-line result was that, after a bit of work, optimised hyperbolic diffusion was 2.3\times to 2.6\times faster for the Taylor-Green vortex (TGV) than regular ACM.

P.S. Keep your eyes out for recent work comparing ACM, ACM-HD, and the alternative EDAC (entropically damped artificial compressibility) method.


Has this method been added to PyFR? If so, from which version?

No, this is not a mainline feature of PyFR. I implemented it on my fork of PyFR, which can be found here: GitHub - WillTrojak/PyFR at feature/tensor

and it would need a custom version of GiMMiK: GitHub - WillTrojak/GiMMiK at sparse

However, this is now three years old, so it probably won’t work anymore, nor was it intended to be used by anyone but me. This is all a way of me saying, I can’t really support you if you want to use it. But you can have a look to see how things were implemented.

Too bad. Why not add it as a feature to PyFR? I think it would be a great addition!

Maintaining features requires a significant amount of work on our part. Thus, we do not normally add things unless there are very compelling advantages.

Regards, Freddie.

Given the complexity of the implementation and operation of this particular feature, the win wasn’t compelling enough to mainline it. However, if that were to change, we would definitely consider it.