I searched the forum and found no discussion on using FP16 in PyFR.
I just wonder: is it possible to use FP16 in PyFR?
Or, a more general question: is it possible to use FP16 in CFD at all, especially with the FR scheme?
I’m not talking about the linear solver. Some papers discuss applying FP16 to the linear solver of a CFD system, for example in the preconditioner or in a Krylov solver with iterative refinement.
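The iterative refinement idea mentioned above can be sketched in a few lines of numpy: solve cheaply in low precision, then correct the residual in high precision. This is a toy illustration with made-up names, not anything from the papers or from PyFR.

```python
import numpy as np

# Toy iterative refinement: inner solves use an FP16-rounded operator,
# residuals are corrected in FP64. Illustrative only.
rng = np.random.default_rng(0)
n = 50
A = np.eye(n) + 0.01 * rng.standard_normal((n, n))  # well-conditioned system
b = rng.standard_normal(n)

A16 = A.astype(np.float16)  # low-precision copy of the operator

def solve_low(rhs):
    # Stand-in for a cheap low-precision inner solver: the data is rounded
    # to FP16 (numpy's solve itself needs at least FP32 to run).
    return np.linalg.solve(A16.astype(np.float32),
                           rhs.astype(np.float16).astype(np.float32))

x = solve_low(b).astype(np.float64)
for _ in range(5):
    r = b - A @ x                          # residual in FP64
    x = x + solve_low(r).astype(np.float64)  # low-precision correction

res = np.linalg.norm(b - A @ x)
print(res)  # far below what a single FP16 solve achieves
```

The point is that the high-precision residual recovers the accuracy that the low-precision inner solve throws away, which is why this trick works for linear solvers but does not carry over directly to an explicit flux evaluation.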
PyFR has the AIAA Journal paper “Impact of Number Representation for High-Order Implicit Large-Eddy Simulations”, which compares single and double precision. It shows that single precision in PyFR works well in some cases.
So how about FP16?
On the whole I would say that FP16 in CFD isn’t a good idea, for two reasons. Firstly, with the standard FP16 representation (not BF16 or TF32) the epsilon is \sim 4\times10^{-4}, which leaves very little precision. As an example, some people run simulations at atmospheric pressure, which is 101325 Pa. (Note: I don’t recommend this for FP32, let alone FP16.) Under these conditions you can easily encounter those rounding errors. Considering this alone, however, you could possibly make a case for running the far field at half precision.
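To make the first point concrete, here is a quick numpy check of what IEEE binary16 can actually represent:

```python
import numpy as np

# FP16 (IEEE binary16) has ~3 significant decimal digits and max value 65504.
print(np.finfo(np.float16).eps)   # 0.0009765625: spacing at 1.0
print(np.finfo(np.float16).max)   # 65504.0

# Atmospheric pressure in Pa does not even fit in FP16's range:
print(np.float16(101325.0))       # inf (overflow)

# Near the top of the range the spacing between representable values is 32,
# so a small pressure fluctuation is rounded away entirely:
p = np.float16(60000.0)
print(p + np.float16(10.0) == p)  # True
```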
This brings me to the second point. Currently, if you want to use half precision, what you have to do is read in the FP32 value, store it in an FP16 register (probably packing two values into 32 bits), perform the operation at FP16, then unpack and write back to memory at FP32. The current bottleneck for our calculation is memory bandwidth, and I encourage you to run the Nsight profiler on PyFR. The problem is that FP16 doesn’t solve this issue; in fact it will probably make it worse, as the only time you gain is from the FP16 FMAs being quicker, so you will probably spend even longer waiting for global loads. All that being said, you could maybe fix some of these problems, but in my experimentation it doesn’t really seem possible.
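The pack/compute/unpack round trip described above can be mimicked in numpy (on a GPU these would be cvt and pack instructions rather than array views):

```python
import numpy as np

# Sketch of the FP32 -> packed FP16 -> FP32 round trip. The values 0..7 are
# exactly representable in FP16, so the round trip is lossless here.
a32 = np.arange(8, dtype=np.float32)

# Convert to FP16 and pack pairs of halves into 32-bit words:
a16 = a32.astype(np.float16)
packed = a16.view(np.uint32)          # two FP16 values per 32-bit word
print(packed.nbytes, a32.nbytes)      # 16 vs 32: half the register storage...

# ...but if the arrays in memory stay FP32, the data was still loaded at
# 4 bytes per value, and the result is unpacked and written back at FP32:
result32 = packed.view(np.float16).astype(np.float32)
print(np.array_equal(result32, a32))  # True
```

This is the crux of the bandwidth argument: packing only shrinks what is in registers, not the FP32 traffic to and from global memory.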
So the conclusion is that FP16 isn’t really precise enough for CFD calculations, except in some very specific circumstances, and even there a better mesh would probably help more. And even if you were to use FP16, it probably would not speed things up, as our workflow is different from its designed use case and we are generally constrained by memory bandwidth.
If you want to play around, I wrote a version of GiMMiK that generates PTX rather than C. Within this I added support for FP16 and BF16, although I didn’t test it very extensively. If I remember correctly, TF32 is only available within the tensor cores.
Thank you for the answer.
I confirm that for now PyFR does not support FP16.
FP16 definitely loses too much precision if we use it across the whole CFD program. However, looking at this from the RK framework, a combination of FP16 and FP64/FP32 might bring a speedup without losing too much accuracy in the simulation. Some people have proposed a mixed-precision explicit RK scheme, and I think it is worth a try. But it seems PyFR is not suitable for a demo case, so I would write a small 2D FR code as a demo.
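As a toy version of that idea, here is a classical RK4 step on $y' = -y$ where the stage derivatives are rounded to FP16 but the state and the stage accumulation stay in FP64. This is only a sketch of the general mixed-precision RK concept, not the specific scheme from any paper:

```python
import numpy as np

def rhs_fp16(y):
    # Stand-in for an expensive flux evaluation done at half precision:
    # the input and output are both rounded through FP16.
    return np.float64(np.float16(-np.float16(y)))

def rk4_step(y, h, f):
    # Stages use the (low-precision) rhs; accumulation stays in FP64.
    k1 = f(y)
    k2 = f(y + 0.5 * h * k1)
    k3 = f(y + 0.5 * h * k2)
    k4 = f(y + h * k3)
    return y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

y, h = 1.0, 0.01
for _ in range(100):          # integrate to t = 1
    y = rk4_step(y, h, rhs_fp16)

err = abs(y - np.exp(-1.0))
print(err)  # dominated by the FP16 stage evaluations, not RK4 truncation
```

Even on this trivial problem the error settles around the FP16 rounding level rather than the RK4 truncation level, which is the trade-off any mixed-precision scheme has to manage.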
The load-FP32/store-FP32-but-compute-FP16 process is not required on Armv8.2-A. If a GPU or other processing unit supports FP16 in its instruction set, loading and storing at FP32 is not needed, and loading/storing FP16 directly then also reduces the memory traffic.