Question about GiMMiK FLOPs and arithmetic intensity

I read the paper on the GiMMiK package, "GiMMiK—Generating bespoke matrix multiplication kernels for accelerators: Application to high-order Computational Fluid Dynamics" (ScienceDirect). When we use a high-order scheme, the kernels generated by GiMMiK multiply by a very large operator matrix with more than 1000 non-zero entries, so I expected these kernels to be compute bound. However, when I run PyFR on the Cylinder-3D case from "On the utility of GPU accelerated high-order methods for unsteady flow simulations: A comparison with industry-standard tools" (ScienceDirect), the GiMMiK kernel only reaches an arithmetic intensity (AI) of 2.

Why is that?

The actual amount of computation is still very small: for each entry you perform only a few FMAs, and this is what the arithmetic intensity reflects. (If you were computing several sines or cosines per point it might be a different matter.) On the memory side of things, however, you have to read in quite a few values and write out quite a lot.
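To put rough numbers on this, here is a minimal sketch of the FLOP and byte counts for a product C = A B where A is a constant sparse operator. It assumes that A is baked into the generated kernel, so only B and C touch global memory, and that the data is double precision. The function name and the operator size (96 x 64 with 1280 non-zeros) are hypothetical, chosen for illustration and not taken from the Cylinder-3D case.

```python
def arithmetic_intensity(m, k, nnz, bytes_per_value=8):
    """FLOP per byte for C = A @ B, counted per column of B.

    Assumes the m-by-k operator A is baked into the kernel, so only
    B (k values read) and C (m values written) touch global memory.
    """
    flops = 2 * nnz                          # one multiply and one add per non-zero of A
    bytes_moved = bytes_per_value * (k + m)  # read a column of B, write a column of C
    return flops / bytes_moved


# Hypothetical operator: 96 x 64 with 1280 non-zeros, double precision.
ai_sparse = arithmetic_intensity(m=96, k=64, nnz=1280)
# Same shape but fully dense, as an upper limit for this shape.
ai_dense = arithmetic_intensity(m=96, k=64, nnz=96 * 64)

print(f"sparse: {ai_sparse:.2f} FLOP/byte, dense: {ai_dense:.2f} FLOP/byte")
```

Under these assumptions the intensity scales with the number of non-zeros divided by the number of rows plus columns, so more than 1000 non-zeros does not by itself imply a high arithmetic intensity.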

We ensure that the memory access is coalesced to maximise the computation from each read, but the reads are still slow and there is not enough work to occupy threads between reads. If you look at the stall reports, you'll probably see that threads are stalled a lot of the time waiting on global memory.

Maybe have a look at this paper for more info; you can do quite a lot of computation while remaining bandwidth bound. Also see this insightful paper for the V100.
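A simple roofline estimate illustrates the point about doing a lot of computation while staying bandwidth bound. The sketch below is not taken from either paper. The peak figures are approximate nominal values for a V100 (about 7.8 TFLOP/s in double precision and about 900 GB/s of memory bandwidth) and should be treated as assumptions to replace with the numbers for your own device.

```python
def attainable_flops(ai, peak_flops, peak_bandwidth):
    """Roofline model: rate is capped by compute or by ai * bandwidth."""
    return min(peak_flops, ai * peak_bandwidth)


# Approximate nominal V100 figures (assumptions, check your device specs).
PEAK_FLOPS = 7.8e12      # double-precision FLOP/s
PEAK_BANDWIDTH = 900e9   # bytes/s

# Intensity at which the two limits meet (the roofline "ridge").
ridge = PEAK_FLOPS / PEAK_BANDWIDTH

ai = 2.0  # the arithmetic intensity reported in the question
rate = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
regime = "bandwidth bound" if ai < ridge else "compute bound"

print(f"ridge: {ridge:.2f} FLOP/byte")
print(f"AI = {ai}: up to {rate / 1e12:.2f} TFLOP/s, {regime}")
```

With these assumed figures, a kernel at an intensity of 2 sits well below the ridge, yet the model still allows it a substantial FLOP rate.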

As a rule, though, kernels in PyFR are generally limited by bandwidth, not FLOPs. @fdw might like to add more nuance to this.