The actual amount of computation is still very small: for each entry you only perform a handful of FMAs, and this shows up as a low arithmetic intensity. (If you were computing several sines or cosines per point it might be a different matter.) On the memory side of things, however, you have to read in quite a few values and write out quite a lot.
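To put some rough numbers on this, here is a back-of-the-envelope sketch; the per-point figures are made up for illustration (not taken from an actual PyFR kernel), and the peaks are nominal V100 numbers:

```python
# Back-of-the-envelope roofline check for a simple per-point FMA kernel.
# The per-point figures are illustrative, not from a real PyFR kernel.
bytes_per_point = 8 * (2 + 1)    # say: read two doubles, write one
flops_per_point = 2              # one FMA counts as 2 FLOPs

ai = flops_per_point / bytes_per_point   # arithmetic intensity [FLOP/byte]

# Nominal V100 peaks: ~7.8 TFLOP/s FP64 and ~900 GB/s HBM2
machine_balance = 7.8e12 / 900e9

print(f"arithmetic intensity ~= {ai:.2f} FLOP/byte")
print(f"machine balance      ~= {machine_balance:.1f} FLOP/byte")
# ai is far below the machine balance, so such a kernel sits well inside
# the bandwidth-bound region of the roofline.
```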
We ensure that the memory accesses are coalesced to get the most out of each read, but the reads are still slow and there is not enough work to occupy the threads between them. If you look at the stall reports, you'll see that threads are probably stalled a lot of the time waiting on global memory.
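To give a feel for why coalescing matters, here is a toy sector count. The 32-byte sector size is the usual figure for recent NVIDIA GPUs, but the access patterns are purely illustrative, not those of a PyFR kernel:

```python
# How many 32-byte memory sectors does one warp's load touch?
threads, elem_size = 32, 8   # a warp of 32 threads reading 8-byte doubles

coalesced = {(t * elem_size) // 32 for t in range(threads)}        # consecutive elements
strided   = {(t * 128 * elem_size) // 32 for t in range(threads)}  # stride of 128 elements

print(len(coalesced))  # 8 sectors: every byte of the 256 fetched is used
print(len(strided))    # 32 sectors: 1024 bytes fetched to deliver the same 256 bytes
```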
Maybe have a look at this paper for more info; you can do quite a lot of computation while remaining bandwidth bound. Also see this insightful paper on the V100: https://arxiv.org/pdf/1804.06826.pdf
As a rule, though, kernels in PyFR are generally limited by bandwidth rather than FLOPs; @fdw might like to add more nuance to this.