The actual amount of computation is still very small: for each entry you perform only a few FMAs, which is what the low arithmetic intensity highlights. (If you were computing several sines or cosines per point, it might be a different matter.) On the memory side, by contrast, you have to read in quite a few values and write out quite a lot.
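To make the arithmetic-intensity argument concrete, here is a rough back-of-the-envelope sketch. The per-entry counts and the GPU peak figures are made-up placeholders, not numbers taken from your kernel or card, so substitute your own. The idea is just to compare FLOPs per byte moved against what the hardware could sustain per byte of bandwidth.

```python
def arithmetic_intensity(flops_per_entry, bytes_per_entry):
    """FLOPs performed per byte moved to or from global memory."""
    return flops_per_entry / bytes_per_entry


def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """True if the kernel's intensity is below the machine balance."""
    # Machine balance: FLOPs the GPU can do per byte of memory traffic.
    machine_balance = peak_flops / peak_bandwidth
    return intensity < machine_balance


# Hypothetical kernel: 4 FMAs per entry (counted as 2 FLOPs each),
# 6 float reads and 4 float writes per entry (4 bytes per float).
flops = 4 * 2
bytes_moved = (6 + 4) * 4
ai = arithmetic_intensity(flops, bytes_moved)

# Hypothetical GPU: 10 TFLOP/s peak FP32 and 500 GB/s memory bandwidth.
memory_bound = is_memory_bound(ai, peak_flops=10e12, peak_bandwidth=500e9)
print(f"arithmetic intensity = {ai:.2f} FLOP/byte, memory bound: {memory_bound}")
```

With these placeholder numbers the kernel does about 0.2 FLOP per byte, while the assumed GPU would need roughly 20 FLOP per byte to keep its arithmetic units busy, so the memory system is the limit by a wide margin.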
We ensure that the memory access is coalesced to maximise the computation from each read, but the reads are still slow, and there is not enough work to occupy threads between reads. If you look at the stall reports, you’ll probably see that threads are stalled a lot of the time waiting on global memory.