I saw the following in the paper “PyFR v2.0.3: Towards Industrial Adoption of Scale-Resolving Simulations”:
Cache blocking. A powerful means of reducing the memory bandwidth requirements of a code on conventional CPUs is cache blocking [13]. The idea is to improve data locality by changing the order in which kernels are called. An example of this can be seen in Fig. 5 which shows how a pair of array addition kernels can be rearranged to reduce bandwidth requirements. A key advantage of cache blocking compared with alternative approaches, such as kernel fusion, is that the kernels themselves do not require modification; all that changes is the arguments to the kernels. Historically, cache blocking has not been viable for high-order codes due to the size of the intermediate arrays which are generated by kernels. For example, an Intel Ivy Bridge CPU core from 2013 only has 256 KiB of L2 cache which is shared between executable code and data. As a point of reference, for the Euler equations, storing the solution and flux for just eight ℘ = 4 hexahedra at double precision requires 160 kB. Since 2016, however, there has been a marked increase in the size of private caches, with Intel Golden Cove CPU cores having 2 MiB. The specifics involved in cache blocking FR are detailed in [13, 14] and can improve performance by a factor of two. Within PyFR, cache blocking is accomplished by calling auxiliary methods on task graphs stating which kernels in the graph are suitable for blocking transformations. The interface also contains support for eliminating temporary arrays which can further improve performance.
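To check my understanding of the idea, here is a rough NumPy sketch of the rearrangement described for the pair of array-addition kernels in Fig. 5. This is my own illustration, not PyFR's actual kernels or task-graph interface, and the block size is arbitrary. If I follow the arithmetic, the quoted 160 kB figure comes from 8 elements × 125 solution points × (5 solution + 15 flux components) × 8 bytes, which also matches.

import numpy as np

n = 1_000_000
block = 16_384   # illustrative block size chosen to keep the working set in private cache

a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)

# Unblocked: each kernel streams over the full arrays, so the intermediate
# array t has been evicted from cache before the second kernel reads it back.
t = a + b        # kernel 1 over all n elements
d = t + c        # kernel 2 over all n elements

# Cache blocked: the same two kernels are called per block, so the slice of
# the intermediate produced by kernel 1 is still cache-resident when kernel 2
# runs. The kernels are unchanged; only their arguments differ, and the
# full-size temporary array shrinks to a single block-sized scratch buffer.
d_blocked = np.empty(n)
scratch = np.empty(block)
for i in range(0, n, block):
    s = slice(i, min(i + block, n))
    t_blk = scratch[:s.stop - s.start]
    np.add(a[s], b[s], out=t_blk)           # kernel 1 on one block
    np.add(t_blk, c[s], out=d_blocked[s])   # kernel 2 on one block

assert np.allclose(d, d_blocked)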
It is claimed here that cache blocking is used on the CPU to optimise performance; is it also used on the GPU?