Cache blocking question

I saw this in the paper “PyFR v2.0.3: Towards Industrial Adoption of Scale-Resolving Simulations”:

Cache blocking. A powerful means of reducing the memory bandwidth requirements of a code on conventional CPUs is cache blocking [13]. The idea is to improve data locality by changing the order in which kernels are called. An example of this can be seen in Fig. 5 which shows how a pair of array addition kernels can be rearranged to reduce bandwidth requirements. A key advantage of cache blocking compared with alternative approaches, such as kernel fusion, is that the kernels themselves do not require modification; all that changes is the arguments to the kernels. Historically, cache blocking has not been viable for high-order codes due to the size of the intermediate arrays which are generated by kernels. For example, an Intel Ivy Bridge CPU core from 2013 only has 256 KiB of L2 cache which is shared between executable code and data. As a point of reference, for the Euler equations, storing the solution and flux for just eight ℘ = 4 hexahedra at double precision requires 160 kB. Since 2016, however, there has been a marked increase in the size of private caches, with Intel Golden Cove CPU cores having 2 MiB. The specifics involved in cache blocking FR are detailed in [13, 14] and can improve performance by a factor of two. Within PyFR, cache blocking is accomplished by calling auxiliary methods on task graphs stating which kernels in the graph are suitable for blocking transformations. The interface also contains support for eliminating temporary arrays which can further improve performance.
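To make the Fig. 5 idea concrete, here is a minimal NumPy sketch of cache blocking a pair of array addition kernels. The kernel name `add`, the array names, and the block size are illustrative assumptions, not PyFR's actual API; the point is that the kernel itself is unchanged and only the arguments (slices) it is called with differ.

```python
import numpy as np

# A simple "kernel"; cache blocking does not require modifying it,
# only the arguments it is called with.
def add(dst, x, y):
    np.add(x, y, out=dst)

n = 1_000_000
a, b, d = (np.random.rand(n) for _ in range(3))
c, e = np.empty(n), np.empty(n)

# Unblocked: all of c is written out to memory before being read back.
add(c, a, b)
add(e, c, d)

# Blocked: each slice of c is consumed while it is still resident in
# cache, reducing the traffic to and from main memory.
blk = 16384
for i in range(0, n, blk):
    s = slice(i, i + blk)
    add(c[s], a[s], b[s])
    add(e[s], c[s], d[s])
```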

It is claimed here that cache blocking is used on the CPU to optimise performance; is it also used on the GPU?

No. GPUs have a tiny amount of shared cache relative to their compute performance, and so blocking is not currently a viable option.

At an extreme level, on a CPU we can have a core with ~1 MiB of L2 cache work on as few as 8 elements and get full utilisation. In contrast, a GPU needs to operate on well over 2,500 elements to get full utilisation. This suggests we’d need on the order of ~312 MiB of shared cache, whereas current AMD and NVIDIA GPUs max out at ~100 MiB.
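For reference, the ~312 MiB figure follows directly from scaling the quoted numbers; this back-of-envelope sketch just reproduces that arithmetic (the inputs are the figures above, not new measurements):

```python
# Cache per element implied by the CPU case: ~1 MiB of L2 for 8 elements.
cache_per_elem_mib = 1 / 8

# Elements a GPU needs in flight for full utilisation (per the post).
gpu_elems = 2500

# Cache a GPU would need for blocking to pay off at the same ratio.
print(gpu_elems * cache_per_elem_mib)  # ≈ 312.5 MiB
```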

Regards, Freddie.
