Question on the functionality of GiMMiK

Moved here from: Questions on the TGV case

Thank you! I read the paper GiMMiK—Generating bespoke matrix multiplication kernels for accelerators: Application to high-order Computational Fluid Dynamics.
I am very interested in GiMMiK; the appendix of this paper gives the test matrices. Now I have a question: is GiMMiK used to multiply matrices A and B? If so, why is there only one matrix in each file, such as “Wozniak_GiMMiK_Supplemental/mats/quad/p1/gauss-legendre/m0.txt”?

A good place to start is the source code for gimmik.

Gimmik generates source code for custom kernels for the general task of

\mathbf{C} := \alpha \mathbf{A}\mathbf{B} + \beta\mathbf{C}

where the matrix \mathbf{A} is constant. Because of this only one matrix is given in the supplementary material. In the FR algorithm these operations occur all the time. For example, the matrix you link to interpolates from the solution points to the flux points for a p1 quad with Gauss–Legendre points. The operator that performs this interpolation doesn’t change, but the data changes with each call to rhs, so gimmik is perfect for this.
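
To make this concrete, here is a minimal sketch of what calling GiMMiK directly could look like. It assumes the generate_mm signature used in the gimmik.py snippet quoted further down this thread; the tiny operator matrix is made up for illustration and is not one of the supplementary matrices.

import numpy as np
from gimmik import generate_mm

# A small constant operator matrix (made-up values); in PyFR this would be,
# e.g., the solution-point to flux-point interpolation operator.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 0.0]])

# Generate CUDA C source for a kernel computing C := 1.0*A*B + 0.0*C.
# The entries of A are baked into the generated source as literals;
# B and C are runtime arguments of the kernel.
src = generate_mm(A, dtype=np.float64, platform='cuda', alpha=1.0, beta=0.0)
print(src)

Because A is embedded in the generated code, the kernel only needs B (and C) at run time.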

So the matrix given in the supplementary material is the constant A, while the matrix B is not given in the supplementary material? Does the performance of GiMMiK depend on the size of the matrix B?

Yep, in the context of PyFR the matrix B could be the solution or the flux.

In the example of m0, that matrix would be [n flux points] x [n solution points] and B would be [n solution points] x [n elements]. From that you can see that the performance depends on B insofar as the more elements there are, the more work there is.
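
As a rough illustration of those shapes in plain numpy (this is not GiMMiK itself; the point counts are simply what you would expect for a p1 quad, 4 solution points and 8 flux points, and the element count is arbitrary):

import numpy as np

n_spts, n_fpts, n_eles = 4, 8, 100000   # assumed p1 quad counts; arbitrary element count

A = np.random.rand(n_fpts, n_spts)      # constant operator, e.g. the m0 interpolation matrix
B = np.random.rand(n_spts, n_eles)      # state at the solution points, one column per element

C = A @ B                               # [n flux points] x [n elements]
print(C.shape)                          # (8, 100000)

The work grows linearly with the number of columns of B, i.e. with the number of elements, which is the dependence described above.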

Excuse me, I read the source code in /backends/cuda/gimmik.py and I have a question: is it necessary to compile and generate a new kernel every time a matrix multiplication operation is performed? Will that not be very slow?

class CUDAGiMMiKKernels(CUDAKernelProvider):
    def __init__(self, backend):
        super().__init__(backend)

    def mul(self, a, b, out, alpha=1.0, beta=0.0):
        # Ensure the matrices are compatible
        if a.nrow != out.nrow or a.ncol != b.nrow or b.ncol != out.ncol:
            raise ValueError('Incompatible matrices for out = a*b')

        # Check that A is constant
        if 'const' not in a.tags:
            raise NotSuitableError('GiMMiK requires a constant a matrix')

        # Fetch the matrix and tally up the number of non-zeros
        arr = a.get()
        nnz, nuq = np.count_nonzero(arr), len(np.unique(np.abs(arr)))

        # Check that A is suitable
        if nuq > 28 and nnz / arr.size > 0.15:
            raise NotSuitableError('Matrix is inappropriate for GiMMiK')

        # Generate
        src = generate_mm(arr, dtype=a.dtype, platform='cuda',
                          alpha=alpha, beta=beta)

        # Build
        fun = self._build_kernel('gimmik_mm', src,
                                 [np.int32, np.intp]*2 + [np.int32])
        fun.set_cache_pref(prefer_l1=True)

        # Determine the grid/block
        block = (128, 1, 1)
        grid = get_grid_for_block(block, b.ncol)

        class MulKernel(Kernel):
            def run(self, queue):
                fun.exec_async(grid, block, queue.stream, b.ncol, b, b.leaddim,
                               out, out.leaddim)

        return MulKernel()

Hi @luli - PyFR reuses everything it can. So for a given config, once the backend has compiled a particular kernel, that kernel gets stored in a compute cache and gets reused. It’s all VERY fast!

Yep, as nnunn says, PyFR caches the results of certain function calls, so when one is called again with the same input the function doesn’t have to be evaluated again; the previous result can be reused.

In the kernel pipeline of the cuda backend, the key line for this caching is here: PyFR/provider.py at 37479d405cf5e35cc42bce206d395e68274968bb · PyFR/PyFR · GitHub
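
The underlying idea is ordinary memoisation: the expensive build step is keyed on its inputs (effectively the generated source), so a repeated request returns the already-compiled kernel. PyFR’s own mechanism is the memoize decorator at the link above; the following is just a generic sketch of the pattern using functools.lru_cache, with a stand-in for the compile step.

import functools

@functools.lru_cache(maxsize=None)
def build_kernel(src):
    # Stand-in for the expensive compile step; runs once per distinct source string.
    print(f'compiling {len(src)} bytes of source...')
    return object()  # placeholder for the compiled kernel handle

k1 = build_kernel('__global__ void gimmik_mm(...) { }')
k2 = build_kernel('__global__ void gimmik_mm(...) { }')
assert k1 is k2  # the second call hit the cache; nothing was recompiled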


The function is executed every time a specific matrix multiplication kernel is requested. The operation itself, however, is performed by the kernel that is returned. During the simulation a kernel will be called many thousands of times. As such, the time required to acquire the kernel initially is not too important.

Regards, Freddie.


Thank you! I have another question. I noticed that the CUDA and HIP backends have very few differences, mostly just in driver.py. Why not write the CUDA and HIP backends as a single program and distinguish between them with macro definitions?

@fdw may want to add his view here, but I think the idea behind having separate backends is to allow for the inevitable divergence of CUDA and HIP syntax. Also, the feature set for CUDA is quite a lot richer, and keeping the backends separate makes it easier for those features to be used.

Thank you! I also noticed that in the HIP backend, compared with CUDA, __launch_bounds__ appears in the .mako templates. Is it used to limit the number of registers? Why doesn’t it appear in the CUDA backend?

This hints to the HIP compiler the number of threads per block a kernel will be launched with, which can help the compiler make a number of more informed optimisations. Mainly it controls how many registers the compiler thinks it can use, but it can influence other things as well.

A similar qualifier exists in CUDA, but it has less of an impact on performance there. For HIP this setting can be critical to good performance, which is why we include it as standard.
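
For reference, here is roughly where that qualifier sits in a rendered kernel. This is only an illustrative sketch written as a Python source string (in the spirit of the templated kernels); the prototype and the block size of 128 are assumptions, not PyFR’s actual .mako output.

# Hypothetical rendered HIP kernel source with a launch-bounds hint.
kernel_src = '''
__global__ __launch_bounds__(128) void
gimmik_mm(int n, const double* __restrict__ b, int ldb,
          double* __restrict__ c, int ldc)
{
    /* generated kernel body */
}
'''

The integer passed to __launch_bounds__ is the maximum number of threads per block the kernel will be launched with, which lets the compiler budget registers accordingly.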