Find input of kernels

I’d like to do some performance tests with some kernels. The kernels are Mako templates, and I have converted them into C- or CUDA-type code with the method from https://groups.google.com/g/pyfrmailinglist/c/02B6O5UrI_o

But now I cannot find the input parameters and data for the kernels.
I do not know whether I can get them from the argument argb at line 183 of backends/base/kernels.py.

The datatypes are in Python form.
Should I convert them into C form?
Or are there other ways to find the input parameters in PyFR?

Let me try to understand what you want to do a little better, as it might help to answer your question. Do you want to profile some of the kernels in PyFR? If so, I can see two choices (although it will depend on the backend, and my experience is mainly with the OpenMP and CUDA backends); maybe you have already considered these.

  1. With CPU and CUDA you can launch PyFR and profile the whole application. I have had success in the past doing this with Intel’s VTune and Nvidia’s NVVP. NVVP is not supported for Ampere and later chips, and there are some issues caused by profiling the compilation stage of PyFR, but you can get around these. Nsight Compute/Systems are Nvidia’s replacements, but I haven’t tried application profiling in them.

This brings me to option 2, which by the sounds of things is what you’ve been trying.

  2. You can print out the generated source and profile that. However, if you want to do that you’ll have to write a main that generates some data and then launches the kernel. I’ve been doing a lot of this recently; the main thing to remember when doing this is that PyFR uses an AoSoA data format with a default width of 32 on GPU, so the number of elements is padded up to a multiple of 32. On CPU it gets a bit trickier, with cache blocking, but Semih is the person to talk to about that.

You can find the AoSoA macro easily enough, but here is a slightly easier-to-read version anyway:

#define SOA_SZ 32
#define NVARS 5
#define SOA_IDX(i, v) ((((i)/SOA_SZ)*NVARS + (v))*SOA_SZ + (i) % SOA_SZ)

i is the element number and v is the variable index.

Also, from my experience, you probably want two mains: one that generates a large amount of input data and launches the kernel (this is the one you profile), and a second that uses only a small amount of data, so you can verify that the kernel is giving you the right answer.

In terms of how to generate the data for the profiling main, some random numbers for each of the inputs that won’t go on to generate NaNs or denormal numbers will be fine. For example, if profiling the Euler flux kernel, make sure the density is positive. Below is an example of the code I use for some work on tensor-product elements. It’s not elegant, but it does the job.

Let me know if I haven’t really answered what you were asking, or if you have any more questions.

void init_sol(int n, int p, int nvars, int ldb, WTYPE* b, WTYPE* xs)
{
    for (int i = 0; i < n; i++)
        for (int v = 0; v < nvars; v++)
            for (int i3 = 0; i3 < p; i3++)
                for (int i2 = 0; i2 < p; i2++)
                    for (int i1 = 0; i1 < p; i1++)
                        b[SOA_IDX(i, v) + (i1 + i2*p + i3*p*p)*ldb] =
                            3*v + cos((v + 1)*M_PI*xs[i1])
                                 *sin((v + 1)*M_PI*xs[i2])
                                 *sin((v + 1)*M_PI*xs[i3]);
}

A few more bits on CUDA profiling:

  1. For CUDA we currently set the cache preference to prefer L1, although if/when we use shared memory in the future this may change. By default, NVCC sets the cache preference to none, so to get the same behaviour as PyFR you should change this; it can lead to differences.

  2. In my experience profiling the CUDA kernels, you can sometimes see a performance difference between PyFR and standalone benchmarks. I narrowed this down to PyFR using NVRTC, whereas when you compile the kernel yourself you’ll likely use NVCC. Looking at the SASS, NVCC seems to be able to interleave compute and loads better than the RTC, but at the moment my evidence for this is anecdotal. It might be due to the --fassociative-math flag in NVCC, or something else entirely.
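A minimal sketch of matching the cache preference from point 1 in a standalone benchmark, using the CUDA runtime API (the kernel name is a placeholder for the generated PyFR kernel):

```cuda
// Placeholder for the generated PyFR kernel
__global__ void my_kernel(double* b) { /* ... */ }

int main()
{
    // NVCC's default is cudaFuncCachePreferNone; PyFR prefers L1,
    // so set the same preference before profiling
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

    // ... allocate data, launch and profile my_kernel ...
    return 0;
}
```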


Thank you very much! Yes, I am doing option 2: I am writing a main and generating data to launch the tfluxlin kernel. I may need some time to understand the AoSoA data format first and figure out what the indices and arguments represent. Thanks again!