Let me try to understand what you want to do a little better, as it might help to answer your question. Do you want to profile some of the kernels in PyFR? If so, you have two main options (although it will depend on the backend; my experience is mainly with the OpenMP and CUDA backends), and maybe you have already considered them.
- With the CPU and CUDA backends you can launch PyFR and profile the whole application. I have had success in the past doing this with Intel's VTune and Nvidia's NVVP. NVVP will not be supported for Ampere and later chips, and there are some issues caused by profiling the compilation stage of PyFR, but you can get around these. Nsight Compute and Nsight Systems are Nvidia's replacements, but I haven't tried whole-application profiling with them.
This brings me to option 2, which by the sounds of things is what you’ve been trying.
- You can print out the generated source and profile that. However, if you want to do that you'll have to write a main that generates some data and then launches the kernel. I've been doing a lot of this recently; the main thing to remember is that PyFR uses an AoSoA data format with a default width of 32 on the GPU, so the number of elements is padded up to a multiple of 32. On the CPU backend it is due to get a bit trickier, with cache blocking, but Semih is the person to talk to about that.
You can find the AoSoA macro easily enough, but here is a slightly easier-to-read version anyway:
#define SOA_SZ 32
#define NVARS 5
#define SOA_IDX(i, v) ((((i) / SOA_SZ)*NVARS + (v))*SOA_SZ + (i) % SOA_SZ)
Here i is the element number and v is the variable index.
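To make the padding concrete, here is a small sketch of how the padded element count and a linear offset work out. It assumes the macro above is in the same file; the element count and the indices are just illustrative:

#include <stdio.h>

int main(void)
{
    int neles = 100;

    /* PyFR pads the element count up to a multiple of SOA_SZ (32 here) */
    int npadded = ((neles + SOA_SZ - 1)/SOA_SZ)*SOA_SZ;   /* -> 128 */

    /* Linear offset of variable v = 2 of element i = 37 */
    printf("padded elements: %d, offset: %d\n", npadded, SOA_IDX(37, 2));

    return 0;
}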
Also, from my experience, you probably want two mains: one that generates a large amount of input data and launches the kernel, which is the one you profile; and a second that only uses a small amount of data, so you can verify that the kernel is giving you the right answer.
In terms of how to generate the data for the profiling main, some random numbers for each of the inputs will be fine, as long as they won't go on to produce NaNs or denormal numbers. For example, if profiling the Euler flux kernel, make sure density is positive. Below is an example of the code I use for some work on tensor-product elements. It's not elegant, but it does the job.
Let me know if I haven’t really answered what you were asking, or if you have any more questions.
void init_sol(int n, int p, int nvars, int ldb, WTYPE* b, WTYPE* xs)
{
    // b:  solution data in AoSoA layout, one row of ldb values per solution point
    // xs: the p 1-D point coordinates, used to build smooth, well-behaved values
    for (int i = 0; i < n; i++)
        for (int v = 0; v < nvars; v++)
            for (int i3 = 0; i3 < p; i3++)
                for (int i2 = 0; i2 < p; i2++)
                    for (int i1 = 0; i1 < p; i1++)
                        b[SOA_IDX(i, v) + (i1 + i2*p + i3*p*p)*ldb] =
                            3*v + cos((v + 1)*M_PI*xs[i1])
                                 *sin((v + 1)*M_PI*xs[i2])
                                 *sin((v + 1)*M_PI*xs[i3]);
}
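For context, a profiling-style main built around init_sol and the SOA_IDX macro above might look something like the sketch below. The sizes, the choice of ldb as one full AoSoA row per solution point, the uniform point coordinates, and WTYPE being double are all assumptions on my part, so treat it as a starting point rather than exactly what I run:

#include <math.h>
#include <stdlib.h>

#define WTYPE double

int main(void)
{
    /* Placeholder problem sizes: elements, points per direction, variables */
    int neles = 100000, p = 4, nvars = NVARS;

    /* Pad the element count up to a multiple of SOA_SZ, as PyFR does */
    int npadded = ((neles + SOA_SZ - 1)/SOA_SZ)*SOA_SZ;

    /* One AoSoA row per solution point; npts solution points per element */
    int ldb = npadded*nvars;
    int npts = p*p*p;

    /* Uniform 1-D point coordinates in [-1, 1]; not the actual quadrature
       points, but fine for generating well-behaved input data */
    WTYPE *xs = malloc(p*sizeof(WTYPE));
    for (int i = 0; i < p; i++)
        xs[i] = -1.0 + 2.0*i/(p - 1);

    WTYPE *b = malloc((size_t)npts*ldb*sizeof(WTYPE));
    init_sol(npadded, p, nvars, ldb, b, xs);

    /* ... copy b to the device and launch the generated kernel here ... */

    free(b);
    free(xs);

    return 0;
}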
A few more bits on CUDA profiling:
- For CUDA we currently prefer the L1 cache, although in the future, if/when we use shared memory, this may change. By default NVCC will set the cache preference to none, so to get the same behaviour as PyFR you should change this (see the snippet after this list); it can lead to differences.
- In my experience profiling the CUDA kernels, you can sometimes see a performance difference between PyFR and standalone benchmarks. I narrowed this down to PyFR using NVRTC, whereas when you compile it yourself you'll likely use NVCC. Looking at the SASS, NVCC seems to be able to interleave compute and loads better than the RTC, but at the moment my evidence for this is very anecdotal. This might be due to the --fassociative-math flag in NVCC, or something else; no idea.
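On the cache preference point, when you compile a kernel standalone with NVCC you can match PyFR's L1 preference via the runtime API. A minimal sketch, where the kernel name and signature are just placeholders:

#include <cuda_runtime.h>

__global__ void my_kernel(double *b)
{
    /* ... body of the generated kernel pasted here ... */
}

int main(void)
{
    /* Prefer L1 over shared memory for this kernel, as PyFR does;
       cudaDeviceSetCacheConfig(cudaFuncCachePreferL1) would set it device-wide */
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

    /* ... allocate data and launch my_kernel as usual ... */

    return 0;
}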