I’m trying to profile a new feature in PyFR and would like to distinguish between the different Mul kernels on the CUDA backend when profiling simulations with NVIDIA’s nsys tool. I’m currently forcing all Mul kernels to be generated by GiMMiK, and in the current implementation these kernels share the same function name, so the profiler cannot tell them apart. As a result, I cannot analyse the performance metrics of each kernel individually.
Does anyone know how to differentiate between the Mul kernels when profiling applications with NVIDIA’s nsys tool? If I’m not mistaken, with the OpenMP backend one can rely on the PYFR_DEBUG_OMP_KEEP_LIBS option to distinguish different kernels in the profiler, as described in [1].
For GiMMiK the easiest solution is to change the name of the kernel. After the kernel is generated:
you can do src = src.replace('gimmik_mm', f'gimmik_mm_{suffix}'), making sure to use this new name on the line below. One possibility for the suffix is: suffix = ''.join(t for t in a.tags if t.startswith('M')) which should pick out the name of the operator in question.
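As a rough illustration of the rename described above, here is a minimal sketch that runs outside PyFR. The source string and the tag set are hypothetical stand-ins (the real GiMMiK output is much longer, and the actual tags depend on the operator matrix); only the replace call and the suffix expression come from the post.

```python
# Stand-in for the GiMMiK-generated CUDA source (hypothetical, heavily shortened)
src = '__global__ void gimmik_mm(int n, const double* b, int ldb, double* c, int ldc) {}'

# Hypothetical tag set; only the tag starting with 'M' names the operator
tags = {'align', 'M0'}

# Pick out the operator name and build the new kernel name from it
suffix = ''.join(t for t in tags if t.startswith('M'))
name = f'gimmik_mm_{suffix}'

# Rename the kernel in the source; the same name must then be used
# when the kernel is looked up after compilation
renamed = src.replace('gimmik_mm', name)
print(name)
```

With a tag this simple the new name is already a valid function name; tags containing spaces or operators need further clean-up.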
Thank you for your suggestions; I was sure I was missing something. To add a few comments to your solution: the new name also has to be passed to _build_kernel, and since the suffix becomes part of a function name, white space and arithmetic operators in it need to be replaced by other symbols. I ended up with something like:
suffix = ''.join(t for t in a.tags if t.startswith('M'))
suffix = suffix.replace(' ', '')
suffix = suffix.replace('-', '_minus_')
suffix = suffix.replace('+', '_plus_')
suffix = suffix.replace('*', '_mul_')
src = src.replace('gimmik_mm', f'gimmik_mm_{suffix}')
fun = self._build_kernel(f'gimmik_mm_{suffix}', src,
                         [np.int32, np.intp]*2 + [np.int32])
Anyway, now I can clearly distinguish between the different kernels and see their impact on performance. Thank you!
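For anyone who wants to reuse this, the chain of replacements above could be folded into a small helper. This is only a sketch under some assumptions: kernel_suffix is a hypothetical name and not part of PyFR or GiMMiK, the example tags are made up, and the final catch-all substitution (mapping any other character that is not valid in a C identifier to an underscore) goes beyond what the snippet above does. The tags are also sorted so that the result is reproducible if more than one tag starts with 'M'.

```python
import re

# Same operator replacements as in the snippet above
_OPS = {'+': '_plus_', '-': '_minus_', '*': '_mul_'}


def kernel_suffix(tags):
    """Build a C-identifier-safe suffix from the tags starting with 'M'.

    Hypothetical helper for illustration; not part of PyFR or GiMMiK.
    """
    # Join the operator tags; sorting keeps the result deterministic
    raw = ''.join(sorted(t for t in tags if t.startswith('M')))

    # Drop white space and spell out the arithmetic operators
    raw = ''.join(_OPS.get(ch, ch) for ch in raw if not ch.isspace())

    # Assumption: map any remaining invalid character to an underscore
    return re.sub(r'[^0-9A-Za-z_]', '_', raw)


# Made-up tags of the kind an operator expression might produce
print(kernel_suffix({'M3 - M6', 'align'}))
```

The returned string would then be used in both places, in src.replace(...) and in the name passed to self._build_kernel(...).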