What's the relationship between GiMMiK and cuBLAS?

I read the paper and it seems that GiMMiK is better than cuBLAS? Does the new version of PyFR still need to use the cuBLAS library or not?

GiMMiK is great in some circumstances, and with some of our ongoing work the set of circumstances it works well for is always increasing. In the GiMMiK mul definition in PyFR there are some checks that are made; if the matrix doesn't pass those checks a NotSuitableError is thrown. This is picked up by PyFR here, and then the next provider is tried.
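To make that flow concrete, here is a minimal sketch of the pattern, not the actual PyFR source: the class name, the nnz threshold, the mul signature and the NumPy stand-in kernel are all made up for illustration.

```python
import numpy as np


class NotSuitableError(Exception):
    """Raised by a kernel provider that cannot handle a given request."""


class GiMMiKMulProvider:
    # Purely illustrative threshold; the real checks in PyFR differ
    # between versions and backends
    max_nnz = 512

    def mul(self, a, b, out, alpha=1.0, beta=0.0):
        # GiMMiK bakes the constant operator matrix `a` into generated
        # source code, so it only pays off for small/sparse operators
        if np.count_nonzero(a) > self.max_nnz:
            raise NotSuitableError('Operator matrix too dense/large for GiMMiK')

        # In real life a bespoke GiMMiK kernel would be generated and
        # compiled here; a plain NumPy stand-in keeps the sketch runnable
        return lambda: np.copyto(out, alpha*(a @ b) + beta*out)
```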

What this means is that PyFR will normally try GiMMiK first and then fall back to cuBLAS. @fdw may have a view on this, but I don't see a situation where we completely remove cuBLAS, as I think it is unlikely that GiMMiK will ever be as good as cuBLAS for large, truly dense matrices, i.e. those that occur for tets with order > ~5.
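Continuing the sketch above, the fallback itself is just a loop over providers in priority order; the Backend and CUBLASMulProvider classes here are likewise hypothetical stand-ins, not the real PyFR code.

```python
class CUBLASMulProvider:
    def mul(self, a, b, out, alpha=1.0, beta=0.0):
        # The dense GEMM path can always handle the request, so it never
        # raises NotSuitableError
        return lambda: np.copyto(out, alpha*(a @ b) + beta*out)


class Backend:
    def __init__(self):
        # GiMMiK is tried first; cuBLAS is the dense fallback
        self.providers = [GiMMiKMulProvider(), CUBLASMulProvider()]

    def kernel(self, name, *args, **kwargs):
        for p in self.providers:
            try:
                return getattr(p, name)(*args, **kwargs)
            except NotSuitableError:
                continue  # this provider declined; try the next one

        raise KeyError(f'No provider can handle kernel {name!r}')


# A dense 64x64 operator fails the illustrative nnz check, so the request
# falls through to the cuBLAS stand-in
a, b = np.random.rand(64, 64), np.random.rand(64, 1000)
out = np.zeros((64, 1000))
Backend().kernel('mul', a, b, out)()
```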

Thank you! How can I choose to use cuBLAS only? I want to compare GiMMiK and cuBLAS performance for my specific case.

I think the option to do that from the ini file was removed a couple of versions ago. But if you force mul to throw NotSuitableError then you’ll fall back on cuBLAS. Throw the error as soon as you get into the function, somewhere around here: PyFR/gimmik.py at c8c053d5c0e34ac7a5639c4d328ebacde10fe689 · PyFR/PyFR · GitHub

To force PyFR to use GiMMiK only, you can instead comment out the line a bit later that throws that error.
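In other words, both experiments amount to a one-line edit in the GiMMiK provider linked above. A sketch of what that edit looks like; the exact file path, class and mul signature depend on the PyFR version you have checked out.

```python
# In the GiMMiK provider (e.g. the gimmik.py file linked above):

def mul(self, a, b, out, alpha=1.0, beta=0.0):
    # (a) cuBLAS only: decline immediately, so every mul request falls
    #     through to the next provider (cuBLAS)
    raise NotSuitableError('forcing the cuBLAS fallback for benchmarking')

    # (b) GiMMiK only: leave the line above out and instead comment out
    #     the existing suitability check further down, e.g.
    # if <matrix fails the checks>:
    #     raise NotSuitableError(...)
```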

OK, thank you! I also wonder about the performance on different hardware, e.g. an A100 versus an A10. When testing the provided examples, the reported GPU utilisation is very low. Is there anything wrong?

That GPU utilisation number isn't very informative about how well a task is performing.

If you want to understand whether a program is performing better I suggest using NVIDIA Nsight Compute (ncu); you might have to ask your system administrator for some additional permissions in order to use it.

With ncu you'll be able to see the bandwidth and FLOP numbers for each kernel, which should give you a better idea of how it is performing. In my experience, however, the performance of PyFR with GiMMiK on A100s is about where I would expect it to be. That said, we are working on ways to make more use of some of the new features of the A100, such as async memory copies that bypass registers when loading into shared memory.
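For reference, a typical invocation might look something like the following, where mesh.pyfrm and config.ini are placeholders for your own case and --launch-count just keeps the profiling overhead manageable:

```
ncu --target-processes all --launch-count 100 -o pyfr_report \
    pyfr run -b cuda mesh.pyfrm config.ini
```

The resulting report can then be opened in the Nsight Compute GUI, where the throughput sections give the per-kernel bandwidth and FLOP figures mentioned above.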

As noted in the Examples section of the documentation:

PyFR includes several test cases to showcase the functionality of the solver. It is important to note, however, that these examples are all relatively small 2D simulations and, as such, are not suitable for scalability or performance studies.

Regards, Freddie.

Thank you! Could you please provide some relatively large examples to simulate with the new version of PyFR?

Many of the papers by the PyFR team will include meshes in the supplementary material, which is a good place to look.

If you are trying to measure performance, this topic might be a good place to start: TGV Performance Numbers

I think I link to a reasonably sized cube mesh there that you can try.

Is it cuBLAS or CUTLASS that PyFR uses when calling the matrix multiplication library?

cuBLAS, see here: PyFR/cublas.py at d175beccd3fc5587903cc00c0f401041ff22abe4 · PyFR/PyFR · GitHub
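If you want a quick feel for what a raw cuBLAS GEMM delivers on your card outside of PyFR, CuPy is a convenient stand-in, since its dense matmul dispatches to cuBLAS. A rough sketch, with the matrix sizes purely illustrative:

```python
import cupy as cp

# Operator-like shapes: a small constant matrix applied to many elements
m, k, n = 150, 125, 400000

a = cp.random.rand(m, k)
b = cp.random.rand(k, n)

# Warm up so handle creation and any compilation is excluded from the timing
c = a @ b

start, end = cp.cuda.Event(), cp.cuda.Event()
start.record()
for _ in range(50):
    c = a @ b
end.record()
end.synchronize()

ms = cp.cuda.get_elapsed_time(start, end) / 50
print(f'{ms:.3f} ms per GEMM, {2 * m * k * n / (ms * 1e-3) / 1e12:.2f} TFLOP/s')
```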