CUDA backend parameters


I am wondering if someone can provide a bit more description of these parameters to help optimize performance.

As far as I know, when using multiple GPUs, I had to select local-rank for device-id and cuda-aware for mpi-type. When exactly should I be using round-robin and local-rank? And when should I be using standard or cuda-aware?

How would you select the GiMMiK cutoff? How does it affect accuracy / performance?

I believe block-1d and block-2d are determined by the GPU's specification. I am not very familiar with CUDA. Could someone please elaborate a bit? For example, I am running PyFR with two Tesla K80s in parallel; what is the block size for 1D and 2D pointwise kernels?

Parameterises the CUDA backend with

  1. device-id - method for selecting which device(s) to run on:

    int | round-robin | local-rank

  2. gimmik-max-nnz - cutoff for GiMMiK in terms of the number of non-zero entries in a constant matrix:

    int

  3. mpi-type - type of MPI library that is being used:

    standard | cuda-aware

  4. block-1d - block size for one-dimensional pointwise kernels:

    int

  5. block-2d - block size for two-dimensional pointwise kernels:

    int, int

Thanks a lot!

Junting Chen

Hi Junting,

As far as I know, when using multiple GPUs, I had to select local-rank
for device-id and cuda-aware for mpi-type. When exactly should I be
using round-robin and local-rank? And when should I be using standard or
cuda-aware?

If the GPUs in your system are in compute exclusive mode then
round-robin is probably what you want. Otherwise, opt for local-rank.
So long as each rank gets its own GPU there should be no impact on
performance.

In terms of the mpi-type this depends heavily on the hardware you're
running on and the MPI library you're using. If your MPI library is
CUDA aware then setting mpi-type = cuda-aware can improve performance.
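
As an illustration of the two options discussed above, here is a minimal sketch of how they might appear in a configuration file, together with a small sanity check. The section name `[backend-cuda]` follows PyFR's .ini configuration format rather than anything stated in this thread, and the helper `check_cuda_backend` is hypothetical, written only for this example. The values shown suit a node whose GPUs are not in compute exclusive mode and whose MPI library is CUDA-aware; otherwise swap in `round-robin` or `standard`.

```python
import configparser

# Hypothetical fragment of a PyFR .ini file for a multi-GPU run.
# Option names come from the parameter list quoted in this thread;
# the chosen values are only an illustration.
CONFIG_TEXT = """
[backend-cuda]
device-id = local-rank
mpi-type = cuda-aware
"""

# Allowed keyword values, as listed in the quoted documentation.
DEVICE_ID_KEYWORDS = {"round-robin", "local-rank"}
MPI_TYPES = {"standard", "cuda-aware"}


def check_cuda_backend(text):
    """Parse the fragment and confirm both options hold documented values."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    backend = cfg["backend-cuda"]

    device_id = backend["device-id"]
    mpi_type = backend["mpi-type"]

    # device-id may be an explicit integer index or one of the two keywords.
    if device_id not in DEVICE_ID_KEYWORDS and not device_id.isdigit():
        raise ValueError(f"unexpected device-id: {device_id!r}")
    if mpi_type not in MPI_TYPES:
        raise ValueError(f"unexpected mpi-type: {mpi_type!r}")

    return device_id, mpi_type


print(check_cuda_backend(CONFIG_TEXT))
```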

How would you select GiMMiK cutoff? How does it affect accuracy /
performance?

Some experimentation is needed here as the optimal value depends on the
element types you're using, whether anti-aliasing is enabled, and the GPU
that you are running on.

I believe block-1d and block-2d are determined by the GPU's specification.
I am not very familiar with CUDA. Could someone please elaborate a bit?
For example, I am running PyFR with two Tesla K80s in parallel; what is
the block size for 1D and 2D pointwise kernels?

You should seldom need to modify either of these two values. On some
pathological meshes reducing block-1d can improve performance, but not
by a lot.
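
For anyone who does want to experiment, the sketch below shows only the syntax of the two options: block-1d takes a single integer and block-2d takes a comma-separated pair. The numbers are placeholders chosen for illustration, not recommended or default values, and the section name `[backend-cuda]` follows PyFR's .ini configuration format rather than anything stated in this thread.

```python
import configparser

# Hypothetical fragment; the numeric values are placeholders only.
CONFIG_TEXT = """
[backend-cuda]
block-1d = 64
block-2d = 128, 2
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONFIG_TEXT)
backend = cfg["backend-cuda"]

# block-1d is a single integer.
block_1d = int(backend["block-1d"])

# block-2d is written as "int, int"; split on the comma and convert each part.
block_2d = tuple(int(v) for v in backend["block-2d"].split(","))

print(block_1d)
print(block_2d)
```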

Regards, Freddie.

Thanks Freddie,

So when starting a run, do you usually experiment with the GiMMiK cutoff to find the best value? Does it influence performance significantly enough to be worth the effort? What is the range of this value? Is a power of 2 (the example uses 512) beneficial?


Hi Junting,

The values I would try are 0 (disables GiMMiK), 512 (the default), and
8192. There is nothing special about the number being a power of two.

Regards, Freddie.
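
One way to try the three suggested cutoffs is to generate one configuration variant per value and time a short run with each. The sketch below covers only the generation step, assuming the rest of the configuration is held fixed. `BASE_CONFIG` and the helper `make_variant` are hypothetical names made up for this example, and the section name `[backend-cuda]` follows PyFR's .ini configuration format rather than anything stated in this thread.

```python
import configparser
import io

# Hypothetical base fragment; a real file would contain many more sections.
BASE_CONFIG = """
[backend-cuda]
device-id = local-rank
mpi-type = standard
gimmik-max-nnz = 512
"""

# Values suggested in this thread: 0 disables GiMMiK, 512 is the default.
CANDIDATE_CUTOFFS = [0, 512, 8192]


def make_variant(base_text, cutoff):
    """Return the config text with gimmik-max-nnz set to the given cutoff."""
    cfg = configparser.ConfigParser()
    cfg.read_string(base_text)
    cfg["backend-cuda"]["gimmik-max-nnz"] = str(cutoff)

    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


variants = {n: make_variant(BASE_CONFIG, n) for n in CANDIDATE_CUTOFFS}

for n, text in variants.items():
    print(f"--- gimmik-max-nnz = {n} ---")
    print(text)
```

Each variant can then be saved to its own file and used for a short benchmark run; since the best value depends on the element types, anti-aliasing and hardware, the comparison is only meaningful on the actual case of interest.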
