CUDA backend parameters

Hello,

I am wondering if someone can provide a more detailed description of these parameters and how to set them for best performance.

As far as I know, when using multiple GPUs, I had to select local-rank for device-id and cuda-aware for mpi-type. When exactly should I be using round-robin versus local-rank? And when should I be using standard versus cuda-aware?

How would you select the GiMMiK cutoff? How does it affect accuracy / performance?

I believe block-1d and block-2d are determined by the GPU's specifications. I am not very familiar with CUDA, so could someone please elaborate a bit? For example, if I am running PyFR on two Tesla K80s in parallel, what are the block sizes for the 1D and 2D pointwise kernels?

From the documentation, the CUDA backend is parameterised with:

  1. device-id — method for selecting which device(s) to run on:

    int | round-robin | local-rank

  2. gimmik-max-nnz — cutoff for GiMMiK in terms of the number of non-zero entries in a constant matrix:

    int

  3. mpi-type — type of MPI library that is being used:

    standard | cuda-aware

  4. block-1d — block size for one dimensional pointwise kernels:

    int

  5. block-2d — block size for two dimensional pointwise kernels:

    int, int
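For reference, these options all live in the [backend-cuda] section of the PyFR .ini file. A sketch setting all five of them (the block-size values are purely illustrative, not tuned numbers):

    [backend-cuda]
    device-id = local-rank
    gimmik-max-nnz = 512
    mpi-type = cuda-aware
    block-1d = 64
    block-2d = 128, 2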

Thanks a lot!

Junting Chen

Hi Junting,

As far as I know, when using multiple GPUs, I had to select local-rank
for device-id and cuda-aware for mpi-type. When exactly should I be
using round-robin versus local-rank? And when should I be using
standard versus cuda-aware?

If the GPUs in your system are in compute exclusive mode then
round-robin is probably what you want. Otherwise, opt for local-rank.
So long as each rank gets its own GPU there should be no impact on
performance.
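For example, on a typical cluster node where each MPI rank should get
its own GPU, something like this (a sketch):

    [backend-cuda]
    ; node-local rank 0 takes GPU 0, node-local rank 1 takes GPU 1, ...
    device-id = local-rank

With round-robin each rank instead tries the devices in order and takes
the first one that accepts it, which is what makes it suitable for GPUs
in compute exclusive mode.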

In terms of mpi-type, this depends heavily on the hardware you're
running on and the MPI library you're using. If your MPI library is
CUDA aware, then setting mpi-type = cuda-aware can improve performance.
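For example, assuming an Open MPI build compiled with CUDA support
(ompi_info can confirm this), a sketch:

    [backend-cuda]
    device-id = local-rank
    ; MPI can then send/receive GPU buffers directly,
    ; without staging them through host memory
    mpi-type = cuda-aware

If the library is not actually CUDA aware this setting will fail at run
time; standard is the safe default.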

How would you select the GiMMiK cutoff? How does it affect accuracy /
performance?

Some experimentation is needed here, as the optimal value depends on the
element types you're using, whether anti-aliasing is enabled, and the
hardware that you are running on.

I believe block-1d and block-2d are determined by the GPU's
specifications. I am not very familiar with CUDA, so could someone
please elaborate a bit? For example, if I am running PyFR on two Tesla
K80s in parallel, what are the block sizes for the 1D and 2D pointwise
kernels?

You should seldom need to modify either of these two values. On some
pathological meshes reducing block-1d can improve performance, but not
by a lot.
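If you do want to experiment, they are set as below (a sketch; the
numbers are illustrative, not recommendations):

    [backend-cuda]
    ; block size for one dimensional pointwise kernels;
    ; this is the value to try reducing on pathological meshes
    block-1d = 64
    ; two dimensional pointwise kernels take a pair of values
    block-2d = 128, 2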

Regards, Freddie.

Thanks Freddie,

So when starting a run, do you usually play with the GiMMiK cutoff a bit to find the best value? Does it influence performance significantly enough to be worth the effort of tuning? What is the range of this value? Is a power of 2 (the example uses 512) somehow beneficial?

Junting

Hi Junting,

The values I would try are 0 (disables GiMMiK), 512 (the default), and
8192. There is nothing special about the number being a power of two.
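In other words, run the same case once with each candidate cutoff and
compare the wall-clock times, e.g. (a sketch):

    [backend-cuda]
    ; try 0 (no GiMMiK), 512 (the default), and 8192 (GiMMiK for
    ; much larger constant matrices) and keep whichever is fastest
    gimmik-max-nnz = 512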

Regards, Freddie.
