About running PyFR on multiple GPUs

Hi, Freddie!

I am curious about how to use multiple CUDA GPU cards in a single
node. Every time I use MPI with CUDA, only one GPU per node can be
used.

I have looked up some information online and found that this problem
seems to be related to the MPI implementation adopted (in my case,
Intel MPI), as well as to how the program itself is written.

Thanks for any help!

Zhen

Hi Zhen,

> I am curious about how to use multiple CUDA GPU cards in a single
> node. Every time I use MPI with CUDA, only one GPU per node can be
> used.
>
> I have looked up some information online and found that this problem
> seems to be related to the MPI implementation adopted (in my case,
> Intel MPI), as well as to how the program itself is written.

By default the CUDA backend uses a 'round-robin' strategy to decide
which GPU to use. The strategy tries to create a CUDA context on each
CUDA-capable device in the system until one succeeds. It is intended
to be used when the GPUs are in 'compute exclusive' mode.

Alternatively, you can set the device-id key in [backend-cuda] to be
'local-rank'. Here PyFR will use the node-local MPI rank to determine
which CUDA device to use.
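For example, if you launch one MPI rank per GPU on each node, the
relevant part of the config file is simply:

[backend-cuda]
device-id = local-rank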

Further information on these options can be found in the user guide.

Regards, Freddie.

Hi Freddie,

Thanks a lot, but you may have misunderstood my idea. Suppose I have two CUDA GPUs on a single node, and I want to partition a mesh into two parts and solve the two partitions on the two GPUs simultaneously.

I can set the device-id, but it is a single number, which targets only one GPU. And with local-rank, MPI (I use MVAPICH2) assigns both processes to a single card.

Thanks a lot!

Hi Zhen,

It could be the case that you do not have your cards set to “compute exclusive mode”. If you check nvidia-smi -h you should see the following option:

-c, --compute-mode=   Set MODE for compute applications:
                      0/DEFAULT, 1/EXCLUSIVE_THREAD,
                      2/PROHIBITED, 3/EXCLUSIVE_PROCESS

You should try running nvidia-smi -c 3 to allow only one process per card.
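For example, on a node with two GPUs (device IDs 0 and 1 assumed here),
something along these lines should put both cards into exclusive-process
mode; it usually needs root privileges:

# set both cards to exclusive-process (compute exclusive) mode
nvidia-smi -i 0 -c 3
nvidia-smi -i 1 -c 3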

Hi Zhen,

> Thanks a lot, but you may have misunderstood my idea. Suppose I have
> two CUDA GPUs on a single node, and I want to partition a mesh into
> two parts and solve the two partitions on the two GPUs simultaneously.

This is exactly what

[backend-cuda]
device-id = local-rank

is for. (Or use the compute-exclusive solution outlined by Brian.)

> I can set the `device-id`, but it is a single number, which targets
> only one GPU. And with local-rank, MPI (I use MVAPICH2) assigns both
> processes to a single card.

Can you attach the config file you're using?

Also, as an aside, it is perhaps worth noting that almost all config
file options support expansion of environment variables. So with
MVAPICH2:

[backend-cuda]
device-id = ${MV2_COMM_WORLD_LOCAL_RANK}

is basically the same as local-rank.
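Putting it together for your two-GPU case, a rough sketch of the whole
workflow (the file names are placeholders, and the exact launcher
command may differ with MVAPICH2) would be:

pyfr partition 2 mesh.pyfrm .
mpiexec -n 2 pyfr run -b cuda mesh.pyfrm config.ini

i.e. partition the mesh into two pieces and then start two MPI ranks,
with each rank picking up its own GPU via device-id = local-rank (or
via round-robin with the cards in compute-exclusive mode).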

Regards, Freddie.

Hi Zhen,

As Freddie and Brian suggest, I would try using round-robin mode (the default), but with your GPUs in compute exclusive mode.

Cheers

Peter

Thank you, Brian and Vincent!

I tried Brian’s solution (using nvidia-smi to set the mode) and it works. I had searched a lot about ranks and binding, but the solution turned out to be very simple (and maybe a little hacky, since it also eliminates the possibility of running multiple processes evenly across multiple cards).

And, as a (maybe) off-topic question, taking this solution as an example, how is the communication between processes done? Will the data from one card be moved to main memory and then on to another card? Or can the cards communicate over the PCIe bus directly?

P.S. The direct communication seems to be what is called GPUDirect.

Thanks!

Zhen

Hi Zhen,

> And, as a (maybe) off-topic question, taking this solution as an
> example, how is the communication between processes done? Will the
> data from one card be moved to main memory and then on to another
> card? Or can the cards communicate over the PCIe bus directly?
>
> P.S. The direct communication seems to be what is called GPUDirect.

There is a fork of PyFR that can exploit CUDA-aware MPI. This permits
PyFR to pass CUDA device pointers directly to MPI functions and have the
MPI library handle the copying. This may, or may not, use peer-to-peer
GPU copies or perhaps even GPUDirect over RDMA.
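As a rough illustration of the two paths (a generic MPI + CUDA sketch,
not PyFR's actual code), the conventional route stages the data through
main memory, whereas a CUDA-aware library accepts the device pointer
directly and decides itself whether to go via the host, peer-to-peer
over PCIe, or GPUDirect:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Generic MPI + CUDA sketch (not PyFR code): rank 0 sends a device
 * buffer to rank 1, first by staging through host memory, then -- if
 * the MPI library is CUDA-aware -- by passing the device pointer
 * directly and letting the library choose the transfer mechanism.   */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;              /* 1 MiB payload              */
    char *hbuf = malloc(n), *dbuf;

    cudaSetDevice(rank);                /* one GPU per rank, as above */
    cudaMalloc((void **)&dbuf, n);

    if (rank == 0)
    {
        /* Path 1: copy device -> host, then send the host buffer.
         * Works with any MPI library.                               */
        cudaMemcpy(hbuf, dbuf, n, cudaMemcpyDeviceToHost);
        MPI_Send(hbuf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);

        /* Path 2: hand the device pointer straight to MPI.  Requires
         * a CUDA-aware build; may use P2P copies or GPUDirect RDMA.  */
        MPI_Send(dbuf, n, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        MPI_Recv(hbuf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(dbuf, hbuf, n, cudaMemcpyHostToDevice);

        MPI_Recv(dbuf, n, MPI_BYTE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(dbuf);
    free(hbuf);
    MPI_Finalize();
    return 0;
}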

Whether there is any performance benefit depends heavily on the problem being
solved, the platform it is being solved on, and the MPI library. As
PyFR already tries very hard to overlap communication with computation,
and because for large transfer sizes most MPI libraries will fall back
to copying via the host, the benefit is usually quite small.

Regards, Freddie.