About running PyFR on multiple GPUs

Hi, Freddie!

I am curious about how to use multiple CUDA GPU cards in a single
node. Every time I use MPI with CUDA, only one GPU per node can be
used.

I have looked up some information online and found that this problem
seems to be related to the MPI implementation adopted (in my case,
Intel MPI), as well as to how the program itself is written.

Thanks for any help!

Zhen

Hi Zhen,

> I am curious about how to use multiple CUDA GPU cards in a single
> node. Every time I use MPI with CUDA, only one GPU per node can be
> used.
>
> I have looked up some information online and found that this problem
> seems to be related to the MPI implementation adopted (in my case,
> Intel MPI), as well as to how the program itself is written.

By default the CUDA backend uses a 'round-robin' strategy to decide
which GPU to use. The strategy tries to create a CUDA context on each
CUDA-capable device in the system until one succeeds. It is intended
to be used when the GPUs are in 'compute exclusive' mode.

Alternatively, you can set the device-id key in [backend-cuda] to be
'local-rank'. Here PyFR will use the node-local MPI rank to determine
which CUDA device to use.
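For example, if you launch one MPI rank per GPU on each node, the
relevant part of the config file is simply:

[backend-cuda]
device-id = local-rank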

Further information on these options can be found in the user guide.

Regards, Freddie.

Hi Freddie,

Thanks a lot, but you may have misunderstood my idea. Suppose I have two CUDA GPUs on a single node, and I want to partition a mesh into two parts and solve the two partitions on the two GPUs simultaneously.

I can set the device-id, but it is a single number, which targets only one GPU. And with local-rank, MPI (I use MVAPICH2) assigns both processes to a single card.

Thanks a lot!

Hi Zhen,

It could be the case that you do not have your cards set to “compute exclusive mode”. If you check nvidia-smi -h you should see the following option:

-c, --compute-mode=   Set MODE for compute applications:
                      0/DEFAULT, 1/EXCLUSIVE_THREAD,
                      2/PROHIBITED, 3/EXCLUSIVE_PROCESS

You should try running nvidia-smi -c 3 to allow only one process per card.
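For example, on a node with two GPUs (device IDs 0 and 1 assumed here),
something along these lines should put both cards into exclusive-process
mode; it usually needs root privileges:

# set both cards to exclusive-process (compute exclusive) mode
nvidia-smi -i 0 -c 3
nvidia-smi -i 1 -c 3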

Hi Zhen,

> Thanks a lot, but you may have misunderstood my idea. Suppose I have
> two CUDA GPUs on a single node, and I want to partition a mesh into
> two parts and solve the two partitions on the two GPUs simultaneously.

This is exactly what

[backend-cuda]
device-id = local-rank

is for. (Or use the compute-exclusive solution outlined by Brian.)

> I can set the `device-id`, but it is a single number, which targets
> only one GPU. And with local-rank, MPI (I use MVAPICH2) assigns both
> processes to a single card.

Can you attach the config file you're using?

Also, as an aside, it is perhaps worth noting that almost all config
file options support expansion of environment variables. So with
MVAPICH2:

[backend-cuda]
device-id = ${MV2_COMM_WORLD_LOCAL_RANK}

is basically the same as local-rank.
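Putting it together for your two-GPU case, a rough sketch of the whole
workflow (the file names are placeholders, and the exact launcher
command may differ with MVAPICH2) would be:

pyfr partition 2 mesh.pyfrm .
mpiexec -n 2 pyfr run -b cuda mesh.pyfrm config.ini

i.e. partition the mesh into two pieces and then start two MPI ranks,
with each rank picking up its own GPU via device-id = local-rank (or
via round-robin with the cards in compute-exclusive mode).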

Regards, Freddie.

Hi Zhen,

As Freddie and Brian suggest, I would try using round-robin mode (the default), but with your GPUs in compute exclusive mode.

Cheers

Peter

Thank you, Brian and Vincent!

I tried Brian’s solution (using nvidia-smi to set the mode) and it works. I had searched a lot about ranks and binding, but the solution turned out to be very simple (and maybe a little hacky, since it also eliminates the possibility of running multiple processes evenly across multiple cards).

And, as a (maybe) off-topic question, taking this solution as an example, how is the communication between processes done? Will the data from one card be moved to main memory and then on to another card? Or can the cards communicate over the PCIe bus directly?

P.S. The direct communication seems to be what is called GPUDirect.

Thanks!

Zhen

Hi Zhen,

> And, as a (maybe) off-topic question, taking this solution as an
> example, how is the communication between processes done? Will the
> data from one card be moved to main memory and then on to another
> card? Or can the cards communicate over the PCIe bus directly?
>
> P.S. The direct communication seems to be what is called GPUDirect.

There is a fork of PyFR that can exploit CUDA-aware MPI. This permits
PyFR to pass CUDA device pointers directly to MPI functions and have the
MPI library handle the copying. This may, or may not, use peer-to-peer
GPU copies or perhaps even GPUDirect over RDMA.
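As a rough illustration of the two paths (a generic MPI + CUDA sketch,
not PyFR's actual code), the conventional route stages the data through
main memory, whereas a CUDA-aware library accepts the device pointer
directly and decides itself whether to go via the host, peer-to-peer
over PCIe, or GPUDirect:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Generic MPI + CUDA sketch (not PyFR code): rank 0 sends a device
 * buffer to rank 1, first by staging through host memory, then -- if
 * the MPI library is CUDA-aware -- by passing the device pointer
 * directly and letting the library choose the transfer mechanism.   */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;              /* 1 MiB payload              */
    char *hbuf = malloc(n), *dbuf;

    cudaSetDevice(rank);                /* one GPU per rank, as above */
    cudaMalloc((void **)&dbuf, n);

    if (rank == 0)
    {
        /* Path 1: copy device -> host, then send the host buffer.
         * Works with any MPI library.                               */
        cudaMemcpy(hbuf, dbuf, n, cudaMemcpyDeviceToHost);
        MPI_Send(hbuf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);

        /* Path 2: hand the device pointer straight to MPI.  Requires
         * a CUDA-aware build; may use P2P copies or GPUDirect RDMA.  */
        MPI_Send(dbuf, n, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        MPI_Recv(hbuf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(dbuf, hbuf, n, cudaMemcpyHostToDevice);

        MPI_Recv(dbuf, n, MPI_BYTE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(dbuf);
    free(hbuf);
    MPI_Finalize();
    return 0;
}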

Whether there is any performance benefit depends heavily on the problem being
solved, the platform it is being solved on, and the MPI library. As
PyFR already tries very hard to overlap communication with computation,
and because for large transfer sizes most MPI libraries will fall back
to copying via the host, the benefit is usually quite small.

Regards, Freddie.