I am curious about how to use multiple CUDA GPU cards in a single
node. Every time I use MPI with CUDA, only one GPU per node can be
used.

I have looked for some information online and found that this problem
seems to be related to the MPI implementation adopted (in my case, Intel
MPI), as well as to how the program itself is written.
By default the CUDA backend uses a 'round-robin' strategy to decide
which GPU to use. The strategy tries to create a CUDA context on each
CUDA capable device in the system until one succeeds. It is intended
to be used when the GPUs are in 'compute exclusive' mode.
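For example, compute-exclusive mode can normally be enabled with
nvidia-smi; a minimal sketch, assuming root privileges (the -i flag
limits the change to a single GPU):

nvidia-smi -c EXCLUSIVE_PROCESS       # all GPUs in the node
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS  # GPU 0 only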
Alternatively, you can set the device-id key in [backend-cuda] to be
'local-rank'. Here PyFR will use the node-local MPI rank to determine
which CUDA device to use.
Further information on these options can be found in the user guide.
Thanks a lot, but you may have misunderstood my idea. Assume I have two
CUDA GPUs on a single node, and I want to partition a mesh into two
parts and solve the two partitions on the two GPUs simultaneously.

I can set the device-id, but it is a single number, which targets only
one GPU. As for local-rank, MPI (I used MVAPICH2) assigns both
processes to a single card.
This is exactly what

[backend-cuda]
device-id = local-rank

is for. (Or use the compute-exclusive solution as outlined by Brian.)
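For concreteness, a two-GPU run on one node would then use two MPI
ranks, along the lines of the sketch below (the pyfr sub-commands
assume the standard PyFR command line, and the mesh/config file names
are placeholders, so adjust for your version and setup):

pyfr partition 2 mesh.pyfrm .
mpirun -np 2 pyfr run -b cuda mesh.pyfrm config.ini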
Can you attach the config file you're using?

Also, as an aside, it is perhaps worth noting that almost all config
file options support the expansion of environment variables. So with
MVAPICH2:
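one option (a sketch; MV2_COMM_WORLD_LOCAL_RANK is the node-local rank
variable MVAPICH2 exports, and the ${...} expansion form is an
assumption, so check the user guide for the exact syntax) would be

[backend-cuda]
device-id = ${MV2_COMM_WORLD_LOCAL_RANK}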
I tried Brian's solution (using nvidia-smi to set the mode) and it works. I searched a lot about ranks and binding, but the solution turned out to be so simple (and maybe a little hacky, since it also eliminates the possibility of running multiple processes evenly on multiple cards).
And as a (maybe off-topic) discussion point, taking this solution as an example, how is the communication between processes done? Will the data from one card be moved to main memory and then on to another card? Or can the cards talk over the PCIe bus directly?

P.S. The direct communication seems to be called the GPUDirect technology.
There is a fork of PyFR that can exploit CUDA-aware MPI. This permits
PyFR to pass CUDA device pointers directly to MPI functions and have the
MPI library handle the copying. This may, or may not, use peer-to-peer
GPU copies or perhaps even GPUDirect over RDMA.
Whether there is any performance benefit depends heavily on the problem
being solved, the platform it is being solved on, and the MPI library. As
PyFR already tries very hard to overlap communication with computation,
and because for large transfer sizes most MPI libraries will fall back
to copying via the host, the benefit is usually quite small.
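To make the idea concrete, the sketch below shows the pattern a
CUDA-aware MPI library enables: a device pointer obtained from
cudaMalloc is handed straight to MPI_Send/MPI_Recv and the library
decides how to move the bytes (via the host, peer-to-peer, or GPUDirect
RDMA, depending on the build and hardware). This is a generic C
illustration, not code from PyFR or the fork mentioned above.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Allocate the buffer in GPU memory, not host memory */
    double *dbuf;
    cudaMalloc((void **)&dbuf, 1024 * sizeof(double));

    /* With a CUDA-aware MPI build the device pointer can be passed
       directly to the MPI calls; a non-aware build would require an
       explicit cudaMemcpy to a host buffer first */
    if (rank == 0)
        MPI_Send(dbuf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}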