Assigning each GPU its own CPU cores

Hi,

I am running PyFR with the CUDA backend. Each node of my cluster has 8 A100 GPUs and two CPUs (8 NUMA nodes). What I want to do is bind each GPU to its own set of CPU cores to get the best performance (I have tried running 8 partitions directly within one node, but that does not give the best performance). After some googling, this is the procedure I use on a single node:

  1. Run nvidia-smi topo -m to get the topology of my machine, which looks like this:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7  mlx5_8  mlx5_9  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5
mlx5_0  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS             
mlx5_1  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS             
mlx5_2  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS             
mlx5_3  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS             
mlx5_4  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS             
mlx5_5  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS             
mlx5_6  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS             
mlx5_7  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS             
mlx5_8  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX             
mlx5_9  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X              
  2. After getting the CPU affinities, I wrote a wrapper script called script_gpu that uses taskset:
#!/bin/bash
# Give each MPI rank its own GPU
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK

# Pick the CPU cores local to that GPU (taken from nvidia-smi topo -m)
case $CUDA_VISIBLE_DEVICES in
        0) CORES=48-63   ;;
        1) CORES=48-63   ;;
        2) CORES=16-31   ;;
        3) CORES=16-31   ;;
        4) CORES=112-127 ;;
        5) CORES=112-127 ;;
        6) CORES=80-95   ;;
        7) CORES=80-95   ;;
esac

#echo $CUDA_VISIBLE_DEVICES $CORES
taskset -c $CORES "$@"
  3. Run the job with MPI:
mpirun -np 8 ./script_gpu pyfr ..........

This works well. However, I recently needed to run on more nodes (i.e. two nodes). I changed my command to srun -n 16 --mpi=pmix --cpu-bind=none ./script_gpu pyfr .... but this does not work. Running srun without the wrapper script works, but not with the best performance. Following some discussions online I also tried numactl instead of taskset, but that does not work either. Here is a record of what I have tried so far:

command                                                                            result
mpirun -np 8 with taskset -c $CORES $@                                             working
mpirun -np 8 with numactl --physcpubind=$CORES $@                                  libnuma: Warning: cpu argument 48-63 is out of range
mpirun -np 8 with numactl --cpunodebind=$CORES $@                                  libnuma: Warning: node argument 48 is out of range

srun -n 8 with --cpu-bind=none --mpi=pmix with taskset -c $CORES $@                taskset: failed to parse CPU list: pyfr
srun -n 8 with --cpu-bind=none --mpi=pmix with numactl --physcpubind=$CORES $@     sched_setaffinity: Invalid argument
srun -n 8 with --cpu-bind=none --mpi=pmix with numactl --cpunodebind=$CORES $@     numa_sched_setaffinity_v2_int() failed: Invalid argument; sched_setaffinity: Invalid argument
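
My suspicion is that under srun the OMPI_COMM_WORLD_LOCAL_RANK variable is not set (as far as I can tell it only exists when the processes are started by mpirun), so CUDA_VISIBLE_DEVICES and CORES end up empty and taskset tries to parse pyfr as the CPU list. An untested sketch of the script with a fallback to Slurm's SLURM_LOCALID:

#!/bin/bash
# use the OpenMPI local rank under mpirun, or Slurm's local rank under srun
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-$SLURM_LOCALID}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK

# same GPU -> core mapping as before
case $LOCAL_RANK in
        0|1) CORES=48-63   ;;
        2|3) CORES=16-31   ;;
        4|5) CORES=112-127 ;;
        6|7) CORES=80-95   ;;
esac

taskset -c $CORES "$@"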

Does anyone have comments on this? It is getting more and more frustrating and is now beyond my ability. This topic may have been discussed before, but I would really like to know the exact approach you use on your own clusters. Can anyone give me a more detailed explanation? I really appreciate any answer.

Best wishes,
Zhenyang

Based on your topology, the assignment between CPU cores and GPUs really should not be all that important. The NV12 output indicates that the GPUs are connected directly to each other via NVLink. Further, each GPU has direct access to an IB adapter (as per the PXB output). This is the ideal case for CUDA-aware MPI.

Hence, what you want to focus on is ensuring that each pyfr instance uses the appropriate IB adapter (mlx5_0, mlx5_1 for GPU0 and GPU1; mlx5_2, mlx5_3 for GPU2 and GPU3; etc.) and that MPI is properly CUDA-aware. This should give you the best overall performance.
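
With UCX, one way of doing this is via the UCX_NET_DEVICES environment variable. A rough sketch of a per-rank wrapper (the :1 port suffix and the local-rank variables are assumptions which may need adjusting for your launcher):

#!/bin/bash
# pick up the local rank however your launcher exposes it
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-$SLURM_LOCALID}

# steer each rank onto the IB adapter nearest its GPU (per nvidia-smi topo -m)
case $LOCAL_RANK in
        0) export UCX_NET_DEVICES=mlx5_0:1 ;;
        1) export UCX_NET_DEVICES=mlx5_1:1 ;;
        2) export UCX_NET_DEVICES=mlx5_2:1 ;;
        3) export UCX_NET_DEVICES=mlx5_3:1 ;;
        4) export UCX_NET_DEVICES=mlx5_4:1 ;;
        5) export UCX_NET_DEVICES=mlx5_5:1 ;;
        6) export UCX_NET_DEVICES=mlx5_6:1 ;;
        7) export UCX_NET_DEVICES=mlx5_7:1 ;;
esac

exec "$@"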

Regards, Freddie.

Hi Freddie, thanks for your reply. I am not sure how to do that exactly. Could you describe it in more detail?

Also, following our last discussion PyFR 1.13.0: CUDAOSError - #14 by Zhenyang, I recompiled my MPI library without CUDA support. Should I switch back to CUDA-aware MPI?

Best wishes,
Zhenyang

You will first want to compile UCX by hand with support for CUDA and InfiniBand. It is important that the version of CUDA you compile against be identical to the version you want to run PyFR with.
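
A sketch of the UCX build, assuming an install prefix under $HOME/opt and CUDA in /usr/local/cuda (adjust both to your system):

# from the UCX release source directory
./contrib/configure-release --prefix=$HOME/opt/ucx \
    --with-cuda=/usr/local/cuda --with-verbs
make -j
make install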

Next, compile OpenMPI with support for CUDA and the UCX library you just built. Again, the CUDA versions must match. Then, recompile mpi4py against this version of OpenMPI. It is very important that mpi4py be recompiled and not just reinstalled from a cached prebuilt binary. If you are using parallel HDF5 it is also necessary to rebuild HDF5 against this new MPI distribution and then h5py against this new version of HDF5.
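
The remaining steps might look something like this; the prefixes, pip flags, and HDF5 variables below are illustrative rather than the only way to do it:

# OpenMPI, from its source directory, against CUDA and the UCX built above
./configure --prefix=$HOME/opt/openmpi \
    --with-cuda=/usr/local/cuda --with-ucx=$HOME/opt/ucx
make -j
make install

# mpi4py, rebuilt from source against this MPI (not from a cached wheel)
MPICC=$HOME/opt/openmpi/bin/mpicc pip install --no-cache-dir --no-binary mpi4py mpi4py

# parallel HDF5, from its source directory, then h5py against it
CC=$HOME/opt/openmpi/bin/mpicc ./configure --prefix=$HOME/opt/hdf5 --enable-parallel
make -j
make install
CC=$HOME/opt/openmpi/bin/mpicc HDF5_MPI=ON HDF5_DIR=$HOME/opt/hdf5 \
    pip install --no-cache-dir --no-binary h5py h5py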

Finally, run PyFR and check everything still works. If so, set mpi-type = cuda-aware in the [backend-cuda] section of your configuration file; assuming everything worked, transfers between GPUs should now be peer-to-peer.
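
For example, the relevant part of the .ini file would look like this (the device-id line is only an illustration of giving each rank the GPU matching its local rank; mpi-type is the setting that matters here):

[backend-cuda]
device-id = local-rank
mpi-type = cuda-aware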

Regards, Freddie.

@Zhenyang,
You can (hopefully) streamline the approach outlined by Freddie by using Spack to take care of building and linking these packages correctly.

This one-liner should build the latest PyFR with OpenMPI, supporting CUDA through UCX:
spack install py-pyfr +cuda ^openmpi +cuda fabrics=ucx ^ucx +rc +ud +dc +cuda ^cuda@11.6.1

Make sure to specify a compatible CUDA version for your NVIDIA driver (and also the CUDA architecture for your card).
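
For example (assuming a standard Spack setup; the maximum CUDA version supported by your driver is shown in the nvidia-smi header):

# check the CUDA version your driver supports
nvidia-smi
# once the install finishes, bring PyFR into the current environment
spack load py-pyfr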

Let me know if this was successful for you.


Hi,

Thanks for the reply. I have actually already compiled PyFR successfully by following the steps outlined by Freddie, and the performance matches my expectations. The issue is that I am now trying to push it to use more nodes. At present I am using mpirun, but I have to use Slurm's srun when using more nodes. To do this, I need to run srun --mpi=pmix pyfr run .... Every time I run this, the PyFR CUDA backend cannot find a valid device. I will report back once I resolve this problem. But thank you, I can try Spack for the next installation.

Best wishes,
Zhenyang
