Hi,
I am running PyFR with the CUDA backend. Each node of my cluster has 8 A100 GPUs and two CPUs (8 NUMA nodes). What I want to do is pin each MPI rank to the CPU cores local to its own GPU to reach the best performance (I have tried running 8 partitions within one node directly, but that does not give the best performance). After some googling, here is my procedure on a single node:
- run nvidia-smi topo -m to get the topology of my machine, which looks like this:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
mlx5_0 PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_1 PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS
mlx5_3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS
mlx5_4 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS
mlx5_5 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS
mlx5_6 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS
mlx5_7 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS
mlx5_8 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
mlx5_9 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
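Transcribing the CPU Affinity column by hand gets tedious with many GPUs, so a small helper can pull it straight out of the topo output. A sketch (the sample rows below are abbreviated copies of the table above; on a real node you would feed it the output of nvidia-smi topo -m instead):

```shell
#!/bin/bash
# Print "GPUn <cpu list>" for every GPU row of `nvidia-smi topo -m` output.
# The CPU Affinity list is the second-to-last field on each GPUn row
# (the last field is the NUMA node).
topo_cpu_affinity() {
    awk '/^GPU[0-9]/ { print $1, $(NF-1) }' <<< "$1"
}

# Abbreviated sample rows taken from the table above ("..." elides the
# interconnect columns, which awk skips over anyway).
sample='GPU0 X NV12 ... PXB SYS 48-63,176-191 3
GPU1 NV12 X ... PXB SYS 48-63,176-191 3'

topo_cpu_affinity "$sample"
```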
- using the CPU affinities above, I wrote a wrapper script called script_gpu that uses taskset:
#!/bin/bash
# Give each local MPI rank its own GPU, then pin the rank to the cores
# that are NUMA-local to that GPU (from nvidia-smi topo -m).
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
case $CUDA_VISIBLE_DEVICES in
    0|1) CORES=48-63 ;;
    2|3) CORES=16-31 ;;
    4|5) CORES=112-127 ;;
    6|7) CORES=80-95 ;;
esac
#echo $CUDA_VISIBLE_DEVICES $CORES
exec taskset -c "$CORES" "$@"
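For what it is worth, one likely reason the wrapper breaks under srun is that OMPI_COMM_WORLD_LOCAL_RANK is only set by Open MPI's mpirun; srun exports the node-local rank as SLURM_LOCALID instead. A sketch of a launcher-agnostic variant (the SLURM_LOCALID fallback is my assumption about your Slurm setup, and the core ranges are the ones from the topology above):

```shell
#!/bin/bash
# script_gpu sketch: one GPU per local rank, pinned to NUMA-local cores.
# mpirun (Open MPI) exports OMPI_COMM_WORLD_LOCAL_RANK; srun exports
# SLURM_LOCALID, so fall back to it when the former is absent.
rank=${OMPI_COMM_WORLD_LOCAL_RANK:-$SLURM_LOCALID}

# GPU index -> NUMA-local core range, from the nvidia-smi topo -m table.
cores_for_rank() {
    case $1 in
        0|1) echo 48-63   ;;
        2|3) echo 16-31   ;;
        4|5) echo 112-127 ;;
        6|7) echo 80-95   ;;
        *)   return 1     ;;
    esac
}

# Only bind and exec when an actual command was passed in.
if [ "$#" -gt 0 ]; then
    export CUDA_VISIBLE_DEVICES=$rank
    exec taskset -c "$(cores_for_rank "$rank")" "$@"
fi
```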
- run my job with mpirun:
mpirun -np 8 ./script_gpu pyfr ..........
This works well. However, I recently needed to run across more nodes (e.g. two nodes), so I changed my command to
srun -n 16 --mpi=pmix --cpu-bind=none ./script_gpu pyfr ....
but this does not work. srun without the wrapper script does work, but not with the best performance. Following some discussions online I also tried numactl instead of taskset, but that did not work either. Here is a record of what I have tried so far:
command -> result
- mpirun -np 8 with taskset -c $CORES $@ -> working
- mpirun -np 8 with numactl --physcpubind=$CORES $@ -> libnuma: Warning: cpu argument 48-63 is out of range
- mpirun -np 8 with numactl --cpunodebind=$CORES $@ -> libnuma: Warning: node argument 48 is out of range
- srun -n 8 --cpu-bind=none --mpi=pmix with taskset -c $CORES $@ -> taskset: failed to parse CPU list: pyfr
- srun -n 8 --cpu-bind=none --mpi=pmix with numactl --physcpubind=$CORES $@ -> sched_setaffinity: Invalid argument
- srun -n 8 --cpu-bind=none --mpi=pmix with numactl --cpunodebind=$CORES $@ -> numa_sched_setaffinity_v2_int() failed: Invalid argument; sched_setaffinity: Invalid argument
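A note on the "taskset: failed to parse CPU list: pyfr" error: if OMPI_COMM_WORLD_LOCAL_RANK is unset (as it is under srun), the case statement matches nothing, $CORES stays empty, and because it is unquoted the word vanishes entirely from the taskset command line, so taskset parses the next word, "pyfr", as its CPU list. A minimal demonstration of that word-splitting pitfall (the pyfr arguments here are hypothetical placeholders):

```shell
#!/bin/bash
# With CORES unset, the unquoted $CORES expands to nothing at all, so the
# argument immediately after -c becomes the command name "pyfr".
unset CORES
set -- taskset -c $CORES pyfr run mesh.pyfrm case.ini
echo "$3"   # the word taskset would try to parse as a CPU list
```

As for the numactl "out of range" / "Invalid argument" errors, my best guess is that the launcher had already bound each process to a narrower CPU set before the wrapper ran, and numactl refuses to bind to CPUs outside the currently allowed set (whereas taskset can widen the affinity mask). Launching with mpirun --bind-to none, or keeping srun's --cpu-bind=none, so that all pinning is left to the wrapper, may avoid this; I have not verified that on your exact system.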
Does anyone have comments on this? It is getting more and more frustrating and is beyond my capability. This topic may have been discussed before, but I would really like to know the exact approach you use on your own clusters. Could anyone give me a detailed explanation? I really appreciate any answer.
Best wishes,
Zhenyang