I have read through similar discussions on this, and I have seen suggestions to set device-id to local-rank under [backend-cuda] and to put the GPUs into “compute exclusive mode”. However, I am running PyFR on a supercomputer (HPC) cluster and do not have any sort of sudo access to switch to compute-exclusive mode. If someone could give me a general idea of how to run PyFR on multiple GPUs on a supercomputer cluster, it would be greatly appreciated. Maybe the only way is to submit a batch job rather than use an interactive desktop. Perhaps modifying the beginning of my job script for 1 GPU would do the trick?
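For reference, the single-GPU header is along these lines (the job name, partition, module, and file names below are generic placeholders rather than my actual values):

```bash
#!/bin/bash
#SBATCH --job-name=pyfr_run          # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --gres=gpu:1                 # single GPU
#SBATCH --time=24:00:00
#SBATCH --partition=gpu              # placeholder partition name

module load cuda openmpi             # placeholder module names

pyfr run -b cuda mesh.pyfrm config.ini   # placeholder mesh/config names
```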
Thanks for the quick reply! I am in need of more GPUs, as I am running a large grid with only 1 GPU and getting a CUDA out-of-memory error. When I run it as a batch job and request multiple GPUs, do I set device-id = local-rank under [backend-cuda]?
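That is, something like this in the config file:

```ini
[backend-cuda]
device-id = local-rank
```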
Also, when partitioning, should I be splitting the grid into the number of GPUs that I have or the number of cores? i.e., if I have 4 GPUs and 48 cores, would I partition my mesh into 48 or 4 pieces? I had always assumed it should be the number of cores, which is why I have ntasks-per-node set to 48.
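For concreteness, with a mesh named mesh.pyfrm (just an example name), the two options I am weighing would be:

```bash
# one partition per CPU core (what I have been assuming)
pyfr partition 48 mesh.pyfrm .

# one partition per GPU
pyfr partition 4 mesh.pyfrm .
```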
Thank you! Also, one quick unrelated question. When I was running some of PyFR’s test cases, I would sometimes get a nice progress bar, appearing below GPUFreq = control_disabled, that updated the simulation’s progress and ETA. However, most of the time I do not see it. Is there a way to have it always show up?
Thanks, that worked. Also, when running with multiple GPUs as we discussed earlier, I am seeing that only one of my GPUs is being used with local-rank. Here is what nvidia-smi shows during one of the runs:
Yes, that is what I have it set to, as I mentioned. Any other advice on how I can get it to run on all 4 GPUs? I am on a supercomputer cluster, so I do not have any sort of sudo access. Thank you!
What system are you trying to run on? I would try setting --gpus-per-task=1. This should give each rank a GPU, which it sees as device 0. And then set device-id=0 in the .ini file.
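A rough sketch of what I mean for a four-GPU job (the Slurm directive values, module names, and file names are placeholders and will need adjusting for your system):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4        # one MPI rank per GPU
#SBATCH --gpus-per-task=1          # each rank is bound to its own GPU, which it sees as device 0
#SBATCH --time=24:00:00

module load cuda openmpi           # placeholder module names

srun pyfr run -b cuda mesh.pyfrm config.ini   # placeholder mesh/config names
```

with the backend section of the config set to:

```ini
[backend-cuda]
device-id = 0
```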