Running PyFR on multiple GPUs on an HPC cluster

I have read through similar discussions on this, and I have seen suggestions to change device-id to local-rank under [backend-cuda] and to put the GPUs into "compute exclusive" mode. However, I am running PyFR on a supercomputer HPC cluster and do not have any sort of sudo access to change the compute mode. If someone could give me a general idea of how to run PyFR on multiple GPUs on a supercomputer cluster, it would be greatly appreciated. Maybe the only way is to submit a batch job rather than using an interactive desktop session. Perhaps modifying the beginning of my job script for 1 GPU would do the trick?:

#SBATCH --job-name="PyFR_Simulation"
#SBATCH --time=04:00:00
#SBATCH --output=myscript2.out
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --gpus-per-node=1
#SBATCH --account=#####
#SBATCH --mail-type=BEGIN,END,FAIL

If you want to run a job on a node exclusively with Slurm, you normally add something like this:

#SBATCH --exclusive
#SBATCH --nodes={some integer}

Also, you probably want to set

#SBATCH --ntasks-per-gpu=1
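
Putting that together with your existing header, a multi-GPU version might look something like this (the 4 GPUs per node here are just a placeholder for whatever your nodes actually have):

#SBATCH --job-name="PyFR_Simulation"
#SBATCH --time=04:00:00
#SBATCH --output=myscript2.out
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-gpu=1
#SBATCH --account=#####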

Thanks for the quick reply! I am in need of more GPUs, as I am running a large grid on only 1 GPU and getting a CUDA out-of-memory error. When I run it as a batch job and request multiple GPUs, do I set device-id = local-rank under [backend-cuda]?

Also, when partitioning, should I be partitioning the grid into the number of GPUs that I have or the number of cores? I.e., if I have 4 GPUs and 48 cores, would I partition my mesh into 48 pieces or 4? I had always assumed it was the number of cores, which is why I have ntasks-per-node set to 48.

The device ID will be system-dependent, but most of the time local-rank is correct.
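
That is, in your .ini file, something like:

[backend-cuda]
device-id = local-rank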

When running on GPU, yes, you want one partition per GPU you intend to run on.
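
So with 4 GPUs you would partition the mesh into 4 pieces and launch one MPI rank per piece; roughly (with mesh.pyfrm and config.ini standing in for your own files):

$ pyfr partition 4 mesh.pyfrm .
$ mpiexec -n 4 pyfr run -b cuda mesh.pyfrm config.ini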


Thank you! Also, one quick unrelated question. When I was running some of PyFR's test cases, I would sometimes get a nice progress bar that appeared below the GPUFreq = control_disabled line and updated the simulation's progress and ETA. However, most of the time I do not see it. Is there a way to have this always show up?

The progress bar is controlled by the -p flag, i.e.:

$ pyfr -p run ...

If you have that flag enabled but aren’t seeing the progress bar, then it has something to do with your environment rather than PyFR.

Thanks, that worked. Also, when running with multiple GPUs as we discussed earlier, I am seeing that only 1 of my GPUs is being used, even with local-rank. Here is what nvidia-smi shows me during one of the runs:

(nvidia-smi screenshot: only one of the four GPUs shows any activity)

How can I get it to run with all 4 GPUs?

Try setting device-id = local-rank.

Regards, Freddie.

Yes, that is what I have it set to, as I mentioned. Any other advice as to how I can get it to run with all 4 GPUs? I am on a supercomputer cluster, so I do not have any sudo access. Thank you!

Okay, how does device-id = 0 work?

Regards, Freddie.

What system are you trying to run on? I would try setting --gpus-per-task=1; this should give each rank its own GPU, which it will see as device 0. Then set device-id = 0 in the .ini file.
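
In other words, roughly (untested, and with mesh.pyfrm and config.ini as placeholders for your own files):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1

# each rank should now only see the one GPU assigned to it, as device 0
srun pyfr run -b cuda -p mesh.pyfrm config.ini

together with device-id = 0 under [backend-cuda] in the .ini file.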