GPU IDs and running multiple tasks on GPUs

Hi,

I recently installed four GPUs (RTX 3090 Ti).
I confirmed that a simulation runs on a single GPU using device-id = 0.
In addition, simulations launched with "mpirun -np 2" or "mpirun -np 4" together with "device-id = local-rank" also ran in parallel correctly.
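For reference, a working launch looks like this (using the file names from the log below):

mpirun -np 2 pyfr run -b cuda -p sd7003_p4_par2.pyfrm sd7003_aoa4.ini

with, in the .ini file:

[backend-cuda]
device-id = local-rank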

However, I have some questions about various parallelization methods.

Question 1.

A CUDA out-of-memory error occurred when a second [mpirun -np 2 + device-id = local-rank] run was launched while a first [mpirun -np 2 + device-id = local-rank] run was still in progress.
I want the second run to use the other two GPUs while the first run keeps its two.

When I checked, of GPUs 0, 1, 2 and 3, the first run was using only GPUs 1 and 2, while GPUs 0 and 3 were idle.
For example, the first case filled over 90% of the memory on GPU 1 and GPU 2; when I then started a new case with ["mpirun -np 2" and "device-id = local-rank"], I got the error message.

What I want is for an additional run to automatically select and use the remaining GPUs while two GPUs are already busy. How can this be done?

Question 2.
Assuming that two GPUs (GPU 0 and GPU 1) are each at 50% memory usage,
how can I launch an additional parallel calculation across all four GPUs (0, 1, 2 and 3) with the "mpirun -np 4" command?

Question 3.
A single-GPU calculation is possible with the "device-id = 0" setting.
In the same way, how can I make a run use several specific GPUs, e.g. device-id = 0,1?

An additional question:
when I set mpi-type = cuda-aware, I got an error message.

[backend-cuda]
device-id = local-rank  (I also tried 0)
mpi-type = cuda-aware

How can I fix this? Or is there anything else I need to install?

(pyfr14) sk@sk-ubuntu:~/python3.9/pyfr14/airfoil/ex$  mpirun -np 2 pyfr run -b cuda -p sd7003_p4_par2.pyfrm sd7003_aoa4.ini
[sk-ubuntu:11479] Read -1, expected 1873600, errno = 14
[sk-ubuntu:11478] Read -1, expected 1873600, errno = 14
[sk-ubuntu:11478] *** Process received signal ***
[sk-ubuntu:11478] Signal: Segmentation fault (11)
[sk-ubuntu:11478] Signal code: Invalid permissions (2)
[sk-ubuntu:11478] Failing at address: 0x7fbeff400000
[sk-ubuntu:11479] *** Process received signal ***
[sk-ubuntu:11479] Signal: Segmentation fault (11)
[sk-ubuntu:11479] Signal code: Invalid permissions (2)
[sk-ubuntu:11479] Failing at address: 0x7f232d400000
[sk-ubuntu:11478] [ 0] [sk-ubuntu:11479] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7fc4389a9f10]
[sk-ubuntu:11478] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f28669ccf10]
[sk-ubuntu:11479] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18e9df)[0x7fc438af99df]
[sk-ubuntu:11478] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(+0x2ca8)[0x7fc417095ca8]
[sk-ubuntu:11478] /lib/x86_64-linux-gnu/libc.so.6(+0x18e9df)[0x7f2866b1c9df]
[sk-ubuntu:11479] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(+0x2ca8)[0x7f28539c5ca8]
[sk-ubuntu:11479] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1aa)[0x7f2852d992fa]
[sk-ubuntu:11479] [ 4] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1aa)[0x7fc4164692fa]
[sk-ubuntu:11478] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x2af)[0x7fc416460b6f]
[sk-ubuntu:11478] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x2af)[0x7f2852d90b6f]
[sk-ubuntu:11479] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x8f)[0x7f28539c751f]
[sk-ubuntu:11479] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(+0x4a04)[0x7f28539c7a04]
[sk-ubuntu:11479] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x8f)[0x7fc41709751f]
[sk-ubuntu:11478] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(+0x4a04)[0x7fc417097a04]
[sk-ubuntu:11478] [ 7] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_progress+0x5c)[0x7f2860f859ec]
[sk-ubuntu:11479] [ 8] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_progress+0x5c)[0x7fc432f629ec]
[sk-ubuntu:11478] [ 8] /usr/lib/x86_64-linux-gnu/libmpi.so.20(ompi_request_default_wait_all+0x2e5)[0x7f28614f93f5]
[sk-ubuntu:11479] [ 9] /usr/lib/x86_64-linux-gnu/libmpi.so.20(ompi_request_default_wait_all+0x2e5)[0x7fc4334d63f5]
[sk-ubuntu:11478] [ 9] /usr/lib/x86_64-linux-gnu/libmpi.so.20(PMPI_Waitall+0x8f)[0x7f286153062f]
[sk-ubuntu:11479] [10] /home/sk/python3.9/pyfr14/lib/python3.9/site-packages/mpi4py/MPI.cpython-39-x86_64-linux-gnu.so(+0x7ba95)[0x7f2861820a95]
[sk-ubuntu:11479] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.20(PMPI_Waitall+0x8f)[0x7fc43350d62f]
[sk-ubuntu:11478] [10] /home/sk/python3.9/pyfr14/lib/python3.9/site-packages/mpi4py/MPI.cpython-39-x86_64-linux-gnu.so(+0x7ba95)[0x7fc4337fda95]
[sk-ubuntu:11478] [11] /home/sk/python3.9/pyfr14/bin/python3.9[0x55c334]
[sk-ubuntu:11478] [12] /home/sk/python3.9/pyfr14/bin/python3.9[0x55c334]
[sk-ubuntu:11479] [12] /home/sk/python3.9/pyfr14/bin/python3.9(_PyObject_MakeTpCall+0x32c)[0x5f24dc]
[sk-ubuntu:11479] [13] /home/sk/python3.9/pyfr14/bin/python3.9(_PyObject_MakeTpCall+0x32c)[0x5f24dc]
[sk-ubuntu:11478] [13] /home/sk/python3.9/pyfr14/bin/python3.9(_PyEval_EvalFrameDefault+0x472b)[0x5f978b]
[sk-ubuntu:11479] [14] /home/sk/python3.9/pyfr14/bin/python3.9(_PyEval_EvalFrameDefault+0x472b)[0x5f978b]
[sk-ubuntu:11478] [14] /home/sk/python3.9/pyfr14/bin/python3.9[0x5f4322]
[sk-ubuntu:11479] [15] /home/sk/python3.9/pyfr14/bin/python3.9[0x5f4322]
[sk-ubuntu:11478] [15] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x1d4)[0x5ecdb4]
[sk-ubuntu:11478] [16] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x1d4)[0x5ecdb4]
[sk-ubuntu:11479] [16] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [17] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [17] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [18] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [18] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [19] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [19] /home/sk/python3.9/pyfr14/bin/python3.9[0x5f4322]
[sk-ubuntu:11478] [20] /home/sk/python3.9/pyfr14/bin/python3.9[0x5f4322]
[sk-ubuntu:11479] [20] /home/sk/python3.9/pyfr14/bin/python3.9[0x508d76]
[sk-ubuntu:11478] [21] /home/sk/python3.9/pyfr14/bin/python3.9[0x508d76]
[sk-ubuntu:11479] [21] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f3180]
[sk-ubuntu:11479] [22] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f3180]
[sk-ubuntu:11478] [22] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [23] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [23] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [24] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [24] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [25] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [25] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [26] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [26] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [27] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [27] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [28] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [28] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [29] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [29] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] *** End of error message ***
/home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node sk-ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Thanks
Su Kyung Yoon

See CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES | NVIDIA Technical Blog. Setting this variable enables you to hide specific devices (such as those already in use). Alternatively, you can put your devices into compute-exclusive mode and then use the round-robin option in PyFR.
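For example, if the first run occupies GPUs 1 and 2, something along these lines should start a second run on GPUs 0 and 3 (a sketch; the GPU indices are illustrative and the file names are taken from the log above):

# Hide the busy GPUs; CUDA renumbers the visible devices from zero, so with
# device-id = local-rank, rank 0 maps to physical GPU 0 and rank 1 to GPU 3.
CUDA_VISIBLE_DEVICES=0,3 mpirun -np 2 pyfr run -b cuda -p sd7003_p4_par2.pyfrm sd7003_aoa4.ini

# Alternative: put the GPUs into compute-exclusive mode (requires root) and set
# device-id = round-robin in [backend-cuda]; each rank then takes a free device.
sudo nvidia-smi -c EXCLUSIVE_PROCESS

With Open MPI the variable can also be forwarded explicitly, e.g. mpirun -x CUDA_VISIBLE_DEVICES=0,3. The same trick covers Question 3: CUDA_VISIBLE_DEVICES=0,1 restricts a run to GPUs 0 and 1.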

Running more than a single task on a GPU is almost always ill-advised: the overheads associated with it are substantial.

For CUDA-aware MPI you need an MPI library compiled with CUDA support. The specifics depend on which MPI distribution you are using (not all of them even offer CUDA awareness as an option).
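With Open MPI, one way to check whether your build has CUDA support is (assuming ompi_info is on your PATH):

# Prints "true" when this Open MPI build was compiled with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

The stock Ubuntu package is typically built without it; in that case you would need an Open MPI built with the --with-cuda configure option, or another CUDA-aware MPI distribution.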

Regards, Freddie.


@YoonSuKyung - Also, once your fleet of RTX 3090s is ready to fly, unless you are debugging double-precision kernels, make sure precision = single in [backend]. Sadly, fp64 is 64 times slower than fp32 on the GA102.
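That is, in the .ini file:

[backend]
precision = single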

Is the difference between single- and double-precision results reasonable, then?
Have you ever compared the two in PyFR?

Yes, see:

Regards, Freddie.

