Hi,
I recently installed four GPUs (RTX 3090 Ti).
I confirmed that a simulation runs correctly on a single GPU using device-id = 0.
In addition, simulations launched with “mpirun -np 2” or “mpirun -np 4” together with "device-id = local-rank" also ran in parallel without problems.
However, I have some questions about other ways of parallelizing.
Question 1.
A CUDA out-of-memory error occurred when I launched a second [mpirun -np 2 + local-rank] job while a first [mpirun -np 2 + local-rank] job was still running.
I want the second job to use the other two GPUs while the first job is using two.
When I checked, of GPUs 0, 1, 2, and 3, GPUs 0 and 3 were idle and the calculation ran only on GPUs 1 and 2.
For example, the first case used over 90% of the memory on GPU 1 and GPU 2. I then started a new case with “mpirun -np 2” and “device-id = local-rank”, and got the error message.
What I want is for an additional job to automatically select and use the remaining GPUs when two are already in use. Is this possible, and if so, how?
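To make the request concrete, here is an untested sketch of the kind of workaround I have in mind. It assumes that the PyFR CUDA backend respects the standard CUDA_VISIBLE_DEVICES environment variable, that mpirun passes it on to the local ranks, and that the device indices reported by nvidia-smi match the CUDA ones (which may require CUDA_DEVICE_ORDER=PCI_BUS_ID). The helper name pick_free_gpus and the 1000 MiB threshold are my own placeholders, not anything provided by PyFR.

```shell
# Hypothetical helper: read "index, memory.used" CSV lines on stdin and print
# up to N GPU indices whose used memory (MiB) is below LIMIT, comma separated.
# Usage: pick_free_gpus N LIMIT
pick_free_gpus() {
    awk -F', *' -v n="$1" -v limit="$2" '
        $2 + 0 < limit && found < n {
            out = out (found ? "," : "") $1
            found++
        }
        END { print out }
    '
}

# Intended use (not executed here; assumes nvidia-smi, mpirun and pyfr are on PATH):
#   gpus=$(nvidia-smi --query-gpu=index,memory.used \
#              --format=csv,noheader,nounits | pick_free_gpus 2 1000)
#   CUDA_VISIBLE_DEVICES="$gpus" mpirun -np 2 pyfr run -b cuda \
#       -p sd7003_p4_par2.pyfrm sd7003_aoa4.ini
```

With device-id = local-rank left unchanged, my understanding is that ranks 0 and 1 of the second job would then see only the two listed GPUs, but I have not verified this.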
Question 2.
Assuming that two GPUs (GPU 0 and GPU 1) are each already at 50% memory usage,
how can I run an additional parallel calculation across all four GPUs (0, 1, 2, and 3) using the “mpirun -np 4” command?
Question 3.
A single-GPU calculation is possible with a setting such as “device-id = 0”.
In the same way, how can I run on several specific GPUs, for example with something like device-id = 0,1?
Additional question:
When I set mpi-type = cuda-aware, I get the error message shown below.
[backend-cuda]
device-id = local-rank or 0 ..
mpi-type = cuda-aware
How can I fix this? Is there anything else I need to install?
(pyfr14) sk@sk-ubuntu:~/python3.9/pyfr14/airfoil/ex$ mpirun -np 2 pyfr run -b cuda -p sd7003_p4_par2.pyfrm sd7003_aoa4.ini
[sk-ubuntu:11479] Read -1, expected 1873600, errno = 14
[sk-ubuntu:11478] Read -1, expected 1873600, errno = 14
[sk-ubuntu:11478] *** Process received signal ***
[sk-ubuntu:11478] Signal: Segmentation fault (11)
[sk-ubuntu:11478] Signal code: Invalid permissions (2)
[sk-ubuntu:11478] Failing at address: 0x7fbeff400000
[sk-ubuntu:11479] *** Process received signal ***
[sk-ubuntu:11479] Signal: Segmentation fault (11)
[sk-ubuntu:11479] Signal code: Invalid permissions (2)
[sk-ubuntu:11479] Failing at address: 0x7f232d400000
[sk-ubuntu:11478] [ 0] [sk-ubuntu:11479] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7fc4389a9f10]
[sk-ubuntu:11478] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f28669ccf10]
[sk-ubuntu:11479] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18e9df)[0x7fc438af99df]
[sk-ubuntu:11478] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(+0x2ca8)[0x7fc417095ca8]
[sk-ubuntu:11478] /lib/x86_64-linux-gnu/libc.so.6(+0x18e9df)[0x7f2866b1c9df]
[sk-ubuntu:11479] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(+0x2ca8)[0x7f28539c5ca8]
[sk-ubuntu:11479] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1aa)[0x7f2852d992fa]
[sk-ubuntu:11479] [ 4] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1aa)[0x7fc4164692fa]
[sk-ubuntu:11478] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x2af)[0x7fc416460b6f]
[sk-ubuntu:11478] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x2af)[0x7f2852d90b6f]
[sk-ubuntu:11479] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x8f)[0x7f28539c751f]
[sk-ubuntu:11479] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(+0x4a04)[0x7f28539c7a04]
[sk-ubuntu:11479] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x8f)[0x7fc41709751f]
[sk-ubuntu:11478] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_btl_vader.so(+0x4a04)[0x7fc417097a04]
[sk-ubuntu:11478] [ 7] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_progress+0x5c)[0x7f2860f859ec]
[sk-ubuntu:11479] [ 8] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_progress+0x5c)[0x7fc432f629ec]
[sk-ubuntu:11478] [ 8] /usr/lib/x86_64-linux-gnu/libmpi.so.20(ompi_request_default_wait_all+0x2e5)[0x7f28614f93f5]
[sk-ubuntu:11479] [ 9] /usr/lib/x86_64-linux-gnu/libmpi.so.20(ompi_request_default_wait_all+0x2e5)[0x7fc4334d63f5]
[sk-ubuntu:11478] [ 9] /usr/lib/x86_64-linux-gnu/libmpi.so.20(PMPI_Waitall+0x8f)[0x7f286153062f]
[sk-ubuntu:11479] [10] /home/sk/python3.9/pyfr14/lib/python3.9/site-packages/mpi4py/MPI.cpython-39-x86_64-linux-gnu.so(+0x7ba95)[0x7f2861820a95]
[sk-ubuntu:11479] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.20(PMPI_Waitall+0x8f)[0x7fc43350d62f]
[sk-ubuntu:11478] [10] /home/sk/python3.9/pyfr14/lib/python3.9/site-packages/mpi4py/MPI.cpython-39-x86_64-linux-gnu.so(+0x7ba95)[0x7fc4337fda95]
[sk-ubuntu:11478] [11] /home/sk/python3.9/pyfr14/bin/python3.9[0x55c334]
[sk-ubuntu:11478] [12] /home/sk/python3.9/pyfr14/bin/python3.9[0x55c334]
[sk-ubuntu:11479] [12] /home/sk/python3.9/pyfr14/bin/python3.9(_PyObject_MakeTpCall+0x32c)[0x5f24dc]
[sk-ubuntu:11479] [13] /home/sk/python3.9/pyfr14/bin/python3.9(_PyObject_MakeTpCall+0x32c)[0x5f24dc]
[sk-ubuntu:11478] [13] /home/sk/python3.9/pyfr14/bin/python3.9(_PyEval_EvalFrameDefault+0x472b)[0x5f978b]
[sk-ubuntu:11479] [14] /home/sk/python3.9/pyfr14/bin/python3.9(_PyEval_EvalFrameDefault+0x472b)[0x5f978b]
[sk-ubuntu:11478] [14] /home/sk/python3.9/pyfr14/bin/python3.9[0x5f4322]
[sk-ubuntu:11479] [15] /home/sk/python3.9/pyfr14/bin/python3.9[0x5f4322]
[sk-ubuntu:11478] [15] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x1d4)[0x5ecdb4]
[sk-ubuntu:11478] [16] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x1d4)[0x5ecdb4]
[sk-ubuntu:11479] [16] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [17] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [17] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [18] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [18] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [19] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [19] /home/sk/python3.9/pyfr14/bin/python3.9[0x5f4322]
[sk-ubuntu:11478] [20] /home/sk/python3.9/pyfr14/bin/python3.9[0x5f4322]
[sk-ubuntu:11479] [20] /home/sk/python3.9/pyfr14/bin/python3.9[0x508d76]
[sk-ubuntu:11478] [21] /home/sk/python3.9/pyfr14/bin/python3.9[0x508d76]
[sk-ubuntu:11479] [21] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f3180]
[sk-ubuntu:11479] [22] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f3180]
[sk-ubuntu:11478] [22] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [23] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [23] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [24] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [24] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [25] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [25] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [26] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [26] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [27] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [27] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] [28] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] [28] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11479] [29] /home/sk/python3.9/pyfr14/bin/python3.9(_PyFunction_Vectorcall+0x104)[0x5ecce4]
[sk-ubuntu:11478] [29] /home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11479] *** End of error message ***
/home/sk/python3.9/pyfr14/bin/python3.9[0x4f4ee8]
[sk-ubuntu:11478] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node sk-ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
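For the cuda-aware failure above, one check I can think of is whether my Open MPI build has CUDA support at all, since the backtrace shows the system libmpi.so.20 under /usr/lib. The sketch below assumes the mpi_built_with_cuda_support parameter that ompi_info reports according to the Open MPI documentation; the helper name is hypothetical and it only parses whatever text is piped into it.

```shell
# Hypothetical helper: read `ompi_info --parsable --all` output on stdin and
# print the value of mpi_built_with_cuda_support ("true" or "false"),
# or "unknown" if the parameter is not listed.
cuda_aware_from_ompi_info() {
    value=$(sed -n 's/^.*mpi_built_with_cuda_support:value://p' | head -n 1)
    echo "${value:-unknown}"
}

# Intended use (not executed here; assumes ompi_info is on PATH):
#   ompi_info --parsable --all | cuda_aware_from_ompi_info
```

If this prints false or unknown, I assume I would need an Open MPI build with CUDA support before mpi-type = cuda-aware can work, but please correct me if that is wrong.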
Thanks
Su Kyung Yoon