Memory on CUDA GPU backend

Hi,

I am testing a large 3D jet case on a new cluster (with NVIDIA A100 GPUs, I think the 40 GB version). Since each node has 8 GPUs and 2 CPUs, I should use the round-robin option for device-id, right? Each node has 1 TB of RAM, but I keep getting a CUDA out-of-memory error even when using up to 4 nodes. I don't have enough allocation left this month to go any higher. Is there any way to show the memory requirement for this case, e.g. by modifying a few lines of code? The jet case is simply the one from "A high-order cross-platform incompressible Navier–Stokes solver via artificial compressibility with application to a turbulent jet" (ScienceDirect).

Thanks in advance,

best wishes,
zhenyang

Using a slightly out-of-date spreadsheet, I think there should be sufficient memory on two 40 GB A100s to run the case, although this does not account for all of the additional memory needed for p-multigrid.
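For reference, a back-of-envelope version of that estimate looks something like the sketch below. Every number in it is a placeholder assumption rather than the actual case set-up, so substitute your own element count, order and register count:

# Rough per-rank memory estimate -- all values here are assumptions
nelem = 200000          # hexahedral elements on this rank (assumed)
p     = 4               # polynomial order (assumed)
nupts = (p + 1)**3      # solution points per hex
nvars = 4               # 3D artificial compressibility: p, u, v, w
nregs = 10              # rough count of solution-sized arrays kept by the integrator (assumed)
nbytes = nelem*nupts*nvars*nregs*8   # double precision
print(f'~{nbytes / 1024**3:.1f} GiB of solution-sized storage alone')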

Can you try:

  • Using local-rank with one core per GPU? If you are using Slurm you can configure this fairly easily (see the sketch after this list).
  • Running the case with the compressible Navier–Stokes solver, just for a few time steps?
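On the first point, local-rank device selection conceptually just picks the GPU whose index matches the rank's position within its node. The snippet below is an illustration of that mapping using mpi4py, not PyFR's actual code:

# Illustration: map each MPI rank to a node-local rank, and hence to a GPU index
from mpi4py import MPI

comm  = MPI.COMM_WORLD
local = comm.Split_type(MPI.COMM_TYPE_SHARED).rank   # rank within this node
print(f'global rank {comm.rank} -> node-local rank {local} -> GPU {local}')

With Slurm this usually amounts to requesting one task per GPU on each node (e.g. --ntasks-per-node equal to the number of GPUs) with one core per task.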

If you want to see the memory usage, the easiest way is either to SSH into the node while the job is running and run nvidia-smi, or to launch the job from an interactive session (srun --pty bash -i) and run nvidia-smi there.

Otherwise you can add a print statement to the malloc in base/backend.py that will print out the size of the memory allocations as they happen.
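As a sketch of what that might look like (method and argument names can differ between PyFR versions, so check your copy of the file first; the argument name nbytes below is an assumption):

# Hypothetical illustration only: print the size of each device allocation
# before it is attempted, e.g. in the CUDA backend's _malloc_impl
def _malloc_impl(self, nbytes):
    print(f'CUDA allocation request: {nbytes / 1024**2:.1f} MiB', flush=True)
    # ...existing body of the method continues unchanged...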

Hi,

Thanks for your reply!

First, I switched to local-rank mode and ran the case interactively with two cores, but got the same error. Here is the nvidia-smi output while it was running:

Mon Oct 25 12:29:57 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0    55W / 400W |  22862MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   22C    P0    57W / 400W |  22932MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3261510      C   .../envs/pyfr_env/bin/python    22859MiB |
|    1   N/A  N/A   3261512      C   .../envs/pyfr_env/bin/python    22929MiB |
+-----------------------------------------------------------------------------+

It seems only about half of the memory is in use, so how can this lead to a CUDA out-of-memory error?

By the way, I printed obj.nbytes in the malloc function, which gave me a bunch of numbers that do not mean much to me. What is the right variable to print in this case?

Best wishes,
zhenyang

What happens if you write a CUDA program which tries to allocate 30 GiB or so of memory? Does that work as expected?

Regards, Freddie.

Hi Freddie,

I have tried to write a CUDA program like this:

#include <cuda.h>
#include <stdio.h>
#include <cuda_runtime_api.h>
#include <chrono>
#include <thread>

int main()
{
    // 8e9 floats = 32 GB of device memory
    const size_t SIZE = 8000000000;
    float *abc;

    // Check the return code so a failed allocation is not silently ignored
    cudaError_t err = cudaMalloc((void **)&abc, SIZE * sizeof(float));
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));

    using namespace std::this_thread;     // sleep_for, sleep_until
    using namespace std::chrono_literals; // ns, us, ms, s, h, etc.

    sleep_for(100s);  // keep the allocation alive while checking nvidia-smi

    cudaFree(abc);

    return 0;
}

When I checked nvidia-smi, it gave me:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   34C    P0    61W / 400W |  30932MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1261279      C   ./a.out                         30929MiB |
+-----------------------------------------------------------------------------+

I can allocate 30 GB of memory. Am I doing it the right way?

Best wishes,
Zhenyang

That is exactly what I would expect to see.

Can you post the full backtrace from PyFR?

Regards, Freddie.

Hi Freddie,

Here it is:

Traceback (most recent call last):
  File "/home/x_zhyua/.conda/envs/pyfr_env/bin/pyfr", line 33, in <module>
    sys.exit(load_entry_point('pyfr==1.12.2', 'console_scripts', 'pyfr')())
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/__main__.py", line 117, in main
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/__main__.py", line 250, in process_run
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/__main__.py", line 232, in _process_common
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/solvers/__init__.py", line 16, in get_solver
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/__init__.py", line 36, in get_integrator
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/phys/controllers.py", line 8, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/phys/base.py", line 17, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/pseudo/__init__.py", line 58, in get_pseudo_integrator
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/pseudo/multip.py", line 118, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/pseudo/pseudocontrollers.py", line 11, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/pseudo/base.py", line 48, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/solvers/base/system.py", line 36, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/backends/base/backend.py", line 102, in commit
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/backends/cuda/base.py", line 90, in _malloc_impl
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/backends/cuda/driver.py", line 282, in mem_alloc
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/backends/cuda/driver.py", line 134, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/ctypesutil.py", line 33, in _errcheck
pyfr.backends.cuda.driver.CUDAOutofMemory
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/x_zhyua/.conda/envs/pyfr_env/bin/pyfr", line 33, in <module>
    sys.exit(load_entry_point('pyfr==1.12.2', 'console_scripts', 'pyfr')())
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/__main__.py", line 117, in main
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/__main__.py", line 250, in process_run
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/__main__.py", line 232, in _process_common
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/solvers/__init__.py", line 16, in get_solver
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/__init__.py", line 36, in get_integrator
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/phys/controllers.py", line 8, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/phys/base.py", line 17, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/pseudo/__init__.py", line 58, in get_pseudo_integrator
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/pseudo/multip.py", line 118, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/pseudo/pseudocontrollers.py", line 11, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/integrators/dual/pseudo/base.py", line 48, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/solvers/base/system.py", line 36, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/backends/base/backend.py", line 102, in commit
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/backends/cuda/base.py", line 90, in _malloc_impl
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/backends/cuda/driver.py", line 282, in mem_alloc
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/backends/cuda/driver.py", line 134, in __init__
  File "/home/x_zhyua/.conda/envs/pyfr_env/lib/python3.8/site-packages/pyfr-1.12.2-py3.8.egg/pyfr/ctypesutil.py", line 33, in _errcheck
pyfr.backends.cuda.driver.CUDAOutofMemory
[node056:2053728] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[node056:2053728] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Can you confirm there are no issues allocating this much system memory via calloc?

Can you give me a hint on how to confirm that?
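One quick way, if you do not want to write another C program, is to call calloc directly from Python via ctypes and then touch the memory. This is only a sketch, and the 30 GiB figure is just an example:

# Sketch: ask libc's calloc for ~30 GiB of host memory and force it to be committed
import ctypes, ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library('c'))
libc.calloc.restype  = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.free.argtypes   = [ctypes.c_void_p]

n   = 30 * 1024**3
ptr = libc.calloc(n, 1)
print('calloc returned', hex(ptr) if ptr else 'NULL')

if ptr:
    ctypes.memset(ptr, 1, n)   # touch every page so the memory is actually committed
    libc.free(ptr)

If calloc returns NULL, or the process is killed while touching the memory, the problem is on the host side rather than the GPU.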

Hi all,

Update:

I changed the configuration file to use the compressible flow solver with a single polynomial order rather than multi-p, and it seems to be working now (the compressible flow solver with multi-p is still not working). So the problem could be with multi-p. In the end, two A100 GPUs are used.

[Off topic question moved to new topic by @WillT]

Best wishes,
Zhenyang

The problem is unlikely to be with multi-p. Rather, when you enable multi-p we have to construct additional systems for the lower polynomial orders and these systems result in increased memory usage.
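As a rough illustration, assuming hexahedral elements and a p4 down to p1 cycle (both assumptions here), the lower levels alone add the better part of another copy of the solution storage:

# Solution-point count of the lower multigrid levels relative to the finest level,
# assuming hexes where a level at order p has (p + 1)**3 points per element
fine  = (4 + 1)**3                            # p = 4 level
lower = sum((p + 1)**3 for p in (3, 2, 1))    # p = 3, 2, 1 levels
print(f'lower levels add ~{lower / fine:.0%} extra storage')   # ~79%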

Regards, Freddie.

Hi Freddie,

Yes, true. I managed to get more GPUs involved, and I can run it for a short period of time since I don't have much allocation left this month. However, I get a NaN error when running the configuration file out of the box. The only thing I changed is the anti-aliasing parameter from div-flux to flux (surf-flux was also tried), since I am using version 1.12.2 and I think div-flux is not supported any more. Do you have any suggestions about this anti-aliasing problem?

Best wishes,
Zhenyang

That case, if I recall correctly, was one of the few which benefited from div-flux anti-aliasing due to the use of a rather non-linear source term. You will need to adjust the mesh to get it to work with current versions of PyFR, or alternatively switch back to an older version of PyFR which has support for div-flux anti-aliasing.

Regards, Freddie.

Hi Freddie,

Sorry to open this question again. I still have a problem with how to use the GPUs optimally on my current cluster, since I am very new to GPU computing. I have 8 NVIDIA A100 GPUs and 2 CPUs in one node. Which way of using the GPUs is optimal? Allocating each CPU its own GPU? Will multiple GPUs sharing one CPU affect the efficiency? I have tested a big job with different numbers of GPUs (2, 4, 8) and 2 CPUs (with the local-rank option), and the estimated wall time does not change much; it is nowhere near halved, or even reduced to 2/3.

Another question: what number of nodal points (or, one could say, how many elements) per GPU is most efficient? Before reaching that optimum, the more nodal points per GPU, the more efficient it should be, right?

Thank you so much for answering those questions.

Cheers,
Zhenyang

PyFR generally needs one CPU core per GPU. Ideally the core should belong to a CPU which is on the same PCIe root complex as the GPU it will be using. Getting this right is the task of the job scheduler/MPI library.

A reasonable job size is ~20k elements per GPU.
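As a rough sizing sketch (the mesh size below is made up):

# With the ~20k elements per GPU heuristic, a 320,000-element mesh (assumed)
# suggests something in the region of 16 GPUs
nelem         = 320000
elems_per_gpu = 20000
print(f'~{nelem // elems_per_gpu} GPUs')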

If your job is reasonably sized you should see a good decrease in wall clock time as you add additional GPUs.

Regards, Freddie.