CUDA backend errors ‘CUDAOutOfMemory’ and ‘CUDAInvalidDevice’

Hi Freddie,

Following on from this question, I am trying to reproduce the jet simulation from this paper: https://data.mendeley.com/datasets/65m665nt9c/1
However, when I set device-id to round-robin it worked for the examples but gave me a ‘CUDAOutOfMemory’ error in the jet simulation. When I tried local-rank, it gave me ‘CUDAInvalidDevice’. I have essentially no background in GPU computing; could you give me some advice?

Cheers,
Zhenyang

The baseline simulation was run on 96 NVIDIA P100 GPUs. As such, in order to reproduce the simulation data in a reasonable period of time you’ll need a similar level of hardware. What exactly are you running with?

Regards, Freddie.

Hi Freddie,

I am using 96 Tesla K80 GPUs.

Cheers Zhenyang

48 K80s (each with two GPUs for a total of 96 ranks) should certainly be enough. As such it is likely that your problem is related to assigning MPI ranks to GPUs. This is something that your system administrator should be able to help you with. (On the PyFR side you want to be using device-id = local-rank.)
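For reference, a minimal sketch of what the relevant section of a PyFR configuration file might look like with this setting (the precision value is just a placeholder; only the device-id line matters here):

```ini
[backend]
precision = single

[backend-cuda]
; assign each MPI rank the GPU matching its node-local rank
device-id = local-rank
```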

Regards, Freddie.

Hi Freddie,

Thanks for your great help. I have fixed it now. But the original mesh has been partitioned into 96 parts. Do you know how to reassemble it, or do you have the unpartitioned mesh for this case?

Cheers,
Zhenyang

As per the user guide, the partition command can be used to repartition meshes and solution files into any number of pieces. For example

pyfr partition 8 mesh.pyfrm soln1.pyfrs soln2.pyfrs outd/

will repartition the mesh and two solution files into 8 pieces and save the output into the directory outd.
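The repartitioned case can then be run with one MPI rank per piece; a sketch, assuming the file names from the example above and a hypothetical config.ini:

```shell
# Launch one MPI rank per mesh partition; each rank binds to a GPU
# according to the device-id setting in the configuration file.
mpiexec -n 8 pyfr run -b cuda outd/mesh.pyfrm config.ini
```

The rank count passed to mpiexec must match the number of pieces the mesh was partitioned into.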

Regards, Freddie.

@fdw @Zhenyang
Hi Freddie and Zhenyang, I got the same error when I tried to reproduce the NACA0021 simulation following this work [1]. (I used its supplementary materials with PyFR 1.14.0.)

I first converted the Gmsh mesh file into a PyFR mesh file by
pyfr import naca0021_p4_7c.msh naca0021_p4_7c.pyfrm

And then start the simulation by
pyfr run -b cuda -p naca0021_p4_7c.pyfrm p4_aa_01.ini

Then it produced pyfr.backends.cuda.driver.CUDAOutofMemory.

Since I hardly know anything about GPU computing, could you give me some suggestions, please?

This may be helpful.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 00000000:03:00.0 Off |                    0 |
| 23%   38C    P8    19W / 235W |     13MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P5000        Off  | 00000000:04:00.0  On |                  Off |
| 31%   48C    P8     9W / 180W |    970MiB / 16277MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1246      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2067      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1246      G   /usr/lib/xorg/Xorg                 39MiB |
|    1   N/A  N/A      2067      G   /usr/lib/xorg/Xorg                377MiB |
|    1   N/A  N/A      2198      G   /usr/bin/gnome-shell              199MiB |
|    1   N/A  N/A      9978      G   ...998703167565289414,131072      152MiB |
|    1   N/A  N/A     11559      G   ...RendererForSitePerProcess      155MiB |
|    1   N/A  N/A     35160      G   ...araView/bin/paraview-real       31MiB |
+-----------------------------------------------------------------------------+

[1] Park, J. S., Witherden, F. D., & Vincent, P. E. (2017). High-order implicit large-eddy simulations of flow over a NACA0021 aerofoil. AIAA Journal, 55(7), 2186–2197.

Best regards,
Hongwei

That case has a working set of about 20 GiB. This is more than the amount of memory on your GPUs.
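As a rough sanity check, the quoted working set can be compared against a GPU's total memory; a sketch, with the K40c's memory hard-coded from the nvidia-smi output above (normally it would come from `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`):

```shell
# Working set quoted above (~20 GiB) versus one GPU's total memory.
REQUIRED_MIB=$((20 * 1024))
GPU_MIB=11441   # Tesla K40c, from the nvidia-smi listing above
if [ "$GPU_MIB" -lt "$REQUIRED_MIB" ]; then
    echo "insufficient: need ${REQUIRED_MIB} MiB, have ${GPU_MIB} MiB"
fi
```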

Regards, Freddie.

Well, thank you. So I can only run this case with CPU :smiling_face_with_tear:

I’m writing this here as it’s as good a place as any, and it will hopefully aid people in the future. The NACA deep stall case is somewhat difficult to run for a number of reasons.

  1. In deep stall there are some very large flow features; consequently, the span required so as not to artificially enforce periodicity is very large. For example, for a low pressure turbine blade a span of 20% of chord is normally considered acceptable in many circumstances. However, in the paper cited it seems that somewhere between 2 and 4 times the chord was needed.

  2. The case has quite high memory requirements. As a result of the previous point, the case is quite large in terms of the total number of degrees of freedom. This is compounded by the case having large separated dynamics, so there is a large region behind the suction surface where the mesh resolution is higher than would normally be seen in a fully attached flow.

  3. The frequency of the shedding dynamics is quite low, so the case has to be run for a long time in order to generate meaningful time averages. As a consequence, a significant amount of computational resource is needed if a full run is to be attempted: the case either needs to be run for a long wall-clock time or strong-scaled out to a large number of partitions.

  4. The mesh is not of the best quality. This is not to do a disservice to Jin Seok: making meshes for high-order methods is challenging, and this is a very challenging case. However, the mesh quality could potentially be improved, and it is a contributing factor to the difficult start-up procedure.

Opinion part:

All in all, this is a very challenging case to run, and unless you really want to run it to explore some poorly understood aspect of the physics, I would recommend other cases. If you just want to run a CFD case, I would recommend the SD7003 case at a shallow angle of attack and a Reynolds number of O(10^5).

That being said, I know there are some people/organisations out there that use this case as a proof of scaling, and it is a good candidate for that, although some of the setup requirements make it less than ideal. A worthy alternative might be the T106 low pressure turbine: there is lots of data out there on it, it is a very standard LES test case, and it is much more straightforward to get working.