Following on from this question, I am trying to reproduce the jet simulation from this paper: https://data.mendeley.com/datasets/65m665nt9c/1
However, when I set device-id to round-robin it worked for the examples but gave a CUDAOutofMemory error in the jet simulation, and when I set it to local-rank it gave a CUDAInvalidDevice error. I have basically no knowledge of GPU computing; could you give me some advice?
The baseline simulation was run on 96 NVIDIA P100 GPUs. As such, in order to reproduce the simulation data in a reasonable period of time you’ll need a similar level of hardware. What exactly are you running with?
48 K80s (each with two GPUs, for a total of 96 ranks) should certainly be enough. As such, it is likely that your problem is related to assigning MPI ranks to GPUs. This is something your system administrator should be able to help you with. (On the PyFR side you want to be using device-id = local-rank.)
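For anyone searching later, that option lives in the CUDA backend section of the configuration file. A minimal sketch of just that section (the rest of the .ini is omitted):

    [backend-cuda]
    device-id = local-rank

With this setting each MPI rank picks the GPU matching its node-local rank, which is what you want when every rank should own exactly one GPU.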
Thanks for your great help; I have fixed it now. However, the original mesh has been partitioned into 96 parts. Do you know how to reassemble it, or do you have the unpartitioned mesh for this case?
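For anyone who hits the same question: one approach, assuming the documented behaviour of the pyfr partition command (which repartitions an existing .pyfrm file), is that partitioning into a single part should recover an unpartitioned mesh. A sketch, where jet.pyfrm is a hypothetical file name standing in for the actual mesh:

    pyfr partition 1 jet.pyfrm .

The final argument is the output directory; writing to . will overwrite the input file, so keep a copy if you want the 96-part version around.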
@fdw @Zhenyang
Hi Freddie and Zhenyang, I got the same error when I tried to reproduce the NACA0021 simulation following this work [1]. (I used its supplementary materials with PyFR 1.14.0.)
I first converted the Gmsh mesh file into a PyFR mesh file with
pyfr import naca0021_p4_7c.msh naca0021_p4_7c.pyfrm
and then started the simulation with
pyfr run -b cuda -p naca0021_p4_7c.pyfrm p4_aa_01.ini
This produced a pyfr.backends.cuda.driver.CUDAOutofMemory error.
Since I know very little about GPU computing, could you give me some suggestions, please?
The nvidia-smi output below may be helpful.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K40c Off | 00000000:03:00.0 Off | 0 |
| 23% 38C P8 19W / 235W | 13MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P5000 Off | 00000000:04:00.0 On | Off |
| 31% 48C P8 9W / 180W | 970MiB / 16277MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1246 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2067 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1246 G /usr/lib/xorg/Xorg 39MiB |
| 1 N/A N/A 2067 G /usr/lib/xorg/Xorg 377MiB |
| 1 N/A N/A 2198 G /usr/bin/gnome-shell 199MiB |
| 1 N/A N/A 9978 G ...998703167565289414,131072 152MiB |
| 1 N/A N/A 11559 G ...RendererForSitePerProcess 155MiB |
| 1 N/A N/A 35160 G ...araView/bin/paraview-real 31MiB |
+-----------------------------------------------------------------------------+
[1] Park, J. S., Witherden, F. D., & Vincent, P. E. (2017). High-order implicit large-eddy simulations of flow over a NACA0021 aerofoil. AIAA Journal, 55(7), 2186–2197.
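Before anything else it is worth checking that the run can actually use both GPUs rather than just one. A sketch of a two-rank run, assuming the commands above and device-id = local-rank in the [backend-cuda] section; whether the case then fits in memory is a separate question, as the reply below explains:

    pyfr partition 2 naca0021_p4_7c.pyfrm .
    mpiexec -n 2 pyfr run -b cuda -p naca0021_p4_7c.pyfrm p4_aa_01.ini

Bear in mind that the two GPUs listed above are mismatched (11 GiB vs 16 GiB), so the partition sizes would ideally be weighted accordingly.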
I’m writing this here as it’s as good a place as any, and it will hopefully aid people in the future. The NACA deep stall case is somewhat difficult to run, for a number of reasons:
1. In deep stall there are some very large flow features; consequently, the span required to avoid artificially enforcing periodicity is very large. For example, for a low-pressure turbine blade a span of 20% of chord is normally considered acceptable, whereas in the paper cited it seems that somewhere between 2 and 4 times the chord was needed.
2. The case has quite high memory requirements. As a result of the previous point, the case is quite large in terms of the total number of degrees of freedom. This is compounded by the large separated dynamics: there is a large region behind the suction surface where the mesh resolution is higher than what might normally be seen in a fully attached flow. (A rough back-of-envelope estimate of the memory footprint is sketched after this list.)
3. The frequency of the dominant dynamics is quite low, so the case has to be run for a long time in order to generate meaningful time averages. This is a consequence of the shedding that is occurring, and it means that a significant amount of computational resource is needed if a full run is to be attempted: the case either has to be run for a long wall-clock time or strong-scaled out to a large number of partitions.
4. The mesh is not the best quality. This is not to do a disservice to Jin Seok; making meshes for high-order methods is challenging and this is a very challenging case. However, the mesh quality could potentially be improved, and this is a contributing factor in the difficult startup procedure.
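On point 2, here is a rough back-of-envelope sketch of how the memory footprint scales with mesh size and polynomial order; every number is an illustrative placeholder rather than a real figure for this case:

    # Back-of-envelope GPU memory estimate for a high-order hexahedral case.
    # All inputs are illustrative placeholders, not the real NACA0021 figures.
    def estimate_memory_gib(n_elems, order, n_vars=5, bytes_per_val=8,
                            storage_factor=10):
        # Solution points per hexahedron for a degree-p solution basis
        sol_pts = (order + 1)**3
        # Total degrees of freedom (five conserved variables in 3D)
        dofs = n_elems * sol_pts * n_vars
        # storage_factor is a crude allowance for flux points, time
        # integrator registers and backend scratch; the true figure
        # depends on the scheme and backend
        return dofs * bytes_per_val * storage_factor / 2**30

    # A hypothetical 500k-element mesh at p = 4
    print(f'{estimate_memory_gib(500_000, 4):.1f} GiB')  # ~23 GiB

The takeaway is that even a modest-looking mesh at p = 4 can comfortably exceed the 11 GiB of a single K40c, which is consistent with the out-of-memory error reported above.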
Opinion part:
All in all, this is a very challenging case to run, and unless you really want to run it to explore some aspect of the physics that is not well understood, I would recommend other cases. If you just want to run a CFD case, I would recommend the SD7003 case at a shallow angle of attack and a Reynolds number of O(10^5).
That being said, I know there are some people/organisations out there that use this case as a proof of scaling, and it is a good candidate for that, although some of the setup requirements make it less than ideal. A worthy alternative might be the T106 low-pressure turbine: there is a lot of data out there on it, it is a very standard LES test case, and it is much more straightforward to get working.