Restarting from partitioned mesh fails

Nature of bug

I am trying the newest development version because I am interested in some of the new features.

I ran my simulation case on A100 cards (classical external flow, airfoil section), starting with p0 to get rid of the initial transients, then switching to p2 after partitioning with the new partitioning features. From the last flow state, I tried to restart using p4. It always fails instantly due to instabilities. Even if I use p1, and even on a single core, restarting from the solution output by the p4 simulation fails. By contrast, if I run with p4 for a longer time under the same conditions, the simulation passes that time instant without problems. That is to say, the problem is specific to restarting from the p4 solution generated with the partitioning.

I partitioned with the command:

pyfr partition add -e balanced -p metis du96W180_noTripping.pyfrm 8 p4partitioningBalanced

And restarted using the command:

pyfr restart -b cuda -P $PARTITION_NAME $MESH_FILE $SOLUTION_FILE $CONFIG_FILE

where the variables are set properly. Could there be a bug? Or did I do something silly?

Error print out (if applicable)

minimum time step rejected

or NaN if I use constant time steps, no matter how much I decrease the time step.

PyFR information

  • PyFR version: development version (729d7b6cfb334e36745de8e96a4f79a356222277)
  • OS: Linux
  • System description: 4 × Nvidia A100, 2 × Zen3 EPYC 7513
  • Compiler and version: gcc/13.2.0, openmpi/5.0.7, cuda/12.6.2, hdf5/1.14.5, python/3.11.9
  • Backend (if applicable): CUDA

Git diff that reproduces bug

Do you have a procedure where starting up without changing partitioning works but changing partitioning doesn’t (specifically, where the one and only difference is a change of partitioning)? What does the flow look like before it diverges?

Regards, Freddie.

Unfortunately, the mesh is too large to restart at p4 without any partitioning. However, at p1 the restart also fails.

Some additional findings:

  1. I was not able to reproduce the issue on the tutorial case 3d-triangular-airfoil. I suspect it has something to do with having hex+pri elements, because the tutorial case has only hex elements. That is, the issue may be related to -p metis -e balanced and hex+pri elements on a large mesh (generated in Pointwise). I use METIS 5.2.1, which I compiled myself.
  2. As I suspected -p metis -e balanced, I tried -e hex:6 -e pri:5. The restart failed again.
  3. I tried to reconstruct the partitioning from the state written by the p4 simulation. Restarting from that reconstruction failed again.
  4. I also tried SCOTCH, using -e hex:6 -e pri:5 -p scotch, with no success. Same failure.

Next time, I will generate a small test mesh with hex and pri elements and try to reproduce the issue.

Regards

Kenan

What happens if you run with that partitioning from the start (so never changing partitioning at all, and using it for both p = 1 and p = 4)? When it does fail, do you have any renderings? Is the failure on a partition interface?

Regards, Freddie.

This time, I did the partitioning from the start and ran the simulation from scratch with p2. Then I tried to restart the simulation using p2 and the same partitioning, without changing anything. The restart failed. If I don’t stop the simulation at that time level, it goes on without any issues.

It fails instantly during the restart attempt, before a single time step can be taken. That is, there is no chance to view the flow field just before the blow-up.

That’s fine, also. Can you visualise the file you are trying to restart from? This will help us determine if the issue is with the restart code or the file writing code.

Regards, Freddie.

Good point. Visualizing is also impossible. When I try to export, I get error messages:

pyfr export volume ../du96W180_BANC_noTripping.pyfrm solution_0.00010000.pyfrs export.pvtu

/scratch-grete/usr/niikceng/PyFR/PyFR_2.1.0dev/pyfr/solvers/euler/elements.py:56: RuntimeWarning: invalid value encountered in divide
vs = [rhov/rho for rhov in cons[1:-1]]
/scratch-grete/usr/niikceng/PyFR/PyFR_2.1.0dev/pyfr/solvers/euler/elements.py:56: RuntimeWarning: divide by zero encountered in divide
vs = [rhov/rho for rhov in cons[1:-1]]
/scratch-grete/usr/niikceng/PyFR/PyFR_2.1.0dev/pyfr/writers/vtk/base.py:13: RuntimeWarning: invalid value encountered in matmul
ipts = op.astype(pts.dtype) @ pts.reshape(op.shape[1], -1)

This is probably a more useful hint. After the failure shown above, the only vtu file written was that of the first partition. Opening that file in ParaView, I could easily see that the density values on a few of the triangular prisms were zero. Oddly, the vtu file of the first partition looked like the entire computational domain, which is confusing. I checked the number of elements in this partition: 300010, whereas the full mesh has 3000600.

This suggests that the file writing is failing (as in you’re getting blocks of zeros instead of actual values).
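One way to check this without going through the exporter is to scan the solution file directly. The sketch below assumes only that the .pyfrs file is an HDF5 container. It makes no assumption about PyFR's dataset names or layout: it walks every plain floating-point dataset and counts non-finite and exactly-zero entries. The helper names summarise_array and scan_hdf5 are made up for illustration, and both NumPy and h5py need to be installed.

```python
import sys

import numpy as np


def summarise_array(arr):
    """Count non-finite (NaN/Inf) and exactly-zero entries in an array."""
    arr = np.asarray(arr, dtype=float)
    return {
        'size': int(arr.size),
        'nonfinite': int(np.count_nonzero(~np.isfinite(arr))),
        'zeros': int(np.count_nonzero(arr == 0)),
    }


def scan_hdf5(path):
    """Summarise every plain floating-point dataset in an HDF5 file.

    Datasets with other dtypes (integers, strings, compound types) are
    skipped, so this is only a first-pass check.
    """
    import h5py  # imported lazily so summarise_array works without it

    report = {}

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.dtype.kind == 'f' and obj.size:
            # Reads the whole dataset into memory; fine for a quick check
            report[name] = summarise_array(obj[...])

    with h5py.File(path, 'r') as f:
        f.visititems(visit)

    return report


if __name__ == '__main__' and len(sys.argv) > 1:
    for name, stats in scan_hdf5(sys.argv[1]).items():
        print(name, stats)
```

Some zeros can be legitimate (for example a velocity component that vanishes), so the interesting signal is a dataset in which a large contiguous share of the entries is zero or non-finite.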

Can you confirm what file system is being used?

Regards, Freddie.
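On Linux, the file system type behind the output directory can be read from /proc/mounts (the command df -T <path> reports the same information). The following is a minimal sketch of that lookup. The function names are hypothetical, and escaped characters in mount points (such as \040 for a space) are not decoded.

```python
import os
import sys


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point that contains path."""
    path = os.path.realpath(path)
    best = None
    for mount_point, fs_type in mounts:
        prefix = mount_point.rstrip('/') + '/'
        if path == mount_point or (path + '/').startswith(prefix):
            # '>=' lets a later mount on the same mount point take precedence
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best[1] if best else None


if __name__ == '__main__' and len(sys.argv) > 1 and os.path.exists('/proc/mounts'):
    with open('/proc/mounts') as f:
        table = parse_mounts(f.read())
    print(fs_type_for(sys.argv[1], table))
```

Running it with the directory the solution files are written to as the only argument prints the type string as the kernel reports it.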