Restarting from partitioned mesh fails

Nature of bug

I am trying the newest development version because I am interested in some of the new features.

I ran my simulation case on A100 cards (classical external flow over an airfoil section), starting at p0 to get rid of the initial transients, then at p2 after partitioning with the new partitioning features. From the last flow state I tried to restart at p4, and it always fails instantly due to instabilities. Even if I use p1, and even on a single core, restarting from the solution output by the p4 simulation fails. By contrast, if I keep running at p4 for longer with the same conditions, that failing time instant passes by without problems. That is to say, the problem is all about restarting from the p4 solution generated with the partitioning.

I partitioned with the command:

pyfr partition add -e balanced -p metis du96W180_noTripping.pyfrm 8 p4partitioningBalanced

and restarted with the command:

pyfr restart -b cuda -P $PARTITION_NAME $MESH_FILE $SOLUTION_FILE $CONFIG_FILE

where the variables are set appropriately. Could this be a bug, or did I do something silly?

Error print out (if applicable)

minimum time step rejected

or, if I use constant time steps, NaNs, no matter how much I decrease the time step.

PyFR information

  • PyFR version: development version (729d7b6cfb334e36745de8e96a4f79a356222277)
  • OS: Linux
  • System description: 4 × Nvidia A100, 2 × Zen3 EPYC 7513
  • Compiler and version: gcc/13.2.0, openmpi/5.0.7, cuda/12.6.2, hdf5/1.14.5, python/3.11.9
  • Backend (if applicable): CUDA

Git diff that reproduces bug

Do you have a procedure where starting up without changing partitioning works but changing partitioning doesn’t (specifically, where the one and only difference is a change of partitioning)? What does the flow look like before it diverges?

Regards, Freddie.

Unfortunately, the mesh is too large to restart unpartitioned at p4. However, even at p1 the restart fails.

Some additional findings:

  1. I was not able to reproduce the issue with the tutorial case 3d-triangular-airfoil. I imagine it may have something to do with having hex + pri elements, because the tutorial case has only hex elements. That is, the issue may be related to -p metis -e balanced and hex + pri elements on a large mesh (generated in Pointwise). I use METIS 5.2.1, which I compiled myself.
  2. Since I suspected -p metis -e balanced, I tried -e hex:6 -e pri:5. The restart failed again.
  3. I tried reconstructing the partitioning from the state written by the p4 simulation. Restarting from that reconstruction failed again.
  4. I also tried SCOTCH (-p scotch -e hex:6 -e pri:5), with no success. Same failure.

Next time, I will generate a small test mesh with hex and pri elements and try to reproduce the issue.

Regards

Kenan

What happens if you run with that partitioning from the start (so never changing the partitioning at all, using it for both p = 1 and p = 4)? When it does fail, do you have any renderings? Is the failure on a partition interface?

Regards, Freddie.

This time I did the partitioning from the start and ran the simulation from scratch at p2. Then I tried to restart the simulation at p2 with the same partitioning, without changing anything. The restart failed. If I don't stop the simulation at that time level, it carries on without any issues.

It fails instantly during the restart attempt, without even taking a single time step. That is, there is literally no chance to view the flow field just before the blow-up.

That’s fine, also. Can you visualise the file you are trying to restart from? This will help us determine if the issue is with the restart code or the file writing code.

Regards, Freddie.

Good point. Visualizing is also impossible; when I try to export it, I get error messages:

pyfr export volume ../du96W180_BANC_noTripping.pyfrm solution_0.00010000.pyfrs export.pvtu

/scratch-grete/usr/niikceng/PyFR/PyFR_2.1.0dev/pyfr/solvers/euler/elements.py:56: RuntimeWarning: invalid value encountered in divide
vs = [rhov/rho for rhov in cons[1:-1]]
/scratch-grete/usr/niikceng/PyFR/PyFR_2.1.0dev/pyfr/solvers/euler/elements.py:56: RuntimeWarning: divide by zero encountered in divide
vs = [rhov/rho for rhov in cons[1:-1]]
/scratch-grete/usr/niikceng/PyFR/PyFR_2.1.0dev/pyfr/writers/vtk/base.py:13: RuntimeWarning: invalid value encountered in matmul
ipts = op.astype(pts.dtype) @ pts.reshape(op.shape[1], -1)

This is probably a more useful hint. After the export failed as shown above, the only .vtu file produced was that of the first partition. Opening it in ParaView, I could easily see that the density values on a few of the triangular prisms were zero. Oddly, the .vtu file of the first partition looked like the entire computational domain, which is confusing. I checked the number of elements in this partition: 300010, whereas the full mesh has 3000600.
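As a side note, .pyfrs solution files are HDF5 containers, so the raw file can be scanned directly for zeroed or NaN blocks before any export is attempted. Below is a quick diagnostic sketch using h5py; it makes no assumptions about the dataset names and simply visits every floating-point dataset (it reads each dataset fully into memory, so it is only practical for files that fit in RAM):

    # check_pyfrs.py: flag floating-point datasets that are all zero or contain NaNs.
    # Usage: python check_pyfrs.py solution_0.00010000.pyfrs
    import sys

    import h5py
    import numpy as np

    def check(name, obj):
        # visititems passes every group and dataset; only inspect float datasets
        if isinstance(obj, h5py.Dataset) and obj.dtype.kind == 'f':
            data = obj[...]
            if data.size and not np.any(data):
                print(f'{name}: all zeros')
            elif np.isnan(data).any():
                print(f'{name}: contains NaNs')

    with h5py.File(sys.argv[1], 'r') as f:
        f.visititems(check)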

This suggests that the file writing is failing (as in you’re getting blocks of zeros instead of actual values).

Can you confirm what file system is being used?

Regards, Freddie.

The file system is Lustre, as explained below under the title "Lustre-RZG":

I have just tried adjusting the stripe count and size, as in the example at the bottom of the page. I have no idea what the ideal values should be.

Next time, maybe I can try another available file system, namely the "ceph-ssd" file system.

Thanks!

Kenan

Update: on another filesystem (ceph-ssd), the job failed while writing the t=0 solution output:

Traceback (most recent call last):
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/ENV/pyfr_v2.1.0dev/bin/pyfr", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 242, in main
    args.process(args)
[1766413715.665341] [ggpu126:3851517:0]    cuda_copy_md.c:379  UCX  ERROR   attempt to allocate cuda memory without active context
[1766413715.665347] [ggpu126:3851517:0]         uct_mem.c:158  UCX  ERROR   failed to allocate 536870912 bytes using md cuda_cpy for ucp_rndv_frags: No such device
[1766413715.665350] [ggpu126:3851517:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=ucp_rndv_frags) chunk: Out of memory
[1766413715.665530] [ggpu126:3851519:0]    cuda_copy_md.c:687  UCX  ERROR cuMemGetAddressRange(0x14e0bb9df600) error: invalid device context
[1766413715.665541] [ggpu126:3851519:0]          ucp_mm.c:574  UCX  ERROR uct_md_mem_query(dmabuf address 0x14e0bb9df600 length 121200) failed: Address not valid
[1766413715.665646] [ggpu120:3309122:0]    cuda_copy_md.c:687  UCX  ERROR cuMemGetAddressRange(0x1489f2962a00) error: invalid device context
[1766413715.665662] [ggpu120:3309122:0]          ucp_mm.c:574  UCX  ERROR uct_md_mem_query(dmabuf address 0x1489f2962a00 length 102960) failed: Address not valid
[1766413715.666129] [ggpu126:3851518:0]    cuda_copy_md.c:379  UCX  ERROR   attempt to allocate cuda memory without active context
[1766413715.666138] [ggpu126:3851518:0]         uct_mem.c:158  UCX  ERROR   failed to allocate 536870912 bytes using md cuda_cpy for ucp_rndv_frags: No such device
[1766413715.666141] [ggpu126:3851518:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=ucp_rndv_frags) chunk: Out of memory

...

  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 548, in process_run
    _process_common(args, None, Inifile.load(args.cfg))
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 544, in _process_common
    solver.run()
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/base.py", line 170, in run
    self.advance_to(t)
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/std/controllers.py", line 181, in advance_to

It also created odd output file names like pyfr-646052d5-a0f0-4637-a96d-6b794b041f7a.pyfrs.

Lustre should be fine so long as it is configured correctly. Details regarding I/O in PyFR can be found here:

I would first suggest switching over to synchronous I/O and seeing if this changes anything. If this fails I would follow up with your sysadmin.

Regards, Freddie.

Those file names are expected and are temporary names used while PyFR is writing out a file. Once the writing is complete the file is renamed into place. This ensures that we do not clobber a good output file should PyFR be terminated mid-write.
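For illustration, the general pattern being described is roughly the following write-then-rename idiom (just a sketch, not PyFR's actual writer code; the helper name and the temporary-name format are made up):

    import os
    import uuid

    def atomic_write(path, data: bytes):
        # Write to a uniquely named temporary file in the target directory...
        tmp = os.path.join(os.path.dirname(path) or '.', f'pyfr-{uuid.uuid4()}.tmp')
        with open(tmp, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # ...and only rename it into place once the write has fully completed.
        # os.replace is atomic when tmp and path are on the same file system,
        # so a crash mid-write can never clobber an existing good output file.
        os.replace(tmp, path)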

Regards, Freddie.

Setting async-timeout=0 worked on Lustre (not on Ceph)!

However, this time the following error message, which is probably unrelated, comes up:

    self.advance_to(t)
  File "/mnt/vast-nhr/home/kenan.cengiz01/u18163/PyFR/PyFR_2.1.0dev/pyfr/integrators/std/controllers.py", line 183, in advance_to
    err = self._errest(idxcurr, idxprev, idxerr)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/vast-nhr/home/kenan.cengiz01/u18163/PyFR/PyFR_2.1.0dev/pyfr/integrators/std/controllers.py", line 148, in _errest
    err = math.sqrt(float(err) / self._gndofs)
                    ^^^^^^^^^^
TypeError: only 0-dimensional arrays can be converted to Python scalars

The err variable looks like an array with a single value. Changing the code (sometimes?) to float(err[0]), or using the setting pi=none, helps. I don't understand why this problem occurs, because the controllers.py file has not been changed in two years, and I have been using PyFR and its PI controller without problems for two years :slight_smile: . Maybe my current MPI or mpi4py installation does not perform the sum reduction as expected.
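A minimal sketch of what seems to be happening (the version behaviour is as reported later in this thread, and err here is only a stand-in for the reduced error estimate):

    import numpy as np

    # Stand-in for the reduced error estimate; the hypothesis above is that the
    # MPI reduction hands back a shape-(1,) array rather than a 0-d array/scalar.
    err = np.array([1.2345e-6])

    try:
        val = float(err)      # what the failing line in _errest effectively does
    except TypeError as exc:  # raised with the NumPy version reported below (2.4.0)
        print(f'float() on a shape-{err.shape} array fails: {exc}')
        val = err.item()      # portable: .item() works for any single-element array
                              # float(err[0]), the workaround mentioned above, too

    print(val)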

Can you confirm your NumPy version? I know NumPy are looking to change some of the default behaviours here.

Regards, Freddie.

The NumPy version is 2.4.0.

It seems it was 2.3.5 previously. At some point today I recreated my Python environment from scratch out of desperation; the new NumPy version must have come in with that.

Regards

Kenan

This seems to be a breaking change with v2.4.0. I’ll get a fix into the v3 release candidate.

Regards, Freddie.

A fix for this issue is in:

I would like to try and debug the other issues you’re observing. Can you confirm what does/does not work and we can go from there.

Regards, Freddie.

Thanks!!

Everything works fine now, except the file write. Currently, writing with async-timeout=0 works only on the ceph-ssd system; on the lustre-rzg system it keeps failing. Sorry for the confusion: it seems I stated it the wrong way round earlier. That was a mistake. Corrected: Lustre fails; ceph-ssd works.

I should probably discuss this with our cluster support, although I am not really concerned about file-write performance. In fact, I write the solution output file only once per cluster job.

Regards

Kenan

On the Lustre side can you confirm what happens if you just run with a single rank? (Any of the test cases with PyFR will be fine here.) Then, what happens if you go to two ranks on the same node?

Regards, Freddie.

With the test case 3d-triangular-airfoil, restarting works on Lustre, with and without the async-timeout=0 option.

Regards, Kenan

What happens if there are two ranks? Do things still work?

Also, in the CEPH case can you confirm if a non-zero async timeout works?

Regards, Freddie.