Restarting from partitioned mesh fails

Nature of bug

I am trying the newest development version because I am interested in some of the new features.

I ran my simulation case on A100 cards (classical external flow over an airfoil section), starting at p0 to get rid of the initial transients, then at p2 after partitioning with the new partitioning features. From the last flow state I tried to restart at p4, and it always fails instantly due to instabilities. Even if I use p1, and even on a single core, restarting from the solution output by the p4 simulation fails. By contrast, if I keep running at p4 for longer with the same conditions, that failing time instant passes by without problems. That is to say, the problem is all about restarting from the p4 solution generated with the partitioning.

I partitioned with the command:

pyfr partition add -e balanced -p metis du96W180_noTripping.pyfrm 8 p4partitioningBalanced

and restarted with the command:

pyfr restart -b cuda -P $PARTITION_NAME $MESH_FILE $SOLUTION_FILE $CONFIG_FILE

where the variables are set appropriately. Could this be a bug, or did I do something silly?

Error print out (if applicable)

minimum time step rejected

or, if I use constant time steps, NaNs, no matter how much I decrease the time step.

PyFR information

  • PyFR version: development version (729d7b6cfb334e36745de8e96a4f79a356222277)
  • OS: Linux
  • System description: 4 × Nvidia A100, 2 × Zen3 EPYC 7513
  • Compiler and version: gcc/13.2.0, openmpi/5.0.7, cuda/12.6.2, hdf5/1.14.5, python/3.11.9
  • Backend (if applicable): CUDA

Git diff that reproduces bug

Do you have a procedure where starting up without changing partitioning works but changing partitioning doesn’t (specifically, where the one and only difference is a change of partitioning)? What does the flow look like before it diverges?

Regards, Freddie.

Unfortunately, the mesh is too large to restart unpartitioned at p4. However, even at p1 the restart fails.

Some additional findings:

  1. I was not able to reproduce the issue with the tutorial case 3d-triangular-airfoil. I imagine it may have something to do with having hex + pri elements, because the tutorial case has only hex elements. That is, the issue may be related to -p metis -e balanced and hex + pri elements on a large mesh (generated in Pointwise). I use METIS 5.2.1, which I compiled myself.
  2. Since I suspected -p metis -e balanced, I tried -e hex:6 -e pri:5. The restart failed again.
  3. I tried reconstructing the partitioning from the state written by the p4 simulation. Restarting from that reconstruction failed again.
  4. I also tried SCOTCH (-p scotch -e hex:6 -e pri:5), with no success. Same failure.

Next time, I will generate a small test mesh with hex and pri elements and try to reproduce the issue.

Regards

Kenan

What happens if you run with that partitioning from the start (so never changing the partitioning at all, using it for both p = 1 and p = 4)? When it does fail, do you have any renderings? Is the failure on a partition interface?

Regards, Freddie.

This time I did the partitioning from the start and ran the simulation from scratch at p2. Then I tried to restart the simulation at p2 with the same partitioning, without changing anything. The restart failed. If I don't stop the simulation at that time level, it carries on without any issues.

It fails instantly during the restart attempt, without even taking a single time step. That is, there is literally no chance to view the flow field just before the blow-up.

That’s fine, also. Can you visualise the file you are trying to restart from? This will help us determine if the issue is with the restart code or the file writing code.

Regards, Freddie.

Good point. Visualizing is also impossible; when I try to export it, I get error messages:

pyfr export volume ../du96W180_BANC_noTripping.pyfrm solution_0.00010000.pyfrs export.pvtu

/scratch-grete/usr/niikceng/PyFR/PyFR_2.1.0dev/pyfr/solvers/euler/elements.py:56: RuntimeWarning: invalid value encountered in divide
vs = [rhov/rho for rhov in cons[1:-1]]
/scratch-grete/usr/niikceng/PyFR/PyFR_2.1.0dev/pyfr/solvers/euler/elements.py:56: RuntimeWarning: divide by zero encountered in divide
vs = [rhov/rho for rhov in cons[1:-1]]
/scratch-grete/usr/niikceng/PyFR/PyFR_2.1.0dev/pyfr/writers/vtk/base.py:13: RuntimeWarning: invalid value encountered in matmul
ipts = op.astype(pts.dtype) @ pts.reshape(op.shape[1], -1)

This is probably a more useful hint. After the export failed as shown above, the only .vtu file produced was that of the first partition. Opening it in ParaView, I could easily see that the density values on a few of the triangular prisms were zero. Oddly, the .vtu file of the first partition looked like the entire computational domain, which is confusing. I checked the number of elements in this partition: 300010, whereas the full mesh has 3000600.
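As a side note, .pyfrs solution files are HDF5 containers, so the raw file can be scanned directly for zeroed or NaN blocks before any export is attempted. Below is a quick diagnostic sketch using h5py; it makes no assumptions about the dataset names and simply visits every floating-point dataset (it reads each dataset fully into memory, so it is only practical for files that fit in RAM):

    # check_pyfrs.py: flag floating-point datasets that are all zero or contain NaNs.
    # Usage: python check_pyfrs.py solution_0.00010000.pyfrs
    import sys

    import h5py
    import numpy as np

    def check(name, obj):
        # visititems passes every group and dataset; only inspect float datasets
        if isinstance(obj, h5py.Dataset) and obj.dtype.kind == 'f':
            data = obj[...]
            if data.size and not np.any(data):
                print(f'{name}: all zeros')
            elif np.isnan(data).any():
                print(f'{name}: contains NaNs')

    with h5py.File(sys.argv[1], 'r') as f:
        f.visititems(check)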

This suggests that the file writing is failing (as in you’re getting blocks of zeros instead of actual values).

Can you confirm what file system is being used?

Regards, Freddie.

The file system is Lustre, as explained below under the title "Lustre-RZG":

I have just tried adjusting the stripe count and size, as in the example at the bottom of the page. I have no idea what the ideal values should be.

Next time, maybe I can try another available file system, namely the "ceph-ssd" file system.

Thanks!

Kenan

Update: on another filesystem (ceph-ssd), the job failed while writing the t=0 solution output:

Traceback (most recent call last):
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/ENV/pyfr_v2.1.0dev/bin/pyfr", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 242, in main
    args.process(args)
[1766413715.665341] [ggpu126:3851517:0]    cuda_copy_md.c:379  UCX  ERROR   attempt to allocate cuda memory without active context
[1766413715.665347] [ggpu126:3851517:0]         uct_mem.c:158  UCX  ERROR   failed to allocate 536870912 bytes using md cuda_cpy for ucp_rndv_frags: No such device
[1766413715.665350] [ggpu126:3851517:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=ucp_rndv_frags) chunk: Out of memory
[1766413715.665530] [ggpu126:3851519:0]    cuda_copy_md.c:687  UCX  ERROR cuMemGetAddressRange(0x14e0bb9df600) error: invalid device context
[1766413715.665541] [ggpu126:3851519:0]          ucp_mm.c:574  UCX  ERROR uct_md_mem_query(dmabuf address 0x14e0bb9df600 length 121200) failed: Address not valid
[1766413715.665646] [ggpu120:3309122:0]    cuda_copy_md.c:687  UCX  ERROR cuMemGetAddressRange(0x1489f2962a00) error: invalid device context
[1766413715.665662] [ggpu120:3309122:0]          ucp_mm.c:574  UCX  ERROR uct_md_mem_query(dmabuf address 0x1489f2962a00 length 102960) failed: Address not valid
[1766413715.666129] [ggpu126:3851518:0]    cuda_copy_md.c:379  UCX  ERROR   attempt to allocate cuda memory without active context
[1766413715.666138] [ggpu126:3851518:0]         uct_mem.c:158  UCX  ERROR   failed to allocate 536870912 bytes using md cuda_cpy for ucp_rndv_frags: No such device
[1766413715.666141] [ggpu126:3851518:0]           mpool.c:269  UCX  ERROR Failed to allocate memory pool (name=ucp_rndv_frags) chunk: Out of memory

...

  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 548, in process_run
    _process_common(args, None, Inifile.load(args.cfg))
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 544, in _process_common
    solver.run()
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/base.py", line 170, in run
    self.advance_to(t)
  File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/std/controllers.py", line 181, in advance_to

It also created odd output file names like pyfr-646052d5-a0f0-4637-a96d-6b794b041f7a.pyfrs.

Lustre should be fine so long as it is configured correctly. Details regarding I/O in PyFR can be found here:

I would first suggest switching over to synchronous I/O and seeing if this changes anything. If this fails I would follow up with your sysadmin.

Regards, Freddie.

Those file names are expected and are temporary names used while PyFR is writing out a file. Once the writing is complete the file is renamed into place. This ensures that we do not clobber a good output file should PyFR be terminated mid-write.
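For illustration, the general pattern being described is roughly the following write-then-rename idiom (just a sketch, not PyFR's actual writer code; the helper name and the temporary-name format are made up):

    import os
    import uuid

    def atomic_write(path, data: bytes):
        # Write to a uniquely named temporary file in the target directory...
        tmp = os.path.join(os.path.dirname(path) or '.', f'pyfr-{uuid.uuid4()}.tmp')
        with open(tmp, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # ...and only rename it into place once the write has fully completed.
        # os.replace is atomic when tmp and path are on the same file system,
        # so a crash mid-write can never clobber an existing good output file.
        os.replace(tmp, path)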

Regards, Freddie.

Setting async-timeout=0 worked on Lustre (not on Ceph)!

However, this time the following error message, which is probably unrelated, comes up:

    self.advance_to(t)
  File "/mnt/vast-nhr/home/kenan.cengiz01/u18163/PyFR/PyFR_2.1.0dev/pyfr/integrators/std/controllers.py", line 183, in advance_to
    err = self._errest(idxcurr, idxprev, idxerr)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/vast-nhr/home/kenan.cengiz01/u18163/PyFR/PyFR_2.1.0dev/pyfr/integrators/std/controllers.py", line 148, in _errest
    err = math.sqrt(float(err) / self._gndofs)
                    ^^^^^^^^^^
TypeError: only 0-dimensional arrays can be converted to Python scalars

The err variable looks like an array with a single value. Changing the code (sometimes?) to float(err[0]), or using the setting pi=none, helps. I don't understand why this problem occurs, because the controllers.py file has not been changed in two years, and I have been using PyFR and its PI controller without problems for two years :slight_smile: . Maybe my current MPI or mpi4py installation does not perform the sum reduction as expected.
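A minimal sketch of what seems to be happening (the version behaviour is as reported later in this thread, and err here is only a stand-in for the reduced error estimate):

    import numpy as np

    # Stand-in for the reduced error estimate; the hypothesis above is that the
    # MPI reduction hands back a shape-(1,) array rather than a 0-d array/scalar.
    err = np.array([1.2345e-6])

    try:
        val = float(err)      # what the failing line in _errest effectively does
    except TypeError as exc:  # raised with the NumPy version reported below (2.4.0)
        print(f'float() on a shape-{err.shape} array fails: {exc}')
        val = err.item()      # portable: .item() works for any single-element array
                              # float(err[0]), the workaround mentioned above, too

    print(val)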

Can you confirm your NumPy version? I know NumPy are looking to change some of the default behaviours here.

Regards, Freddie.

The NumPy version is 2.4.0.

It seems it was 2.3.5 previously. At some point today I recreated my Python environment from scratch out of desperation; the new NumPy version must have come in with that.

Regards

Kenan

This seems to be a breaking change with v2.4.0. I’ll get a fix into the v3 release candidate.

Regards, Freddie.

A fix for this issue is in:

I would like to try and debug the other issues you’re observing. Can you confirm what does/does not work and we can go from there.

Regards, Freddie.

Thanks!!

Everything works fine now, except the file write. Currently, writing with async-timeout=0 works only on the ceph-ssd system; on the lustre-rzg system it keeps failing. Sorry for the confusion: it seems I stated it the wrong way round earlier. That was a mistake. Corrected: Lustre fails; ceph-ssd works.

I should probably discuss this with our cluster support, although I am not really concerned about file-write performance. In fact, I write the solution output file only once per cluster job.

Regards

Kenan

On the Lustre side can you confirm what happens if you just run with a single rank? (Any of the test cases with PyFR will be fine here.) Then, what happens if you go to two ranks on the same node?

Regards, Freddie.

With the test case 3d-triangular-airfoil, restarting works on Lustre, with and without the async-timeout=0 option.

Regards, Kenan

What happens if there are two ranks? Do things still work?

Also, in the CEPH case can you confirm if a non-zero async timeout works?

Regards, Freddie.