MPI collect wait times information



I was trying to detect load imbalances, but when using collect wait times in the backend section, my solution file only has this information:

fields = rho,rhou,rhov,rhow,E
prefix = soln

rank-allocation = 0,1,2,3,4

tcurr = 0.02
wall-time = 538.5784029960632
plugin-wall-time-nancheck = 0.03372669219970703
plugin-wall-time-writer = 0.018815994262695312
plugin-wall-time-other = 0.1612997055053711
nsteps = 4096
nacptsteps = 4087
nrjctsteps = 9
nfevals = 20480

There is no information regarding MPI ranks; how can I get it?
Moreover, does the key plugin-wall-time-****** refer to the cumulative time that plugin spends outside of pure computation (matrix multiplications, etc.)?


This suggests you do not have wait-time collection enabled. Please enable it as:

collect-wait-times = True
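For reference, in the configuration file this option sits in the `[backend]` section mentioned above (a minimal fragment; the `precision` key is included only for context):

```ini
[backend]
precision = double
collect-wait-times = true
```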

Regards, Freddie.

Hi Freddie,

I have it enabled. Could the issue be related to the fact that I’m running on macOS?


It should not make a difference.

Are you sure you are running a simulation with multiple ranks?

Regards, Freddie.

Yes I am. From the documentation:

Note that the number of graphs depends on the system, and not all graphs initiate MPI requests. The average amount of time each rank spends waiting for MPI requests per right hand side evaluation can be obtained by vertically summing all of the -median fields together.

Could this be my case?

I just tested the code on macOS and got the expected output. Can you dump the configuration file from the solution (h5dump -d /config ...pyfrs)?

Regards, Freddie.

Yup, and the output is the one I attached above.


What you provided is the /stats record. I would like to see the /config record.

Regards, Freddie.

Got it; reinstalling PyFR solved the issue (I don’t really know why, to be honest).

Thanks a lot for the help. One last question: how do you interpret these data to identify load imbalances? I have 5 partitions and this is the output:

rhs-graph-0-mean = 0.000893,0.00105,0.00092,0.00193,0.000763
rhs-graph-0-stdev = 0.00112,0.00126,0.00113,0.00162,0.00108
rhs-graph-0-median = 0.000607,0.000825,0.000637,0.00171,0.000433
rhs-graph-1-mean = 0.00281,0.00281,0.00277,0.00456,0.00275
rhs-graph-1-stdev = 0.00279,0.00282,0.0028,0.00349,0.00297
rhs-graph-1-median = 0.00241,0.00239,0.0023,0.00415,0.00199
rhs-graph-2-mean = 0,0,0,0,0
rhs-graph-2-stdev = 0,0,0,0,0
rhs-graph-2-median = 0,0,0,0,0
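Summing the -median fields vertically, as the documentation suggests, gives the per-rank wait totals (a quick sketch using the values above):

```python
# Per-rank wait-time medians copied from the stats record above
graphs = [
    [0.000607, 0.000825, 0.000637, 0.00171, 0.000433],  # rhs-graph-0-median
    [0.00241, 0.00239, 0.0023, 0.00415, 0.00199],       # rhs-graph-1-median
    [0, 0, 0, 0, 0],                                    # rhs-graph-2-median
]

# Vertically sum the medians to get the total wait per rank
totals = [sum(col) for col in zip(*graphs)]
print(totals)  # the fourth rank waits the longest, ~0.00586 s
```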

Given that the medians of the graphs have roughly the same order of magnitude, can I say that the partitions are actually well balanced?


The second to last partition (4 of 5) may have too few elements: if you sum the median values, it spends more time waiting than the other ranks, implying it is finishing its work sooner.

That said, the difference is not huge, and so the balance is likely still reasonable.

Regards, Freddie.

Partition 4 has a total wait time of 0.00586, while partition 5 has 0.002423. What would you consider a huge difference? An order of magnitude? If, for example, p5 had a total wait time of 1e-5, how would you repartition those elements?

Assuming the data from your first post holds, we do 20480 RHS evaluations. The extra wait time for p4 is (0.00586 - 0.002423)*20480, which is ~70s. Your wall time is ~538s, so this suggests that maybe a 10% improvement in runtime is possible if everything balanced out.
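As a sanity check on the arithmetic (values copied from the posts above):

```python
nfevals = 20480          # RHS evaluations from the first post
wall_time = 538.578      # total wall time in seconds
wait_p4, wait_p5 = 0.00586, 0.002423  # summed median wait times

# Extra time rank 4 spends waiting relative to rank 5, over the whole run
extra = (wait_p4 - wait_p5) * nfevals
print(f'{extra:.1f} s extra, roughly a tenth of the wall time')
```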

Regards, Freddie.

Ok, thanks a lot, but how can I try to repartition in a more balanced way? My grid only has hexes and prisms. Is there a particular thought process in assigning weights to METIS?


This is covered in the performance tuning section of the documentation.

Regards, Freddie.

I’m aware of the documentation, but the precise actions to take are still unclear to me.

Element types which are more computationally intensive should be assigned a larger weight than those that are less intensive.


Weights can be specified when partitioning the mesh as `-e shape:weight`. For example, if on a particular system a quadrilateral is found to be 50% more expensive than a triangle, this can be specified as:

pyfr partition -e quad:3 -e tri:2 ...
  • Could you explain how quads being 50% more expensive than triangles relates to those weights?
If precise profiling data is not available regarding the performance of each element type in a given configuration
  • How can I get those data per element type to assess how costly they are on my hardware?


It is encoded by `-e quad:3 -e tri:2`, since 3/2 = 1.5, and thus we are giving quads a 50% larger weight than tris.

You will need to run a simulation (with the same settings as your actual simulation) on a pure hex mesh and get the timing data, and then again on a pure prism mesh. Then you can obtain a rough cost per element.
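Once the two timings are in hand, the cost ratio can be turned into small integer weights, e.g. (a sketch with made-up per-element costs):

```python
from fractions import Fraction

# Hypothetical per-element costs measured from a pure hex run and a
# pure prism run with otherwise identical settings (made-up numbers)
cost_hex, cost_pri = 3.0e-6, 2.4e-6   # seconds per element per RHS

# Express the cost ratio as a small integer fraction for the weights
ratio = Fraction(cost_hex / cost_pri).limit_denominator(10)
print(f'pyfr partition -e hex:{ratio.numerator} -e pri:{ratio.denominator} ...')
```

With these numbers the ratio is 1.25, giving `-e hex:5 -e pri:4`.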

Regards, Freddie.

Ok, thanks a lot, but

  • can this not be known a priori by accounting only for the polynomial order? E.g. via the ratio of total DoF between the element types

  • to account properly for the expensiveness of each element type, should I also remesh so as to have roughly the same number of DoF, given that it differs per element type?


The number of DoF is not an appropriate metric as it does not account for the sparsity present in some operators (hex operators are sparser than prisms, for example), or the differing number of flux points (prisms tend to have more flux points per DoF than hexes).
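The flux-point effect can be seen from the point counts alone (a sketch, assuming the standard (p+1)^3 solution points for a tensor-product hex and (p+1) layered degree-p triangles for a prism):

```python
def hex_counts(p):
    # Tensor-product hex: (p+1)^3 solution points, 6 quad faces of flux points
    return (p + 1)**3, 6 * (p + 1)**2

def pri_counts(p):
    # Prism: (p+1) layers of a degree-p triangle; 2 tri + 3 quad faces
    tri = (p + 1) * (p + 2) // 2
    return (p + 1) * tri, 2 * tri + 3 * (p + 1)**2

for p in (2, 3, 4):
    (hd, hf), (pd, pf) = hex_counts(p), pri_counts(p)
    # Prisms consistently have more flux points per DoF than hexes
    print(p, round(hf / hd, 2), round(pf / pd, 2))
```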

This is why the performance tuning guide suggests benchmarking to find the relative cost and weight accordingly.

Regards, Freddie.

  • once benchmarked, in your experience, are the relative weights mostly hardware dependent, case dependent, or both?

  • Lastly,

plugin-wall-time-nancheck = 0.03372669219970703
plugin-wall-time-writer = 0.018815994262695312
plugin-wall-time-other = 0.1612997055053711
nsteps = 4096

do those times refer to the total time spent running the plugins, or do they have to be multiplied by the number of steps / nfevals?


The weights are typically hardware and case dependent.

The wall clock time given for the plugins is the total amount of time spent in them.

Regards, Freddie.
