MPI collect wait times information



I was trying to detect load imbalances, but when using collect wait times in the backend section, my solution file only has this information:

fields = rho,rhou,rhov,rhow,E
prefix = soln

rank-allocation = 0,1,2,3,4

tcurr = 0.02
wall-time = 538.5784029960632
plugin-wall-time-nancheck = 0.03372669219970703
plugin-wall-time-writer = 0.018815994262695312
plugin-wall-time-other = 0.1612997055053711
nsteps = 4096
nacptsteps = 4087
nrjctsteps = 9
nfevals = 20480

There is no information regarding MPI ranks; how can I get it?
Moreover, does the key plugin-wall-time-****** refer to the cumulative time that plugin spends outside of pure computation (matrix multiplications, etc.)?


This suggests you do not have wait-time collection enabled. Please enable it as:

collect-wait-times = True
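For reference, in the configuration file this option sits in the `[backend]` section mentioned above (a minimal fragment; the `precision` key is included only for context):

```ini
[backend]
precision = double
collect-wait-times = true
```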

Regards, Freddie.

Hi Freddie,

I have it enabled. Could the issue be related to the fact that I’m running on macOS?


It should not make a difference.

Are you sure you are running a simulation with multiple ranks?

Regards, Freddie.

Yes I am. From the documentation:

Note that the number of graphs depends on the system, and not all graphs initiate MPI requests. The average amount of time each rank spends waiting for MPI requests per right hand side evaluation can be obtained by vertically summing all of the -median fields together.

Could this be my case?

I just tested the code on macOS and got the expected output. Can you dump the configuration file from the solution (h5dump -d /config ...pyfrs)?

Regards, Freddie.

Yup, and the output is the one I attached above.


What you provided is the /stats record. I would like to see the /config record.

Regards, Freddie.

Got it; reinstalling PyFR solved the issue (I don’t really know why, to be honest).

Thanks a lot for the help. One last question: how do you interpret these data to identify load imbalances? I have 5 partitions and this is the output:

rhs-graph-0-mean = 0.000893,0.00105,0.00092,0.00193,0.000763
rhs-graph-0-stdev = 0.00112,0.00126,0.00113,0.00162,0.00108
rhs-graph-0-median = 0.000607,0.000825,0.000637,0.00171,0.000433
rhs-graph-1-mean = 0.00281,0.00281,0.00277,0.00456,0.00275
rhs-graph-1-stdev = 0.00279,0.00282,0.0028,0.00349,0.00297
rhs-graph-1-median = 0.00241,0.00239,0.0023,0.00415,0.00199
rhs-graph-2-mean = 0,0,0,0,0
rhs-graph-2-stdev = 0,0,0,0,0
rhs-graph-2-median = 0,0,0,0,0
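Summing the -median fields vertically, as the documentation suggests, gives the per-rank wait totals (a quick sketch using the values above):

```python
# Per-rank wait-time medians copied from the stats record above
graphs = [
    [0.000607, 0.000825, 0.000637, 0.00171, 0.000433],  # rhs-graph-0-median
    [0.00241, 0.00239, 0.0023, 0.00415, 0.00199],       # rhs-graph-1-median
    [0, 0, 0, 0, 0],                                    # rhs-graph-2-median
]

# Vertically sum the medians to get the total wait per rank
totals = [sum(col) for col in zip(*graphs)]
print(totals)  # the fourth rank waits the longest, ~0.00586 s
```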

Given that the medians of the graphs have roughly the same order of magnitude, can I say that the partitions are actually well balanced?


The second to last partition (4 of 5) may have too few elements: if you sum the median values, it spends more time waiting than the other ranks, implying it is finishing its work sooner.

That said, the difference is not huge, and so the balance is likely still reasonable.

Regards, Freddie.

Partition 4 has a total wait time of 0.00586, while partition 5 has 0.002423. What would you consider a huge difference? An order of magnitude? If, for example, p5 had a total wait time of 1e-5, how would you repartition those elements?

Assuming the data from your first post holds, we do 20480 RHS evaluations. The extra wait time for p4 is (0.00586 - 0.002423)*20480, which is ~70s. Your wall time is ~538s, so this suggests that maybe a 10% improvement in runtime is possible if everything balanced out.
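As a sanity check on the arithmetic (values copied from the posts above):

```python
nfevals = 20480          # RHS evaluations from the first post
wall_time = 538.578      # total wall time in seconds
wait_p4, wait_p5 = 0.00586, 0.002423  # summed median wait times

# Extra time rank 4 spends waiting relative to rank 5, over the whole run
extra = (wait_p4 - wait_p5) * nfevals
print(f'{extra:.1f} s extra, roughly a tenth of the wall time')
```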

Regards, Freddie.

Ok, thanks a lot, but how can I try to repartition in a more balanced way? My grid only has hexes and prisms. Is there a particular thought process in assigning weights to METIS?


This is covered in the performance tuning section of the documentation.

Regards, Freddie.

I’m aware of the documentation, but the precise actions to take are still unclear to me.

Element types which are more computationally intensive should be assigned a larger weight than those that are less intensive.


Weights can be specified when partitioning the mesh as `-e shape:weight`. For example, if on a particular system a quadrilateral is found to be 50% more expensive than a triangle, this can be specified as:

pyfr partition -e quad:3 -e tri:2 ...
  • Could you explain how quads being 50% more expensive than triangles relates to those weights?
If precise profiling data is not available regarding the performance of each element type in a given configuration
  • How can I get those data per element type to assess how costly they are on my hardware?


It is encoded by `-e quad:3 -e tri:2`, since 3/2 = 1.5, and thus we are giving quads a 50% larger weight than tris.

You will need to run a simulation (with the same settings as your actual simulation) on a pure hex mesh and get the timing data, and then again on a pure prism mesh. Then you can obtain a rough cost per element.
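Once the two timings are in hand, the cost ratio can be turned into small integer weights, e.g. (a sketch with made-up per-element costs):

```python
from fractions import Fraction

# Hypothetical per-element costs measured from a pure hex run and a
# pure prism run with otherwise identical settings (made-up numbers)
cost_hex, cost_pri = 3.0e-6, 2.4e-6   # seconds per element per RHS

# Express the cost ratio as a small integer fraction for the weights
ratio = Fraction(cost_hex / cost_pri).limit_denominator(10)
print(f'pyfr partition -e hex:{ratio.numerator} -e pri:{ratio.denominator} ...')
```

With these numbers the ratio is 1.25, giving `-e hex:5 -e pri:4`.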

Regards, Freddie.

Ok, thanks a lot, but

  • can this not be known a priori by accounting only for the polynomial order? E.g. via the ratio of total DoF between the element types

  • to account properly for the expensiveness of each element type, should I also remesh so as to have roughly the same number of DoF, given that it differs per element type?


The number of DoF is not an appropriate metric as it does not account for the sparsity present in some operators (hex operators are sparser than prisms, for example), or the differing number of flux points (prisms tend to have more flux points per DoF than hexes).
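The flux-point effect can be seen from the point counts alone (a sketch, assuming the standard (p+1)^3 solution points for a tensor-product hex and (p+1) layered degree-p triangles for a prism):

```python
def hex_counts(p):
    # Tensor-product hex: (p+1)^3 solution points, 6 quad faces of flux points
    return (p + 1)**3, 6 * (p + 1)**2

def pri_counts(p):
    # Prism: (p+1) layers of a degree-p triangle; 2 tri + 3 quad faces
    tri = (p + 1) * (p + 2) // 2
    return (p + 1) * tri, 2 * tri + 3 * (p + 1)**2

for p in (2, 3, 4):
    (hd, hf), (pd, pf) = hex_counts(p), pri_counts(p)
    # Prisms consistently have more flux points per DoF than hexes
    print(p, round(hf / hd, 2), round(pf / pd, 2))
```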

This is why the performance tuning guide suggests benchmarking to find the relative cost and weight accordingly.

Regards, Freddie.

  • once benchmarked, in your experience, are the relative weights mostly hardware dependent, case dependent, or both?

  • Lastly,

plugin-wall-time-nancheck = 0.03372669219970703
plugin-wall-time-writer = 0.018815994262695312
plugin-wall-time-other = 0.1612997055053711
nsteps = 4096

do those times refer to the total time spent running the plugins, or do they have to be multiplied by the number of steps / nfevals?


The weights are typically hardware and case dependent.

The wall clock time given for the plugins is the total amount of time spent in them.

Regards, Freddie.
