Partitions vs. MPI ranks in a large run

Hello,

I’ve been running PyFR without problems on a mesh with around 25M elements on up to ~2k nodes (~12k GPUs). A further experiment on 4096 nodes, however, leads to the following error:

1: 0: Traceback (most recent call last):
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/bin/pyfr", line 8, in <module>
1: 0:     sys.exit(main())
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/main.py", line 112, in main
1: 0:     args.process(args)
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/main.py", line 245, in process_run
1: 0:     args, NativeReader(args.mesh), None, Inifile.load(args.cfg)
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/main.py", line 223, in _process_common
1: 0:     rallocs = get_rank_allocation(mesh, cfg)
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/rank_allocator.py", line 14, in get_rank_allocation
1: 0:     return subclass_where(BaseRankAllocator, name=name)(mesh, cfg)
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/rank_allocator.py", line 32, in __init__
1: 0:     '{1} MPI ranks'.format(nparts, comm.size))
1: 0: RuntimeError: Mesh has 24575 partitions but running with 24576 MPI ranks

There are no errors when the mesh is partitioned, only when the case is run.

Any ideas about what might be happening?

Many thanks,
Eloisa

When running in parallel the number of partitions in the mesh must be equal to the number of MPI ranks. Here the mesh appears to contain 24575 partitions while the case was launched with 24576 MPI ranks, hence the error.
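
For example, if the mesh has been split into 24576 pieces it must then be launched on exactly 24576 ranks. Schematically (the mesh, config and backend names here are only placeholders for your own):

pyfr partition 24576 mesh.pyfrm .
mpiexec -n 24576 pyfr run -b cuda mesh.pyfrm config.ini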

(If the mesh was indeed partitioned into 24576 pieces, the error indicates that one of the partitions is isolated from the others; that is, it does not communicate with any other rank. This typically indicates an issue with the mesh itself, as it suggests that there are two independent problems in the mesh as opposed to one big problem.)

Thanks for the clarification!

The mesh was indeed partitioned into 24576 pieces (at least, that is what was requested). I am not surprised about the disconnected component, as this is quite a complex geometry. However, what are the implications? Can this case not be run on this number of ranks at all?

Thanks,
Eloisa

It would certainly be possible to extend PyFR to support meshes with disconnected components. However, if a mesh has disconnected components it is usually better to split the mesh itself: rather than running one simulation you run N independent simulations, where N is the number of components. This is more robust, more flexible, and typically performs better.

If this is not practical you can try repartitioning the mesh a few times. METIS makes use of a random seed, so repartitioning may produce a decomposition in which no disconnected component ends up confined to a single partition.

Interesting suggestion; I will give it a try and report back.

Although complex, the geometry is not actually disconnected, so I am not sure there is a meaningful way to split this case into independent simulations. The disconnection must be an artifact of the specific random seed, algorithm, and number of partitions.

Would you be able to confirm which partitioner you are using and, if it is METIS, whether it has been configured to use 32- or 64-bit integers?

It appears that METIS is susceptible to 32-bit integer overflows internally when partitioning moderately sized meshes.

Regards, Freddie.

Thanks, Freddie! I am using parmetis, with 32-bit integers. I’ll switch and see what happens.

I’ve also tried repartitioning the mesh a few times (as we had discussed), to no avail.

Best,
Eloisa

You can check the integer width by inspecting the output of:

$ cat /usr/include/metis.h | grep '#define IDXTYPEWIDTH'

which should give

#define IDXTYPEWIDTH 64
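
If METIS was installed into a conda environment rather than system-wide, the header will live under the environment's include directory instead; assuming $CONDA_PREFIX points at the active environment, something along the lines of

$ grep '#define IDXTYPEWIDTH' "$CONDA_PREFIX/include/metis.h"

should give the same information.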

An alternative to recompiling METIS is to switch to Scotch. Whilst it uses 32-bit integers for its public interface, it does not appear to suffer from overflow internally (and since our graphs contain well below 2 billion vertices this is not a concern).
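
If you want to go down that route, the invocation should look something like (the mesh file name and output directory here are placeholders):

pyfr partition -p scotch 24576 mesh.pyfrm .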

Regards, Freddie.

Thanks!

I am trying the first workaround (switching to 64-bit parmetis), but it will take a while before the run gets through the queue as it is quite large. Depending on how this goes, I might switch to Scotch.

Eloisa

No luck with 64-bit parmetis… I get the same error about a mismatch between number of partitions and number of ranks. I will try Scotch next.

Note that my mesh is quite complex, so this may be a genuine complaint (there may really be a disconnected component in this decomposition).

After partitioning, would you be able to attach the output of running h5ls on the resulting mesh file? This will enable us to get a better handle on what is going on.
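
If the full listing is too unwieldy to share, even a count of the per-partition datasets would be informative; assuming the partitioned mesh is called mesh.pyfrm (a placeholder) and follows the usual spt_<etype>_p<n> naming, something along the lines of

$ h5ls mesh.pyfrm | grep -c 'spt_tet_p'

would show how many tet partitions actually made it into the file.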

Regards, Freddie.

Sure thing! It’s 15MB worth of metadata though… any specific part you would like to see?

I can confirm that some dataset groups (such as spt_tet_*) only have 24575 members.

Best,
Eloisa

Assuming your mesh only has a single element type, this implies that we are not seeing a disconnected region but instead that the partitioner is not making use of all of the partitions. I believe METIS is 'allowed' to do this (or rather, if you ask for N partitions it does not guarantee that they will all be populated).

Your options here are to switch to Scotch, or to tweak some of the METIS parameters to see if this resolves the issue; for example, switching from recursive bisection to k-way partitioning. This can be done as:

pyfr partition -p metis --popt=ptype:kway ...

There are some other options we expose which can also have an impact.

Regards, Freddie.

I’ve just tried the extra option, and it has not worked; I still end up with one partition fewer than requested.

(An additional piece of information: the same h5ls command returns the right number in the mesh files with up to 12288 partitions.)

It is certainly quite an aggressive partitioning. With ~25M elements you’re looking at ~1000 elements per partition, which, when running on a V100, is a factor of ~20 below where you want to be given the number of CUDA cores and the amount of work required to keep the SMs busy.
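
(Roughly: 25 × 10^6 elements / 24576 partitions ≈ 1000 elements per partition, whereas something on the order of 20000 elements per GPU would be a more comfortable target.)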

Regards, Freddie.

Definitely. This was intentional: the run above is the extreme case of a strong-scaling study, so a breakdown due to problem size was ultimately expected.