Partitions vs. MPI ranks in a large run

Hello,

I’ve been running PyFR without problems on a mesh with around 25M elements on up to ~2k nodes (~12k GPUs). A further experiment on 4096 nodes, however, leads to the following error:

1: 0: Traceback (most recent call last):
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/bin/pyfr", line 8, in <module>
1: 0:     sys.exit(main())
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/main.py", line 112, in main
1: 0:     args.process(args)
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/main.py", line 245, in process_run
1: 0:     args, NativeReader(args.mesh), None, Inifile.load(args.cfg)
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/main.py", line 223, in _process_common
1: 0:     rallocs = get_rank_allocation(mesh, cfg)
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/rank_allocator.py", line 14, in get_rank_allocation
1: 0:     return subclass_where(BaseRankAllocator, name=name)(mesh, cfg)
1: 0:   File "/ccs/home/bentivegna/.conda/envs/pyfr/lib/python3.7/site-packages/pyfr/rank_allocator.py", line 32, in __init__
1: 0:     '{1} MPI ranks'.format(nparts, comm.size))
1: 0: RuntimeError: Mesh has 24575 partitions but running with 24576 MPI ranks

There are no errors when the mesh is partitioned, only when the case is run.

Any ideas about what might be happening?

Many thanks,
Eloisa

When running in parallel the number of partitions in the mesh must be equal to the number of MPI ranks. Here the mesh appears to contain 24575 partitions while the case was launched with 24576 MPI ranks, hence the error.
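
For example, if the mesh has been split into 24576 pieces it must then be launched on exactly 24576 ranks. Schematically (the mesh, config and backend names here are only placeholders for your own):

pyfr partition 24576 mesh.pyfrm .
mpiexec -n 24576 pyfr run -b cuda mesh.pyfrm config.ini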

(If the mesh was indeed partitioned into 24576 pieces, the error indicates that one of the partitions is isolated from the others; that is, it does not communicate with any other rank. This typically indicates an issue with the mesh itself, as it suggests that there are two independent problems in the mesh as opposed to one big problem.)

Thanks for the clarification!

The mesh was indeed partitioned into 24576 pieces (at least, that is what was requested). I am not surprised about the disconnected component, as this is quite a complex geometry. However, what are the implications? Can this case not be run on this number of ranks at all?

Thanks,
Eloisa

It would certainly be possible to extend PyFR to support meshes with disconnected components. However, if a mesh has disconnected components it is usually better to split the mesh itself: rather than running one simulation you run N independent simulations, where N is the number of components. This is more robust, more flexible, and typically performs better.

If this is not practical you can try repartitioning the mesh a few times. METIS makes use of a random seed, so repartitioning may produce a decomposition in which no disconnected component ends up confined to a single partition.

Interesting suggestion; I will give it a try and report back.

Although complex, the geometry is not actually disconnected, so I am not sure there is a meaningful way to split this case into independent simulations. The disconnection must be an artifact of the specific random seed, algorithm, and number of partitions.

Would you be able to confirm which partitioner you are using and, if it is METIS, whether it has been configured to use 32- or 64-bit integers?

It appears that METIS is susceptible to 32-bit integer overflows internally when partitioning moderately sized meshes.

Regards, Freddie.

Thanks, Freddie! I am using parmetis, with 32-bit integers. I’ll switch and see what happens.

I’ve also tried repartitioning the mesh a few times (as we had discussed), to no avail.

Best,
Eloisa

You can check the integer width by inspecting the output of:

$ cat /usr/include/metis.h | grep '#define IDXTYPEWIDTH'

which should give

#define IDXTYPEWIDTH 64
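
If METIS was installed into a conda environment rather than system-wide, the header will live under the environment's include directory instead; assuming $CONDA_PREFIX points at the active environment, something along the lines of

$ grep '#define IDXTYPEWIDTH' "$CONDA_PREFIX/include/metis.h"

should give the same information.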

An alternative to recompiling METIS is to switch to Scotch. Whilst it uses 32-bit integers for its public interface, it does not appear to suffer from overflow internally (and since our graphs contain well below 2 billion vertices this is not a concern).
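
If you want to go down that route, the invocation should look something like (the mesh file name and output directory here are placeholders):

pyfr partition -p scotch 24576 mesh.pyfrm .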

Regards, Freddie.

Thanks!

I am trying the first workaround (switching to 64-bit parmetis), but it will take a while before the run gets through the queue as it is quite large. Depending on how this goes, I might switch to Scotch.

Eloisa

No luck with 64-bit parmetis… I get the same error about a mismatch between number of partitions and number of ranks. I will try Scotch next.

Note that my mesh is quite complex, so this may be a genuine complaint (there may really be a disconnected component in this decomposition).

After partitioning, would you be able to attach the output of running h5ls on the resulting mesh file? This will enable us to get a better handle on what is going on.
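
If the full listing is too unwieldy to share, even a count of the per-partition datasets would be informative; assuming the partitioned mesh is called mesh.pyfrm (a placeholder) and follows the usual spt_<etype>_p<n> naming, something along the lines of

$ h5ls mesh.pyfrm | grep -c 'spt_tet_p'

would show how many tet partitions actually made it into the file.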

Regards, Freddie.

Sure thing! It’s 15MB worth of metadata though… any specific part you would like to see?

I can confirm that some dataset groups (such as spt_tet_*) only have 24575 members.

Best,
Eloisa

Assuming your mesh only has a single element type, this implies that we are not seeing a disconnected region but instead that the partitioner is not making use of all of the partitions. I believe METIS is 'allowed' to do this (or rather, if you ask for N partitions it does not guarantee that they will all be populated).

Your options here are to switch to Scotch, or to tweak some of the METIS parameters to see if this resolves the issue; for example, switching from recursive bisection to k-way partitioning. This can be done as:

pyfr partition -p metis --popt=ptype:kway ...

There are some other options we expose which can also have an impact.

Regards, Freddie.

I’ve just tried the extra option, and it has not worked; I still end up with one partition fewer than requested.

(An additional piece of information: the same h5ls command returns the right number in the mesh files with up to 12288 partitions.)

It is certainly quite an aggressive partitioning. With ~25M elements you’re looking at ~1000 elements per partition, which, when running on a V100, is a factor of ~20 below where you want to be given the number of CUDA cores and the amount of work required to keep the SMs busy.
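
(Roughly: 25 × 10^6 elements / 24576 partitions ≈ 1000 elements per partition, whereas something on the order of 20000 elements per GPU would be a more comfortable target.)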

Regards, Freddie.

Definitely. This was intentional: the run above is the extreme case of a strong-scaling study, so a breakdown due to problem size was ultimately expected.