Resample utility fails due to a NIC issue or something else?

I am making progress with my cases and now want to try out the pyfr resample utility, as it would be extremely useful for remapping solutions between meshes of different resolution.

The NIC failure implies a problem on our HPC side, but I want to verify with you that I haven’t misunderstood something about how the tool is supposed to work. I would also appreciate any practical suggestions for resolving the problem.

I am mapping between 3rd order solutions (for initialisation). The small mesh was done with 16 ranks and the large mesh with 64. Should this work in the first place?

#!/bin/bash
#SBATCH --job-name=${d}_resample
#SBATCH --chdir=$d
#SBATCH --output=logs/resample_output.txt
#SBATCH --error=logs/resample_errors.txt
#SBATCH --time=48:00:00
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=32

srun --mpi=pmix \
    pyfr \
        resample \
        -i idw \
        -P p64 \
        ~/study/meshes/001m.pyfrm \
        ../coarse/k3-014.5.pyfrs \
        ~/study/meshes/100m.pyfrm \
        main_p3.ini \
        resampled.pyfrs

The error I get is:

Abort(412691087) on node 31 (rank 31 in comm 0): Fatal error in internal_Alltoall_c: Other MPI error, error stack:
internal_Alltoall_c(347).....................: MPI_Alltoall_c(sendbuf=0x1e57f20, sendcount=1, MPI_LONG, recvbuf=0x1dc4950, recvcount=1, MPI_LONG, MPI_COMM_WORLD) failed
MPID_Alltoall(1134)..........................:
MPIDI_Alltoall_allcomm_composition_json(1065):
MPIDI_Alltoall_intra_composition_beta(1287)..:
MPIDI_NM_mpi_alltoall(290)...................:
MPIR_Alltoall_impl(3088).....................:
MPIR_Alltoall_allcomm_auto(3012).............:
MPIR_Alltoall_intra_brucks(110)..............:
MPIC_Sendrecv(263)...........................:
MPIC_Wait(90)................................:
MPIR_Wait(751)...............................:
MPIR_Wait_state(708).........................:
MPIDI_progress_test(142).....................:
MPIDI_OFI_handle_cq_error(786)...............: OFI poll failed (default nic=ibp63s0: Input/output error)

A few ideas:

  • I probably don’t need a GPU in this request, as I think everything is happening on the CPU. Is that right?
  • Maybe I underestimated the memory requirement for a 3rd order, 100m element mesh, so the NIC failure is actually a memory problem.
  • Maybe the partitioning needs to match?

There is no requirement for the partitioning to match. The only requirement is that both meshes must have a partitioning with the requested number of parts (64 in your case). The MPI error is likely something outside of PyFR. Our usual recommendation is to try to reproduce the problem with MPICH over Ethernet and go from there.

If you have a high-memory node you can also try running the resampler on just a single rank. The resampler doesn’t make use of GPUs, so when everything is working as expected one rank per core is usually about right.
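
To judge whether a single node is realistic, a rough lower bound on the size of the target solution can be worked out by hand. The sketch below rests on assumptions that are not stated in the thread: that "100m" means 100 million elements, that the elements are hexahedra with (p + 1)^3 solution points each, that there are five field variables, and that values are stored in double precision. The function name solution_bytes is purely illustrative and is not part of PyFR.

```python
def solution_bytes(n_elems, order, n_vars=5, ndims=3, bytes_per_value=8):
    """Estimate the raw size of a solution array in bytes.

    Assumes tensor-product (hexahedral) elements, so each element
    holds (order + 1)**ndims solution points. All defaults are
    assumptions for illustration, not values taken from PyFR.
    """
    pts_per_elem = (order + 1) ** ndims
    return n_elems * pts_per_elem * n_vars * bytes_per_value


# Assumed case: 100 million hexahedra at 3rd order
est = solution_bytes(100_000_000, 3)
print(f"{est / 1e9:.0f} GB")  # prints "256 GB"
```

This counts only the target solution array itself. The mesh, the source solution, and any working storage used during the search for nearest neighbours come on top of it, so the real requirement will be higher.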

Also, note that the resampling is currently just a weighted average of the nearest neighbours. This is first order accurate. Still, it has shown itself to be ‘useful enough’ but we are looking to improve the interpolation in the future—especially in the context of simulations with shocks.
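
For readers unfamiliar with the idea, a weighted average of nearest neighbours can be sketched in a few lines of plain Python. This is a generic inverse distance weighting example and not PyFR's actual implementation: the function name idw_interpolate, the neighbour count k, and the exponent power are all hypothetical choices made for illustration.

```python
import math


def idw_interpolate(src_pts, src_vals, tgt_pt, k=4, power=2.0, eps=1e-12):
    """Inverse-distance-weighted average of the k nearest source values.

    src_pts  : list of coordinate tuples for the source points
    src_vals : list of scalar values, one per source point
    tgt_pt   : coordinate tuple of the target point
    k, power : illustrative parameters, not taken from PyFR
    """
    # Distance from the target to every source point (brute force)
    dists = [(math.dist(p, tgt_pt), v) for p, v in zip(src_pts, src_vals)]
    dists.sort(key=lambda dv: dv[0])
    nearest = dists[:k]

    # If the target coincides with a source point, return its value
    if nearest[0][0] < eps:
        return nearest[0][1]

    # Weights fall off with distance; closer points count for more
    weights = [1.0 / d**power for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


# Four source points on the corners of a unit square
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
vals = [1.0, 2.0, 3.0, 4.0]
centre = idw_interpolate(pts, vals, (0.5, 0.5))  # equidistant, so the plain mean
```

Because the result is always a convex combination of the neighbouring values, it stays within their range, but it does not reproduce the high-order polynomial representation of the source solution.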

Regards, Freddie.