I am making progress with my cases, and I now want to try out the pyfr resample utility, as it would be extremely useful for remapping solutions between meshes of different resolution.
The NIC failure suggests a problem on our HPC side, but I want to check with you that I haven’t misunderstood how the tool is supposed to work. I would also appreciate any practical suggestions for resolving the problem.
I am mapping between 3rd-order solutions (for initialisation). The small mesh was done with 16 ranks and the large mesh with 64. Should this work in the first place?
#!/bin/bash
#SBATCH --job-name=${d}_resample
#SBATCH --chdir=$d
#SBATCH --output=logs/resample_output.txt
#SBATCH --error=logs/resample_errors.txt
#SBATCH --time=48:00:00
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=32
srun --mpi=pmix \
    pyfr \
    resample \
    -i idw \
    -P p64 \
    ~/study/meshes/001m.pyfrm \
    ../coarse/k3-014.5.pyfrs \
    ~/study/meshes/100m.pyfrm \
    main_p3.ini \
    resampled.pyfrs
The error I get is:
Abort(412691087) on node 31 (rank 31 in comm 0): Fatal error in internal_Alltoall_c: Other MPI error, error stack:
internal_Alltoall_c(347).....................: MPI_Alltoall_c(sendbuf=0x1e57f20, sendcount=1, MPI_LONG, recvbuf=0x1dc4950, recvcount=1, MPI_LONG, MPI_COMM_WORLD) failed
MPID_Alltoall(1134)..........................:
MPIDI_Alltoall_allcomm_composition_json(1065):
MPIDI_Alltoall_intra_composition_beta(1287)..:
MPIDI_NM_mpi_alltoall(290)...................:
MPIR_Alltoall_impl(3088).....................:
MPIR_Alltoall_allcomm_auto(3012).............:
MPIR_Alltoall_intra_brucks(110)..............:
MPIC_Sendrecv(263)...........................:
MPIC_Wait(90)................................:
MPIR_Wait(751)...............................:
MPIR_Wait_state(708).........................:
MPIDI_progress_test(142).....................:
MPIDI_OFI_handle_cq_error(786)...............: OFI poll failed (default nic=ibp63s0: Input/output error)
A few ideas:
- I probably don’t need GPUs for this job, since I think everything happens on the CPU. Is that right?
- Maybe I underestimated the memory requirement for a 3rd-order, 100m-element mesh, so the NIC failure is actually a memory problem.
- Maybe the partitioning needs to match?
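To put a rough number on the memory idea, here is a back-of-the-envelope sketch of the size of the target solution array alone. It assumes things I have not stated above: that "100m" means 100 million elements, that the elements are hexahedra with (p + 1)^3 solution points each, that there are 5 field variables, and that values are stored in double precision. The function name is just for illustration. Whatever extra storage the resampling itself needs (point coordinates, search structures, communication buffers) is not included, so this is only a lower bound.

```python
def solution_bytes(n_elements, order, n_vars=5, dim=3, bytes_per_value=8):
    """Lower-bound size in bytes of one solution array.

    Assumes tensor-product (hexahedral) elements, so each element
    holds (order + 1) ** dim solution points.
    """
    pts_per_element = (order + 1) ** dim
    return n_elements * pts_per_element * n_vars * bytes_per_value


n_elements = 100_000_000   # assumption: "100m" = 100 million elements
n_ranks = 64               # from --ntasks=64
ranks_per_node = 4         # from --ntasks-per-node=4

total = solution_bytes(n_elements, order=3)
per_rank = total / n_ranks
per_node = per_rank * ranks_per_node

print(f"total:    {total / 1e9:.1f} GB")
print(f"per rank: {per_rank / 1e9:.1f} GB")
print(f"per node: {per_node / 1e9:.1f} GB")
```

Under these assumptions the solution array alone is about 256 GB in total, or roughly 4 GB per rank if it is spread evenly over the 64 ranks. That does not look excessive on its own, so if memory is the culprit it would have to come from the additional working storage rather than from the solution itself.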