Freezing at Start-up

Hi,

I have been having an issue running my channel-flow case where the simulation completely freezes up. I know this because I have modified the source code to print a statement every few iterations so I can see which time step it is on, and both that output and the progress bar stop completely. I am running the case across 10 nodes of an HPC cluster with 20 GPUs in total, and my mesh is somewhat large (~2,000,000 elements). I had been running the case successfully for 4 hours and need to restart it from my last .pyfrs file, but now it freezes almost immediately, after ~20 iterations or so. This does not make sense to me, as it is the exact same case I ran just a few hours ago with no trouble at all. The same thing has also been happening when I try to run other cases.

I saw somewhere on the forum that rebuilding UCX and OpenMPI solved this issue, but that did not work for me. I took a backtrace of one of the running processes with gdb (a sketch of how is given after the trace), and it indicates that the process is stuck in a low-level MPI communication call involving UCX. Here is some of the backtrace:
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_han.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_inter.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_inter.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_libnbc.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_libnbc.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_self.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_self.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_sm.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_sm.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/libmca_common_sm.so.40…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/libmca_common_sm.so.40
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_sync.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_sync.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_tuned.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_tuned.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_cuda.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_cuda.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_hcoll.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_hcoll.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/libhcoll.so.1…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/libhcoll.so.1
Reading symbols from /lib64/libnsl.so.1…Reading symbols from /usr/lib/debug/usr/lib64/libnsl-2.17.so.debug…done.
done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/libocoms.so.0…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/libocoms.so.0
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_sm.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_sm.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_pt2pt.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_pt2pt.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_rdma.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_rdma.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_ucx.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_ucx.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_rcache_ucs.so…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_rcache_ucs.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_bcol_basesmuma.so…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_bcol_basesmuma.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_mcast_vmc.so…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_mcast_vmc.so
Reading symbols from /lib64/libcuda.so…(no debugging symbols found)…done.
Loaded symbols for /lib64/libcuda.so
Reading symbols from /apps/cuda/12.3.0/lib64/libnvrtc.so…(no debugging symbols found)…done.
Loaded symbols for /apps/cuda/12.3.0/lib64/libnvrtc.so
Reading symbols from /apps/cuda/12.3.0/lib64/libcublasLt.so…(no debugging symbols found)…done.
Loaded symbols for /apps/cuda/12.3.0/lib64/libcublasLt.so
Reading symbols from /apps/cuda/12.3.0/lib64/libnvrtc-builtins.so.12.3…(no debugging symbols found)…done.
Loaded symbols for /apps/cuda/12.3.0/lib64/libnvrtc-builtins.so.12.3
0x00002abafe7bc209 in uct_rc_mlx5_iface_progress_cyclic (arg=0x1ad0030)
at rc/accel/rc_mlx5_iface.c:191
191 rc/accel/rc_mlx5_iface.c: No such file or directory.
warning: File “/apps/gnu/8.4.0/lib64/libstdc++.so.6.0.25-gdb.py” auto-loading has been declined by your `auto-load safe-path' set to “$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py”.
To enable execution of this file add
add-auto-load-safe-path /apps/gnu/8.4.0/lib64/libstdc++.so.6.0.25-gdb.py
line to your configuration file.
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file.
For more information about this security protection see the “Auto-loading safe path” section in the GDB manual. E.g., run from the shell:
info “(gdb)Auto-loading safe path”
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-326.el7_9.3.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libffi-3.0.13-19.el7.x86_64 libibverbs-58mlnx43-1.58307.x86_64 libnl3-3.2.28-4.el7.x86_64 libpciaccess-0.14-1.el7.x86_64 librdmacm-58mlnx43-1.58307.x86_64 nss-softokn-freebl-3.90.0-6.el7_9.x86_64 numactl-libs-2.0.12-5.el7_9.x86_64 nvidia-driver-latest-dkms-cuda-libs-535.104.05-1.el7.x86_64 sssd-client-1.16.5-10.el7_9.16.x86_64 systemd-libs-219-78.el7_9.9.x86_64 xz-libs-5.2.2-2.el7_9.x86_64 zlib-1.2.7-21.el7_9.x86_64
(gdb) bt
#0 0x00002abafe7bc209 in uct_rc_mlx5_iface_progress_cyclic (arg=0x1ad0030)
at rc/accel/rc_mlx5_iface.c:191
#1 0x00002abacc7301ca in ucs_callbackq_dispatch (cbq=)
at /build-result/src/hpcx-v2.12-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-master/src/ucs/datastruct/callbackq.h:211
#2 uct_worker_progress (worker=)
at /build-result/src/hpcx-v2.12-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-master/src/uct/api/uct.h:2647
#3 ucp_worker_progress (worker=0x15d1740) at core/ucp_worker.c:2804
#4 0x00002abb06afbed4 in opal_progress () at runtime/opal_progress.c:231
#5 0x00002abb03e430a6 in ompi_request_default_wait_all (count=28114992,
requests=0x0, statuses=0x2abafe7bc200 <uct_rc_mlx5_iface_progress_cyclic>)
at …/opal/threads/wait_sync.h:83
#6 0x00002abb03e71c63 in PMPI_Waitall (count=28114992, requests=0x0,
statuses=0x2abafe7bc200 <uct_rc_mlx5_iface_progress_cyclic>)
at pwaitall.c:80
#7 0x00002abb038cca10 in __pyx_pw_6mpi4py_3MPI_7Request_27Waitall (
__pyx_v_cls=0x1ad0030, __pyx_args=0x0, __pyx_nargs=46982621807104,
__pyx_kwds=0x2abb0acd5130 <mca_pml_ucx_progress>)
at src/mpi4py/MPI.c:143690
#8 0x00000000005c55e3 in _PyObject_VectorcallTstate (kwnames=,
nargsf=2, args=0x2abbdd9f81d8, callable=0x2abb02fe5e50, tstate=0xb31460)
at ./Include/cpython/abstract.h:114
---Type <return> to continue, or q <return> to quit---
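
For reference, the backtrace above was taken by attaching gdb to one of the hanging ranks on a compute node; a minimal sketch of the procedure (the node name, process name and PID are illustrative):

    ssh node0123                      # log in to a node where a rank is stuck
    pgrep -u $USER -f pyfr            # find the PID of the hanging rank
    gdb -p 12345                      # attach gdb to that PID
    (gdb) bt                          # print the backtrace shown above
    (gdb) detach
    (gdb) quit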

Just for the purposes of debugging, please force OpenMPI/UCX to use TCP as a transport rather than InfiniBand. Note that this will have a negative impact on performance (maybe 3-4x), but it will let you see whether the problem still occurs.

The exact syntax can vary, but passing --mca btl tcp,self to mpirun (or something along those lines) should do it.
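
For example, with mpirun the full launch line for a restart might look something like this (the rank count, backend and file names are illustrative):

    mpirun -n 20 --mca btl tcp,self pyfr restart -b cuda mesh.pyfrm soln.pyfrs config.ini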

Regards, Freddie.


Thank you. I am launching with srun, so I set export OMPI_MCA_btl=tcp,self instead, and it is running now. Now that we know what the issue is, is there anything I can do to avoid the 3-4x performance penalty?
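
For reference, in a Slurm batch script the equivalent setting looks something like this (scheduler directives and file names are illustrative):

    #!/bin/bash
    #SBATCH --nodes=10
    #SBATCH --ntasks-per-node=2       # one MPI rank per GPU
    #SBATCH --gres=gpu:2

    # srun equivalent of passing --mca btl tcp,self to mpirun
    export OMPI_MCA_btl=tcp,self

    srun pyfr restart -b cuda mesh.pyfrm soln.pyfrs config.ini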

The problem is basically guaranteed to be with how InfiniBand is configured on your cluster. PyFR is entirely ignorant of how things are actually transported over MPI (the MPI calls we make are always the same irrespective of which transport is being used). You will need to work with your system administrator to diagnose the issue; a few generic checks are sketched below.
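
Things worth checking with them (the commands are illustrative of typical checks; the exact tools depend on what is installed) are whether the InfiniBand ports are up on every node and which transports UCX has detected:

    ibstat                              # port state should be Active on each node
    ucx_info -v                         # version of the UCX build OpenMPI is using
    ucx_info -d | grep -i transport     # transports UCX can see (rc, ud, tcp, ...)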

Regards, Freddie.
