Hi,
I have been having an issue where my channel flow simulation completely freezes. I know it is a genuine hang because I modified the source code to print a statement every few iterations, so I can see exactly which time step it is on, and the progress bar freezes as well. I am running the case across 10 nodes of an HPC cluster with 20 GPUs in total, on a fairly large mesh (~2,000,000 elements). The case ran successfully for 4 hours, and I now need to restart it from my last .pyfrs file, but on restart it freezes almost immediately, after ~20 iterations or so. This makes no sense to me, as it is the exact same case I ran just a few hours ago with no trouble at all. The same thing has also been happening when I try to run other cases.
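For reference, this is roughly how I am launching the restart (the file names here are placeholders; one MPI rank per GPU, CUDA backend, progress bar enabled):

```
# illustrative launch line -- mesh/solution/config names are placeholders
mpirun -np 20 pyfr restart -b cuda -p mesh.pyfrm latest.pyfrs channel.ini
```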
I know I saw somewhere on the forum that rebuilding UCX and Open MPI solved a similar issue, but that did not work for me. I attached gdb to one of the running processes, and the backtrace indicates that the process is stuck in a low-level MPI communication call involving UCX.
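This is roughly how I captured the backtrace (the PID is just an example):

```
# on one of the compute nodes, find a running rank and attach gdb
pgrep -af pyfr     # list running pyfr processes with their PIDs
gdb -p 12345       # attach to one of them (example PID)
(gdb) bt           # print the backtrace
```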
Here is some of the backtrace:

Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_han.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_inter.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_inter.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_libnbc.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_libnbc.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_self.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_self.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_sm.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_sm.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/libmca_common_sm.so.40…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/libmca_common_sm.so.40
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_sync.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_sync.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_tuned.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_tuned.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_cuda.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_cuda.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_hcoll.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_coll_hcoll.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/libhcoll.so.1…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/libhcoll.so.1
Reading symbols from /lib64/libnsl.so.1…Reading symbols from /usr/lib/debug/usr/lib64/libnsl-2.17.so.debug…done.
done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/libocoms.so.0…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/libocoms.so.0
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_sm.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_sm.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_pt2pt.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_pt2pt.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_rdma.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_rdma.so
Reading symbols from /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_ucx.so…done.
Loaded symbols for /apps/openmpi-cuda/intel/19.0/4.1.2-hpcx/lib/openmpi/mca_osc_ucx.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_rcache_ucs.so…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_rcache_ucs.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_bcol_basesmuma.so…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_bcol_basesmuma.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so
Reading symbols from /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_mcast_vmc.so…(no debugging symbols found)…done.
Loaded symbols for /apps/hpcx/2.12.0/hcoll/lib/hcoll/hmca_mcast_vmc.so
Reading symbols from /lib64/libcuda.so…(no debugging symbols found)…done.
Loaded symbols for /lib64/libcuda.so
Reading symbols from /apps/cuda/12.3.0/lib64/libnvrtc.so…(no debugging symbols found)…done.
Loaded symbols for /apps/cuda/12.3.0/lib64/libnvrtc.so
Reading symbols from /apps/cuda/12.3.0/lib64/libcublasLt.so…(no debugging symbols found)…done.
Loaded symbols for /apps/cuda/12.3.0/lib64/libcublasLt.so
Reading symbols from /apps/cuda/12.3.0/lib64/libnvrtc-builtins.so.12.3…(no debugging symbols found)…done.
Loaded symbols for /apps/cuda/12.3.0/lib64/libnvrtc-builtins.so.12.3
0x00002abafe7bc209 in uct_rc_mlx5_iface_progress_cyclic (arg=0x1ad0030)
at rc/accel/rc_mlx5_iface.c:191
191 rc/accel/rc_mlx5_iface.c: No such file or directory.
warning: File "/apps/gnu/8.4.0/lib64/libstdc++.so.6.0.25-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
add-auto-load-safe-path /apps/gnu/8.4.0/lib64/libstdc++.so.6.0.25-gdb.py
line to your configuration file.
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file.
For more information about this security protection see the “Auto-loading safe path” section in the GDB manual. E.g., run from the shell:
info “(gdb)Auto-loading safe path”
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-326.el7_9.3.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libffi-3.0.13-19.el7.x86_64 libibverbs-58mlnx43-1.58307.x86_64 libnl3-3.2.28-4.el7.x86_64 libpciaccess-0.14-1.el7.x86_64 librdmacm-58mlnx43-1.58307.x86_64 nss-softokn-freebl-3.90.0-6.el7_9.x86_64 numactl-libs-2.0.12-5.el7_9.x86_64 nvidia-driver-latest-dkms-cuda-libs-535.104.05-1.el7.x86_64 sssd-client-1.16.5-10.el7_9.16.x86_64 systemd-libs-219-78.el7_9.9.x86_64 xz-libs-5.2.2-2.el7_9.x86_64 zlib-1.2.7-21.el7_9.x86_64
(gdb) bt
#0 0x00002abafe7bc209 in uct_rc_mlx5_iface_progress_cyclic (arg=0x1ad0030)
at rc/accel/rc_mlx5_iface.c:191
#1 0x00002abacc7301ca in ucs_callbackq_dispatch (cbq=<optimized out>)
at /build-result/src/hpcx-v2.12-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-master/src/ucs/datastruct/callbackq.h:211
#2 uct_worker_progress (worker=<optimized out>)
at /build-result/src/hpcx-v2.12-gcc-MLNX_OFED_LINUX-5-redhat7-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-master/src/uct/api/uct.h:2647
#3 ucp_worker_progress (worker=0x15d1740) at core/ucp_worker.c:2804
#4 0x00002abb06afbed4 in opal_progress () at runtime/opal_progress.c:231
#5 0x00002abb03e430a6 in ompi_request_default_wait_all (count=28114992,
requests=0x0, statuses=0x2abafe7bc200 <uct_rc_mlx5_iface_progress_cyclic>)
at ../opal/threads/wait_sync.h:83
#6 0x00002abb03e71c63 in PMPI_Waitall (count=28114992, requests=0x0,
statuses=0x2abafe7bc200 <uct_rc_mlx5_iface_progress_cyclic>)
at pwaitall.c:80
#7 0x00002abb038cca10 in __pyx_pw_6mpi4py_3MPI_7Request_27Waitall (
__pyx_v_cls=0x1ad0030, __pyx_args=0x0, __pyx_nargs=46982621807104,
__pyx_kwds=0x2abb0acd5130 <mca_pml_ucx_progress>)
at src/mpi4py/MPI.c:143690
#8 0x00000000005c55e3 in _PyObject_VectorcallTstate (kwnames=<optimized out>,
nargsf=2, args=0x2abbdd9f81d8, callable=0x2abb02fe5e50, tstate=0xb31460)
at ./Include/cpython/abstract.h:114
---Type <return> to continue, or q <return> to quit---