I am starting this thread to make sure I correctly capture the issue I am experiencing when using the Ascent plugin. Apologies for the length. I have tried to break it into sections to make it more readable.
Following your suggestion, I am now running my simulation with a minimal MPICH build that uses OFI as the network module. That makes me quite happy, but I believe I can be even happier: MPICH with UCX as the network module shows significantly better numbers on the OSU benchmarks.
The error appears to point towards memory allocation on a CUDA device. I tried this with mpi-type=cuda-aware and export MPIR_CVAR_ENABLE_GPU=1, and I also ran it with mpi-type=standard and export MPIR_CVAR_ENABLE_GPU=0.
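To make the two combinations explicit, here is a minimal sketch of how the environment switch was paired with each mpi-type value. The function name set_gpu_mode is hypothetical and only for illustration; mpi-type itself is set in the solver configuration, not in the shell, and the launch command is left as a placeholder.

```shell
#!/bin/sh
# Pair the solver's mpi-type value with the MPICH GPU switch used in the post.
# set_gpu_mode is a hypothetical helper name, not part of MPICH or the solver.
set_gpu_mode() {
    case "$1" in
        cuda-aware) export MPIR_CVAR_ENABLE_GPU=1 ;;
        standard)   export MPIR_CVAR_ENABLE_GPU=0 ;;
        *) echo "unknown mpi-type: $1" >&2; return 1 ;;
    esac
    # Report the pairing so it ends up in the job log next to any backtrace.
    echo "mpi-type=$1 MPIR_CVAR_ENABLE_GPU=$MPIR_CVAR_ENABLE_GPU"
}

set_gpu_mode standard
# ...launch the simulation here with the usual command...
```

Printing the pairing at the top of each job log makes it easier to tell afterwards which of the two backtraces belongs to which run.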
Note, this is not Frontier. This is our own cluster, Mary Coombs. I am attaching my two compilations and the errors I am getting below. Any advice would be greatly appreciated.
Full error message
With standard I get
[mcc1235:2499275:0:2499275] ucp_mm.c:1761 Assertion `mem_info.type == alloc_mem_type' failed: mem_info.mem_type=host alloc_mem_type=cuda
==== backtrace (tid:2499275) ====
0 0x00000000000534bb ucp_mm_get_alloc_md_index() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/core/ucp_mm.c:1761
1 0x00000000000a347c ucp_proto_rndv_get_mtype_probe() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/rndv/rndv_get.c:332
2 0x0000000000077d08 ucp_proto_select_init_protocols() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.c:220
3 0x0000000000078b98 ucp_proto_select_elem_init() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.c:478
4 0x000000000007a8e4 ucp_proto_select_lookup_slow() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.c:541
5 0x000000000009b828 ucp_proto_rndv_ctrl_select_remote_proto() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/rndv/proto_rndv.c:175
6 0x000000000009c8b3 ucp_proto_rndv_ctrl_probe() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/rndv/proto_rndv.c:446
7 0x00000000000af8b8 ucp_proto_rndv_rtr_probe() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/rndv/rndv_rtr.c:201
8 0x0000000000077d08 ucp_proto_select_init_protocols() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.c:220
9 0x0000000000078b98 ucp_proto_select_elem_init() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.c:478
10 0x000000000007a8e4 ucp_proto_select_lookup_slow() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.c:541
11 0x000000000009b828 ucp_proto_rndv_ctrl_select_remote_proto() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/rndv/proto_rndv.c:175
12 0x000000000009c8b3 ucp_proto_rndv_ctrl_probe() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/rndv/proto_rndv.c:446
13 0x000000000009cde5 ucp_proto_rndv_rts_probe() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/rndv/proto_rndv.c:553
14 0x0000000000077d08 ucp_proto_select_init_protocols() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.c:220
15 0x0000000000078b98 ucp_proto_select_elem_init() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.c:478
16 0x000000000007a8e4 ucp_proto_select_lookup_slow() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.c:541
17 0x00000000000f2310 ucp_proto_select_lookup() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_select.inl:98
18 0x00000000000f2310 ucp_proto_request_lookup_proto() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_common.inl:209
19 0x00000000000f2310 ucp_proto_request_send_op_common() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_common.inl:243
20 0x00000000000f2310 ucp_proto_request_send_op() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/proto/proto_common.inl:302
21 0x00000000000f2310 ucp_tag_send_nbx() /gpfs/mcc/apps_build/rxc01/rxc24-rxc01/ucx-1.19.0/src/ucp/tag/tag_send.c:295
22 0x000000000038551d MPIDI_NM_mpi_isend.constprop.0() helper_fns.c:0
23 0x0000000000388453 MPIC_Send() :0
24 0x0000000000300b9f MPIR_Reduce_intra_binomial() :0
25 0x0000000000372df5 MPIR_Reduce_allcomm_auto() :0
26 0x0000000000372f44 MPIR_Reduce_impl() :0
27 0x0000000000376574 MPIDI_Allreduce_intra_composition_alpha() mpir_coll.c:0
28 0x0000000000378213 MPIR_Allreduce() :0
29 0x0000000000195bfb MPI_Allreduce() ???:0
30 0x00000000003060ca ascent::VTKHCollection::field_topology() ???:0
31 0x00000000003573c3 ascent::runtime::filters::CreatePlot::execute() ???:0
32 0x0000000000025101 flow::Workspace::execute() ???:0
33 0x000000000012a7b8 ascent::AscentRuntime::Execute() ???:0
34 0x00000000000f4b6f ascent::Ascent::execute() ???:0
35 0x00000000000078d6 ffi_prep_go_closure() ???:0
36 0x0000000000004556 ???() /usr/lib64/libffi.so.8:0
37 0x0000000000015746 _call_function_pointer() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/./Modules/_ctypes/callproc.c:950
38 0x0000000000015746 _ctypes_callproc() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/./Modules/_ctypes/callproc.c:1301
39 0x000000000000fe2a PyCFuncPtr_call() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/./Modules/_ctypes/_ctypes.c:4382
40 0x00000000000e7130 _PyObject_MakeTpCall() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:242
41 0x00000000000e7130 _PyObject_MakeTpCall() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:244
42 0x000000000008f7ad _PyEval_EvalFrameDefault() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Python/generated_cases.c.h:813
43 0x00000000000e8b47 _PyObject_VectorcallDictTstate() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:135
44 0x00000000000e8d30 _PyObject_Call_Prepend() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:504
45 0x00000000001675a4 slot_tp_call() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/typeobject.c:9570
46 0x00000000000e7130 _PyObject_MakeTpCall() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:242
47 0x00000000000e7130 _PyObject_MakeTpCall() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:244
48 0x0000000000090b71 _PyEval_EvalFrameDefault() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Python/generated_cases.c.h:1843
49 0x0000000000090b71 _PyEval_EvalFrameDefault() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Python/generated_cases.c.h:1848
50 0x00000000000e8b47 _PyObject_VectorcallDictTstate() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:135
51 0x00000000000e8cd5 _PyObject_Call_Prepend() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:504
52 0x00000000001676f4 slot_tp_init() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/typeobject.c:9816
53 0x0000000000160248 type_call() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/typeobject.c:1997
54 0x00000000000e7130 _PyObject_MakeTpCall() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:242
55 0x00000000000e7130 _PyObject_MakeTpCall() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Objects/call.c:244
56 0x000000000008f7ad _PyEval_EvalFrameDefault() /gpfs/mcc/apps_build/rxc01/rja87-rxc01/Python3/Python-3.13.7/Python/generated_cases.c.h:813
=================================
Loguru caught a signal: SIGABRT
Stack trace:
63 0x401075 _start + 37
62 0x7f50dd0c7640 __libc_start_main + 128
61 0x7f50dd0c7590 /usr/lib64/libc.so.6(+0x29590) [0x7f50dd0c7590]
60 0x7f50dd5f85b9 Py_BytesMain + 41
59 0x7f50dd5f7e29 Py_RunMain + 2457
58 0x7f50dd5d3e3b /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x251e3b) [0x7f50dd5d3e3b]
57 0x7f50dd5d382f /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x25182f) [0x7f50dd5d382f]
56 0x7f50dd5d1933 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x24f933) [0x7f50dd5d1933]
55 0x7f50dd5d15f7 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x24f5f7) [0x7f50dd5d15f7]
54 0x7f50dd5d1136 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x24f136) [0x7f50dd5d1136]
53 0x7f50dd578efe PyEval_EvalCode + 158
52 0x7f50dd4117ad _PyEval_EvalFrameDefault + 15789
51 0x7f50dd469130 _PyObject_MakeTpCall + 144
50 0x7f50dd4e2248 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x160248) [0x7f50dd4e2248]
49 0x7f50dd4e96f4 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x1676f4) [0x7f50dd4e96f4]
48 0x7f50dd46acd5 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0xe8cd5) [0x7f50dd46acd5]
47 0x7f50dd46ab47 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0xe8b47) [0x7f50dd46ab47]
46 0x7f50dd412b71 _PyEval_EvalFrameDefault + 20849
45 0x7f50dd469130 _PyObject_MakeTpCall + 144
44 0x7f50dd4e95a4 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x1675a4) [0x7f50dd4e95a4]
43 0x7f50dd46ad30 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0xe8d30) [0x7f50dd46ad30]
42 0x7f50dd46ab47 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0xe8b47) [0x7f50dd46ab47]
41 0x7f50dd4117ad _PyEval_EvalFrameDefault + 15789
40 0x7f50dd469130 _PyObject_MakeTpCall + 144
39 0x7f505b880e2a /gpfs/mcc/apps/python3/3.13/lib/python3.13/lib-dynload/_ctypes.cpython-313-x86_64-linux-gnu.so(+0xfe2a) [0x7f505b880e2a]
38 0x7f505b886746 /gpfs/mcc/apps/python3/3.13/lib/python3.13/lib-dynload/_ctypes.cpython-313-x86_64-linux-gnu.so(+0x15746) [0x7f505b886746]
37 0x7f505b869556 /usr/lib64/libffi.so.8(+0x4556) [0x7f505b869556]
36 0x7f505b86c8d6 /usr/lib64/libffi.so.8(+0x78d6) [0x7f505b86c8d6]
35 0x7f4daea8eb6f ascent::Ascent::execute(conduit::Node const&) + 543
34 0x7f4daeac47b8 ascent::AscentRuntime::Execute(conduit::Node const&) + 1064
33 0x7f50880d7101 flow::Workspace::execute() + 817
32 0x7f4daecf13c3 ascent::runtime::filters::CreatePlot::execute() + 1667
31 0x7f4daeca00ca ascent::VTKHCollection::field_topology(std::string) + 202
30 0x7f5051664bfb PMPI_Allreduce + 1211
29 0x7f5051847213 /gpfs/mcc/HT07881/uxq12/shared/apps/compiler/gcc/11.4/mpich/4.3.2-ucx/lib/libmpi.so.12(+0x378213) [0x7f5051847213]
28 0x7f5051845574 /gpfs/mcc/HT07881/uxq12/shared/apps/compiler/gcc/11.4/mpich/4.3.2-ucx/lib/libmpi.so.12(+0x376574) [0x7f5051845574]
27 0x7f5051841f44 /gpfs/mcc/HT07881/uxq12/shared/apps/compiler/gcc/11.4/mpich/4.3.2-ucx/lib/libmpi.so.12(+0x372f44) [0x7f5051841f44]
26 0x7f5051841df5 /gpfs/mcc/HT07881/uxq12/shared/apps/compiler/gcc/11.4/mpich/4.3.2-ucx/lib/libmpi.so.12(+0x372df5) [0x7f5051841df5]
25 0x7f50517cfb9f /gpfs/mcc/HT07881/uxq12/shared/apps/compiler/gcc/11.4/mpich/4.3.2-ucx/lib/libmpi.so.12(+0x300b9f) [0x7f50517cfb9f]
24 0x7f5051857453 /gpfs/mcc/HT07881/uxq12/shared/apps/compiler/gcc/11.4/mpich/4.3.2-ucx/lib/libmpi.so.12(+0x388453) [0x7f5051857453]
23 0x7f505185451d /gpfs/mcc/HT07881/uxq12/shared/apps/compiler/gcc/11.4/mpich/4.3.2-ucx/lib/libmpi.so.12(+0x38551d) [0x7f505185451d]
22 0x7f505116d310 ucp_tag_send_nbx + 7040
21 0x7f50510f58e4 ucp_proto_select_lookup_slow + 228
20 0x7f50510f3b98 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0x78b98) [0x7f50510f3b98]
19 0x7f50510f2d08 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0x77d08) [0x7f50510f2d08]
18 0x7f5051117de5 ucp_proto_rndv_rts_probe + 277
17 0x7f50511178b3 ucp_proto_rndv_ctrl_probe + 643
16 0x7f5051116828 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0x9b828) [0x7f5051116828]
15 0x7f50510f58e4 ucp_proto_select_lookup_slow + 228
14 0x7f50510f3b98 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0x78b98) [0x7f50510f3b98]
13 0x7f50510f2d08 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0x77d08) [0x7f50510f2d08]
12 0x7f505112a8b8 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0xaf8b8) [0x7f505112a8b8]
11 0x7f50511178b3 ucp_proto_rndv_ctrl_probe + 643
10 0x7f5051116828 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0x9b828) [0x7f5051116828]
9 0x7f50510f58e4 ucp_proto_select_lookup_slow + 228
8 0x7f50510f3b98 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0x78b98) [0x7f50510f3b98]
7 0x7f50510f2d08 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0x77d08) [0x7f50510f2d08]
6 0x7f505111e47c /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucp.so.0(+0xa347c) [0x7f505111e47c]
5 0x7f50510ce4bb ucp_mm_get_alloc_md_index + 395
4 0x7f5050ee86df ucs_fatal_error_format + 207
3 0x7f5050ee8607 /gpfs/mcc/apps/gcc/ucx/1.19.0/lib/libucs.so.0(+0x66607) [0x7f5050ee8607]
2 0x7f50dd0c67f3 abort + 211
1 0x7f50dd0dc646 raise + 22
0 0x7f50dd12994c /usr/lib64/libc.so.6(+0x8b94c) [0x7f50dd12994c]
2026-01-13 15:47:01.330 ( 5.861s) [main thread ] :0 FATL| Signal: SIGABRT
With cuda-aware I get a much more succinct trace:
Loguru caught a signal: SIGSEGV
Stack trace:
25 0x401075 _start + 37
24 0x7fb8d4454640 __libc_start_main + 128
23 0x7fb8d4454590 /usr/lib64/libc.so.6(+0x29590) [0x7fb8d4454590]
22 0x7fb8d49855b9 Py_BytesMain + 41
21 0x7fb8d4984e29 Py_RunMain + 2457
20 0x7fb8d4960e3b /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x251e3b) [0x7fb8d4960e3b]
19 0x7fb8d496082f /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x25182f) [0x7fb8d496082f]
18 0x7fb8d495e933 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x24f933) [0x7fb8d495e933]
17 0x7fb8d495e5f7 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x24f5f7) [0x7fb8d495e5f7]
16 0x7fb8d495e136 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x24f136) [0x7fb8d495e136]
15 0x7fb8d4905efe PyEval_EvalCode + 158
14 0x7fb8d479e7ad _PyEval_EvalFrameDefault + 15789
13 0x7fb8d47f6130 _PyObject_MakeTpCall + 144
12 0x7fb8d486f248 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x160248) [0x7fb8d486f248]
11 0x7fb8d48766f4 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0x1676f4) [0x7fb8d48766f4]
10 0x7fb8d47f7cd5 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0xe8cd5) [0x7fb8d47f7cd5]
9 0x7fb8d47f7b47 /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0xe8b47) [0x7fb8d47f7b47]
8 0x7fb8d479b20e _PyEval_EvalFrameDefault + 2062
7 0x7fb8d47f93cf /gpfs/mcc/apps/python3/3.13/lib/libpython3.13.so.1.0(+0xea3cf) [0x7fb8d47f93cf]
6 0x7fb8d479fb71 _PyEval_EvalFrameDefault + 20849
5 0x7fb8d47f63b1 PyObject_Vectorcall + 81
4 0x7fb851a3959e /gpfs/mcc/HT07881/uxq12/shared/venvs/pyfr-mpich-ucx/lib/python3.13/site-packages/mpi4py/MPI.cpython-313-x86_64-linux-gnu.so(+0xe259e) [0x7fb851a3959e]
3 0x7fb848ac32d8 MPI_Start + 1080
2 0x7fb848ac20b5 /gpfs/mcc/HT07881/uxq12/shared/apps/compiler/gcc/11.4/mpich/4.3.2-ucx/lib/libmpi.so.12(+0x2660b5) [0x7fb848ac20b5]
1 0x7fb8484f88a6 ucp_tag_send_nbx + 278
0 0x7fb890dfe33e uct_dc_mlx5_ep_am_short + 4318
2026-01-13 16:16:43.245 ( 3.995s) [main thread ] :0 FATL| Signal: SIGSEGV
MPICH configurations
OFI and PMIX only:
./configure \
--with-pmix=/gpfs/mcc/apps/gcc/pmix/5.0.9 \
--with-device=ch4:ofi \
...
PMIX, UCX and CUDA:
./configure \
--with-pmix=/gpfs/mcc/apps/gcc/pmix/5.0.9 \
--with-device=ch4:ucx \
--with-ucx=/gpfs/mcc/apps/gcc/ucx/1.19.0 \
--with-cuda=$CUDA_DIR \
...
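A possible sanity check on the UCX side, sketched under the assumption that the ucx_info tool from the same UCX install passed to --with-ucx is on PATH. It prints the UCX version and configure line, then looks for any CUDA transports or memory domains in the device list. I have not confirmed that this is related to the crash; the helper name check_ucx_cuda is hypothetical.

```shell
#!/bin/sh
# Report whether the UCX install in use lists any CUDA transports or memory domains.
# check_ucx_cuda is a hypothetical helper name used only for this sketch.
check_ucx_cuda() {
    if ! command -v ucx_info >/dev/null 2>&1; then
        echo "ucx_info not found on PATH"
        return 0
    fi
    # Version and the configure flags UCX was built with.
    ucx_info -v
    # Device listing; any line mentioning cuda indicates CUDA support.
    if ucx_info -d 2>/dev/null | grep -qi cuda; then
        echo "UCX reports CUDA transports or memory domains"
    else
        echo "UCX reports no CUDA transports or memory domains"
    fi
}

check_ucx_cuda
```

Running this on a compute node rather than a login node matters, since the device list depends on the hardware visible to the process.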
When I run the OSU benchmarks I get consistently better numbers for UCX than for OFI. Bandwidth is 2x higher and latency is 2x lower for large messages. OFI also requires an explicit FI_PROVIDER=verbs, as otherwise it defaults to TCP.
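For completeness, a minimal sketch of the OFI provider selection mentioned above. FI_PROVIDER=verbs is the setting from my runs; the fi_info call is an optional check that assumes the libfabric utilities are installed on the node.

```shell
#!/bin/sh
# Select the verbs provider explicitly; without this, OFI defaulted to TCP here.
export FI_PROVIDER=verbs

# fi_info ships with libfabric. If it is available, list what the chosen
# provider exposes on this node; otherwise just say the tool is missing.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -p "$FI_PROVIDER" || echo "provider $FI_PROVIDER not available on this node"
else
    echo "fi_info not found on PATH"
fi
```

The export has to be in place before the MPI launcher starts, so it belongs in the job script ahead of the benchmark or simulation command.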