Caveat: I suspect Slurm is misconfigured on the cluster I am working on, but I would like to open this anyhow and ask for advice.
Thanks to @WillT, I have the Ascent plugin working and producing output in a serial run for the Taylor–Green vortex case. However, when running in parallel, even on a single node:
salloc --time 30:00 --ntasks=4 --gpus-per-task=1 --gpus-per-node=4 --hint=nomultithread
mpirun pyfr --progress run --backend cuda mesh.pyfrm conf.ini
I get
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
...
[Error] Ascent::publish
file: /gpfs/mcc/HCBEACH03/rxb148/rrs59-rxb148/apps/ascent/ascent-v0.9.5/src/libs/ascent/runtimes/ascent_main_runtime.cpp
line: 677
message:
Local Domain IDs are not unique on ranks: 0 1 2 3
I believe the relevant part of my config is this:
[backend]
precision = single
rank-allocator = linear
[backend-cuda]
device-id = local-rank
mpi-type = standard
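For reference, my understanding of what `device-id = local-rank` does, as an illustrative sketch (not PyFR's actual code): each rank is assigned the GPU matching its node-local rank, so with four ranks per node every rank on a node gets a distinct device.

```python
# Hypothetical sketch of a "local-rank" device mapping under round-robin
# rank placement: the device index is the rank's position within its node.
# Names here are illustrative, not taken from PyFR's source.
def local_rank_device(global_rank, ranks_per_node):
    """Return the GPU device index for a rank packed linearly onto nodes."""
    return global_rank % ranks_per_node

# With 4 ranks per node: ranks 0-3 use devices 0-3 on node 0,
# ranks 4-7 use devices 0-3 on node 1.
print([local_rank_device(r, 4) for r in range(8)])
```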
Do you have a single element type case you can test with?
It might be that a recent change to Ascent has started to require that domain IDs be unique even on the same rank (although looking through the changelog I can't see anything obvious here).
Just a note: I have access to two clusters, OLCF Frontier and STFC Mary Coombs. The backtrace above is from Mary Coombs with OpenMPI 5, but I can rerun the same on Frontier and compare. I can obviously set up plain MPICH on both too, but it would mean I would need to recompile HDF5 as well.
I'm not sure why you would need to recompile HDF5. Current PyFR does not make use of any of the parallel functionality of HDF5, and so any build will do.
For segfaults in this area we only accept the output of MPICH; everything else (vendor variants of MPICH and any builds of OpenMPI) is just far too buggy.
@fdw, sorry I only came back to this today, as I had to focus on generating some results on coarse meshes for which Ascent wasn't critical. I am going to update PyFR and try again today; if it still fails, I will switch to MPICH as per your previous suggestion.
This has indeed fixed the mixed elements for me in the serial run and I could see the entire cross-section as expected. Thanks for a speedy PR.
Reproducing with pure MPICH compiled with GCC 11 (nothing but --prefix=... at configure time, and only the essential Conduit, VTK-m and Ascent configure options) works in parallel too. With the OpenMPI supplied with this cluster, as well as with the MPICH on Frontier, I still get either an error about local ranks (Frontier) or an inscrutable backtrace involving calls into UCX (Mary Coombs).
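For completeness, the MPICH build was essentially this (version number and install prefix here are illustrative, not the exact ones I used):

```shell
# Plain MPICH from source with GCC 11; only --prefix is passed to configure.
tar xf mpich-4.2.1.tar.gz && cd mpich-4.2.1
./configure --prefix=$HOME/apps/mpich
make -j 8 && make install

# Rebuild mpi4py against this MPICH so PyFR picks up the right MPI.
env MPICC=$HOME/apps/mpich/bin/mpicc \
    python -m pip install --no-cache-dir --force-reinstall mpi4py
```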
I will need to dive in and produce some results for Thursday with what I've got, but I will document what I am seeing with these other installations. What would you suggest, though? Should I try to chase down the error with OpenMPI and Frontier's MPICH?
Are you sure that on Frontier you are indeed running in parallel? Check by having each rank dump its ID. Then, confirm you are indeed using the latest version of PyFR on Frontier and not an older version by mistake.
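For example, a minimal check along these lines (assuming mpi4py is available in the run environment and srun is your launcher): every rank should print a distinct ID, 0 through 3 here.

```shell
# Each of the four ranks should report a different rank number;
# if they all print "rank 0 of 1" you are not actually running in parallel.
srun --ntasks=4 python -c \
  "from mpi4py import MPI; c = MPI.COMM_WORLD; print(f'rank {c.rank} of {c.size}')"
```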
Revisiting this today, I found the source of the error on Frontier. I wasn't in fact using the GPU-aware transport, or indeed Slingshot, when running on many nodes. I needed to force the GPU Transport Layer by preloading the library.
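For anyone who hits this later, the fix looked roughly like this, a sketch from memory following the Cray PE conventions; the exact module and library paths on Frontier may differ:

```shell
# Enable GPU-aware Cray MPICH and preload the HIP GTL (GPU Transport Layer)
# so that device buffers can be passed to MPI directly.
export MPICH_GPU_SUPPORT_ENABLED=1
export LD_PRELOAD=${CRAY_MPICH_ROOTDIR}/gtl/lib/libmpi_gtl_hsa.so
```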