Environment for OLCF Frontier for develop branch?

Hello,

Could you advise on a good environment setup for running the develop branch of PyFR on OLCF Frontier? I’ve configured everything I needed (see below), but when I launch my test job on a single node with 8 ranks using srun:

srun -N1 -n8 -c1 --gpus-per-task=1 --gpu-bind=closest \
    pyfr run --backend hip mesh.pyfrm conf.ini

I get an error about a bad address:

process_vm_readv: Bad address
Assertion failed in file ../src/mpid/ch4/shm/cray_common/cray_common_memops.c at line 461: 0
process_vm_readv: Bad address
Assertion failed in file ../src/mpid/ch4/shm/cray_common/cray_common_memops.c at line 461: 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x7fffec1b5bbb]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1c48264) [0x7fffeba5a264]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x21dc6c0) [0x7fffebfee6c0]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x21f86b7) [0x7fffec00a6b7]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x21d4e75) [0x7fffebfe6e75]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0xd89094) [0x7fffeab9b094]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0xdd3e35) [0x7fffeabe5e35]
/opt/cray/pe/lib64/libmpi_cray.so.12(PMPI_Waitall+0x3d1) [0x7fffeabe6631]
/lustre/orion/cfd219/scratch/rsawko/venvs/pyfr-with-ascent/lib/python3.13/site-packages/mpi4py/MPI.cpython-313-x86_64-linux-gnu.so(+0x118473) [0x7fffe2e93473]
/sw/frontier/spack-envs/core-25.03/opt/gcc-13.2/python-3.13.0-z6cvwh43maa5kvglquodiadny7ohzirp/lib/libpython3.13.so.1.0(PyObject_Vectorcall+0x4f) [0x7fffed4c0d2f]
/sw/frontier/spack-envs/core-25.03/opt/gcc-13.2/python-3.13.0-z6cvwh43maa5kvglquodiadny7ohzirp/lib/libpython3.13.so.1.0(_PyEval_EvalFrameDefault+0x1980) [0x7fffed45a150]
/sw/frontier/spack-envs/core-25.03/opt/gcc-13.2/python-3.13.0-z6cvwh43maa5kvglquodiadny7ohzirp/lib/libpython3.13.so.1.0(PyEval_EvalCode+0x135) [0x7fffed6115f5]
/sw/frontier/spack-envs/core-25.03/opt/gcc-13.2/python-3.13.0-z6cvwh43maa5kvglquodiadny7ohzirp/lib/libpython3.13.so.1.0(+0x2ad97e) [0x7fffed67697e]
/sw/frontier/spack-envs/core-25.03/opt/gcc-13.2/python-3.13.0-z6cvwh43maa5kvglquodiadny7ohzirp/lib/libpython3.13.so.1.0(+0x2adc27) [0x7fffed676c27]
/sw/frontier/spack-envs/core-25.03/opt/gcc-13.2/python-3.13.0-z6cvwh43maa5kvglquodiadny7ohzirp/lib/libpython3.13.so.1.0(+0x2afb56) [0x7fffed678b56]
/sw/frontier/spack-envs/core-25.03/opt/gcc-13.2/python-3.13.0-z6cvwh43maa5kvglquodiadny7ohzirp/lib/libpython3.13.so.1.0(+0x2b012c) [0x7fffed67912c]
/sw/frontier/spack-envs/core-25.03/opt/gcc-13.2/python-3.13.0-z6cvwh43maa5kvglquodiadny7ohzirp/lib/libpython3.13.so.1.0(Py_RunMain+0x9b9) [0x7fffed6a1ad9]
/sw/frontier/spack-envs/core-25.03/opt/gcc-13.2/python-3.13.0-z6cvwh43maa5kvglquodiadny7ohzirp/lib/libpython3.13.so.1.0(Py_BytesMain+0x45) [0x7fffed6a22b5]
/lib64/libc.so.6(+0x40e6c) [0x7fffed0e1e6c]
/lib64/libc.so.6(__libc_start_main+0x87) [0x7fffed0e1f35]
/lustre/orion/cfd219/scratch/rsawko/venvs/pyfr-with-ascent/bin/python(_start+0x21) [0x400da1]

My current setup

I think the following modules should work:

  1) libfabric/1.22.0                14) darshan-runtime/3.4.6-mpi    (E4S)
  2) craype-network-ofi              15) hsi/default
  3) perftools-base/24.11.0          16) lfs-wrapper/0.0.1
  4) xpmem/2.11.3-1.3_gdbda01a1eb3d  17) DefApps
  5) cray-pmi/6.1.15                 18) python/3.13.0
  6) cce/18.0.1                      19) rocm/6.4.1
  7) craype/2.7.33                   20) craype-x86-milan
  8) cray-dsmml/0.3.0                21) cray-hdf5-parallel/1.12.2.11
  9) cray-mpich/8.1.31               22) craype-accel-amd-gfx90a
 10) cray-libsci/24.11.0             23) conduit/0.9.5
 11) PrgEnv-cray/8.6.0               24) vtkm/2.3.0
 12) Core/25.03                      25) ascent/0.9.5
 13) tmux/3.4

where conduit, vtkm and ascent were compiled manually given the setup above. I then create my own virtual environment and install mpi4py:

pip install --no-binary=mpi4py mpi4py

I then follow the notes here to install h5py against the Cray HDF5 from the modules, and finally install PyFR with:

pip install git+https://github.com/PyFR/PyFR.git@develop

Can you confirm if everything works fine without Ascent? With the develop version there is no need for parallel HDF5, so you can just install h5py with pip, using its embedded HDF5, without issue.

Also, can you confirm that HIP-aware MPI is not being used?
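
One quick way to check is to look at the mpi-type key in the [backend-hip] section of the config file. The snippet below is only a minimal sketch using Python's standard configparser rather than PyFR's own config reader; the helper name hip_mpi_type and the example config text are made up for illustration. A result of None just means the key is not set in the file, in which case PyFR falls back to its default.

```python
import configparser


def hip_mpi_type(ini_text):
    """Return the mpi-type set in [backend-hip], or None if it is not set.

    Hypothetical helper for illustration; it is not part of PyFR.
    """
    # Treat ';' as an inline comment marker, which is common in .ini files
    cfg = configparser.ConfigParser(inline_comment_prefixes=(';',))
    cfg.read_string(ini_text)

    # fallback=None covers both a missing section and a missing key
    return cfg.get('backend-hip', 'mpi-type', fallback=None)


# Made-up config fragment; in practice read the text of conf.ini instead
example = """
[backend]
precision = double

[backend-hip]
mpi-type = hip-aware
"""

print(hip_mpi_type(example))
```

If this reports hip-aware, switching the key to standard should make PyFR stage MPI buffers through host memory instead of handing device pointers to MPI.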

If you are still encountering an exception, can you try with plain MPICH compiled yourself? (You are running on a single node, so the fabric doesn’t matter.)

Regards, Freddie.

Thanks! hip-aware must have been the culprit in this case. I just changed it to standard, though now I am reproducing my other problem about temporary file names.

Please file a bug with the admins, as we’ve encountered similar issues on Frontier in the past (I think we noted in our PyFR v2 paper that HIP-aware MPI could not be used due to bugs).