Segfault at large scale

Hi,

I am trying to run on LUMI using ~1000 GPUs, but I just get a segmentation fault without any errors or warnings. The same case works fine if I reduce the GPU count. Have you ever encountered such an issue? Is it MPI related? Is it a known problem? How did you deal with it?

Best

If there is a segfault, you should be able to get a backtrace from the offending node and see where it crashed.
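
As a rough sketch of one way to capture such a trace from a Python-driven run, the standard library's faulthandler module can write a Python-level traceback to a per-rank file when the process receives SIGSEGV. The helper names (rank_label, enable_crash_traces) and the log file naming are hypothetical, and the rank environment variables assume a Slurm, PMI, or Open MPI launcher. Note that this only shows the Python frames active at the time of the crash; for native frames inside MPI or HIP you would still need a core dump or a debugger.

```python
import faulthandler
import os
import socket


def rank_label(env=os.environ):
    """Return a per-process label taken from common launcher variables."""
    # Assumption: the launcher sets one of these; otherwise fall back to the PID.
    for var in ("SLURM_PROCID", "PMI_RANK", "OMPI_COMM_WORLD_RANK"):
        if var in env:
            return env[var]
    return f"pid{os.getpid()}"


def enable_crash_traces(log_dir="."):
    """Dump Python tracebacks to a per-rank file on SIGSEGV and similar signals."""
    # Hypothetical naming scheme: one file per host and rank.
    name = f"crash-{socket.gethostname()}-{rank_label()}.log"
    fh = open(os.path.join(log_dir, name), "w")
    # The file must stay open for as long as the handler is installed.
    faulthandler.enable(file=fh, all_threads=True)
    return fh
```

Calling enable_crash_traces() once at startup, and keeping the returned file object alive, should leave a log per rank; any file that is non-empty after the crash points at the offending rank and host. Setting PYTHONFAULTHANDLER=1 in the job script has a similar effect but writes to stderr instead.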

In our experience these issues are usually MPI related (OpenMPI in particular can have issues, especially with HIP-aware MPI). There are some lingering problems on AMD related to ROCm, although these do not result in segfaults (they show up as HIP errors instead).

Regards, Freddie.