pyFR encounters a blocking issue when using Slurm

Hello, everyone.

I have recently been using Slurm to allocate multiple GPUs for running pyFR simulations. With multi-GPU coordination, pyFR has been completing these simulations successfully and producing satisfactory results.
However, I have now run into a problem with cases that require long-time averaging. The model itself is not very complex, but the simulation has to run for a long time in order to accumulate statistics over a sufficiently long averaging window.

During these runs, pyFR hangs. The computation stops advancing, the default progress bar freezes, and no matter how much wall-clock time passes, the simulation time steps do not progress and the estimated remaining time stays fixed.
The Slurm settings appear to be in order: I have allocated the computational resources carefully, provided sufficient memory, and requested a long enough wall-time limit. I am not sure why pyFR's calculations are getting stuck.
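For reference, my submission script is essentially of the following form (the node counts, wall time and file names here are placeholders rather than my exact values):

#!/bin/bash
#SBATCH --job-name=pyfr-avg        # placeholder job name
#SBATCH --nodes=2                  # several GPU nodes
#SBATCH --ntasks-per-node=4        # one MPI rank per GPU
#SBATCH --gpus-per-node=4
#SBATCH --time=96:00:00            # generous wall-time limit for the averaging run

# one pyFR rank per GPU, CUDA backend
mpiexec -n 8 pyfr run -b cuda mesh.pyfrm config.ini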

nvidia-smi shows my tasks with process type “C” (compute), and nothing looks obviously wrong there. However, when I use strace -p pid to inspect the system calls, the process seems to be stuck, continuously printing the following:

epoll_wait(6, [], 32, 0)      = 0

It seems to be waiting for something; I assume a run that is progressing normally would not spin on this call indefinitely. What does it signify?
Every time this issue occurs, several days of computation are wasted, which is quite frustrating. Where might the problem originate, and how can I avoid it?

Regards.

There could be any one of a number of issues. It is unlikely to be related to Slurm and more likely to be a bad node or a file system configuration issue.

This need not mean several days of lost computation, however, provided you are regularly writing checkpoint files (say every hour or so). Then, if the simulation crashes (on larger simulations this is typically due to a node going down), you can simply restart from the most recent checkpoint. Collected statistics can then be merged together later as a post-processing step.
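As a concrete sketch (the output interval and file names below are only examples, not values tuned to your case), the writer plugin section in your PyFR .ini file is what periodically dumps solution files, and these can serve as checkpoints:

[soln-plugin-writer]
# dt-out is in simulation time units; pick a value that corresponds
# to roughly an hour of wall-clock time for your case
dt-out   = 10.0
basedir  = .
basename = soln-{t:.2f}

Should the run then die, it can be resumed from the most recent solution file with something along the lines of pyfr restart mesh.pyfrm soln-120.00.pyfrs.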

Regards, Freddie.

Thank you for your quick response.

I don’t have much knowledge about HPC systems, so I didn’t fully understand what you meant.

When you mention “bad nodes”, are you referring to a problem with a specific GPU node? If so, how should I troubleshoot it?
As for the “file system configuration issue”, are you talking about pyFR's .ini file or the HPC system's file system configuration? If it is the latter, I am not proficient in HPC administration, so I am not sure where to start.

More importantly, I do not know what a checkpoint file is or how to write one.
My simulation did not crash; it just got stuck: wall-clock time keeps passing, but the simulation time steps do not advance. Does the output of “strace -p pid” help you?

I would greatly appreciate it if you could clarify these points further.
Regards.