Hello, everyone.
I have recently been using Slurm to allocate multiple GPUs for running pyFR simulations. Running across several GPUs, pyFR has been completing the simulations successfully and giving satisfactory results.
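For context, this is roughly how I submit the job; the partition-independent resource numbers, time limit, and file names below are placeholders rather than my exact settings:

```bash
#!/bin/bash
#SBATCH --job-name=pyfr-run
#SBATCH --nodes=1
#SBATCH --ntasks=4            # one MPI rank per GPU
#SBATCH --gres=gpu:4
#SBATCH --mem=64G
#SBATCH --time=72:00:00

# Launch pyFR with one rank per GPU (mesh and config names are placeholders);
# mpiexec instead of srun behaves the same way for me.
srun pyfr run -b cuda -p mesh.pyfrm config.ini
```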
However, I have run into a problem with cases that require long-time averaging: the model itself is not very complex, but the simulation has to run for a long physical time in order to accumulate averages over a sufficiently long window.
Partway through such a run, pyFR hangs: the computation stops advancing, the default progress bar freezes, and no matter how much wall-clock time passes, the simulation time no longer moves forward and the estimated remaining time stays fixed.
The Slurm settings appear to be in order: I have allocated the compute resources carefully, requested ample memory, and set a generous time limit for the job. I am not sure why pyFR is getting stuck.
nvidia-smi still lists the processes as type “C” (compute processes), so nothing looks wrong there. But when I attach with strace -p <pid> to inspect the system calls, the process appears to be stuck, repeatedly printing the following:
epoll_wait(6, [], 32, 0) = 0
It seems to be waiting on something, and I assume that during a run which is progressing normally strace would not keep showing only this call. What does it signify?
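In case it helps with the diagnosis, would capturing per-rank stack traces be a sensible next step the next time it hangs? Something like the sketch below is what I have in mind, assuming py-spy and gdb are available on the compute node (the PID is illustrative):

```bash
# List the pyFR ranks running on this node
pgrep -af pyfr

# Python-level stack of one rank (repeat for each PID); py-spy is assumed to be installed
py-spy dump --pid 12345

# Native (MPI/CUDA) stacks of the same rank
gdb -batch -ex "thread apply all bt" -p 12345
```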
Every time this happens, several days of compute time are wasted, which is quite frustrating. Where might the problem originate, and how can I fix or work around it?
Regards.