Indefinite initialization with large mesh case

Hello,

I have been using PyFR on TACC Stampede3 and have been having issues running a model with a mesh of approximately 2e8 elements. Namely, the model does not make any stepping progress and appears stuck in the initialization phase indefinitely (48+ hrs). I have tried varying decomposition sizes, up to 200 compute nodes, without success. This is a first-order case running on the OpenMP backend.

All other, smaller PyFR cases I have run in this environment/setup work fine, so it seems to be a size-induced issue. I would appreciate any suggestions on diagnosing whether the start-up is merely slow or is stalling somewhere, and on addressing the issue. Would the start-up optimizations on the performance tuning page help here?

Thank you.

PyFR has recently been impacted by several bugs in NumPy (well, actually OpenBLAS) which can cause deadlocks on start-up. Can you confirm your NumPy version? Additionally, would you be able to attach a debugger to a stalled rank so we can get a backtrace?
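For reference, a quick way to check both the version and the BLAS library NumPy was built against (the latter matters here, since the bug is in OpenBLAS rather than NumPy proper) is:

```python
# Run this under the same Python environment that PyFR uses
import numpy as np

# The installed NumPy version
print(np.__version__)

# Build information, including which BLAS/LAPACK NumPy links against
np.show_config()
```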

A known workaround (assuming you cannot switch to a more recent version of NumPy) is to set OMP_NUM_THREADS=1. This isn’t ideal if you’re running with the OpenMP backend, as it means you will now need to switch to one rank per core.
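As a rough sketch, a job script for this workaround might look like the following; the mesh/config filenames and the figure of 48 cores per node are placeholders for illustration, not values from your case:

```shell
# Restrict OpenBLAS/NumPy to a single thread to avoid the start-up deadlock
export OMP_NUM_THREADS=1

# With the OpenMP backend this also limits PyFR to one thread per rank,
# so launch one MPI rank per core rather than one rank per node
# (48 cores per node assumed here purely as an example)
mpiexec -n 48 pyfr run -b openmp mesh.pyfrm config.ini
```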

Regards, Freddie.

Freddie,

My NumPy version is 2.3.2. Could you confirm if this version is problematic and which version to switch to? I can work towards getting a debugger set up; in the meantime, I will submit a run with your workaround to see if it helps.

Thank you for your help.

From the bug report:

This is a known bad version.

Regards, Freddie.


Just following up here to see if this resolved the issue?

Regards, Freddie.

Freddie,

I apologize for the delay; I’ve been waiting for the job to run. Yes, I can confirm that moving off the bad NumPy version (from 2.3.2 to 2.4.0) has resolved this issue. Additionally, in the past I have noticed occasional freezes, either at start-up or at an intermediate solve time, when running several smaller jobs simultaneously. Could I attribute these freezes to this NumPy bug, or is there anything else I should take into account when running multiple PyFR cases simultaneously?

Thank you.

I would say it is highly likely. The only other candidate would be a deadlock due to async file writing (we’ve had some reports of issues, although these seem to be down to misconfigured systems).

Note that there is a slight chance that another NumPy BLAS bug is lurking. When I get a reproducer I’ll report it to the NumPy team.

Regards, Freddie.

Thank you for the help! I will keep an eye out for further issues/updates regarding NumPy.

Strange, for me it’s the other way around (NumPy 2.4.0 hangs, but 2.3.2 from the Intel distribution works).

Edit: I noticed the GitHub issue linked above. If it is OpenBLAS related, then that might explain why the NumPy from Intel worked (I assume it is built against MKL).

Can you attach a debugger for a case where 2.4.0 hangs and then get a backtrace? This will let you open up an issue on the NumPy repository.
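For anyone unfamiliar with the process, attaching to a stalled rank from a shell on the compute node might look like this (`<PID>` is a placeholder for the process ID of the stalled rank, e.g. found via ps or top):

```shell
# Attach gdb to the stalled rank, dump a backtrace of every thread,
# then detach without otherwise disturbing the process
gdb -p <PID> -batch -ex "thread apply all bt"
```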

Yes, I do expect Intel to build against MKL which does not have the same threading issues as OpenBLAS.

Regards, Freddie.

I have also been experiencing issues again (on 2.4.0) and am currently trying to rule out system-related stability issues. I will update this thread if I find more definitive evidence that this is NumPy related.

Did you have any luck catching NumPy hanging?

Regards, Freddie.

So far, no. I have not observed any hanging in my moderately sized cases; the largest case I have, which experienced the most hanging, has yet to leave the system job queue. The system’s stability issues appear to have been resolved, so the next hang I get will likely indicate a NumPy issue (on 2.4.0), and I will follow up with a debugger.

Thank you, Jay