Saving and restarting also worked with two ranks (Lustre, 3d-triangular-airfoil).
Regards, Kenan
Do you have a small mixed-element case you could test? I am trying to gradually increase the complexity to see where things first start to break for you. This may give us a clue as to where the error is.
Regards, Freddie.
Dear Freddie
Sorry, it took some time to test it again.
I have just tested a smaller mixed-element case, which resulted in a similar failure on Lustre even with the async-timeout=0 setting. Note that the case is only relatively small, consisting of 266K mixed hex/pri p3 elements, partitioned across 8 MPI tasks.
I know this might be a bit of a pain, but would you be able to determine if it is just a single rank that is at fault?
More specifically:
We are extremely interested in getting to the bottom of this.
Regards, Freddie.
Hi @subutai,
Are you able to share the small mixed-element case? I can try running it on a cluster with a Lustre filesystem I have access to so I can reproduce the error and then help debug the issue.
Thanks,
Toby
Hi @subutai,
I’ve just reproduced the error on another mesh. I’ll look into it and see if I can work out what the issue is.
Toby
We’ve pushed what we believe to be a fix for the issue into the release candidate branch. Feedback would be appreciated.
Regards, Freddie.
My test case worked after the fix (without async-timeout=0).
Thanks!