Restarting from partitioned mesh fails

Saving and restarting also worked with two ranks (Lustre, 3d-triangular-airfoil).

Regards, Kenan

Do you have a small mixed-element case you could test? I am trying to gradually increase the complexity to see where things first start to break for you. This may give us a clue as to where the error is.

Regards, Freddie.

Dear Freddie

Sorry, it took some time to test this again.

I have just tested a smaller mixed-element case, which resulted in a similar failure on Lustre even with the async-timeout=0 setting. Note that the case is only relatively small: it consists of 266K mixed hex/pri p3 elements, partitioned across 8 MPI tasks.

I know this might be a bit of a pain, but would you be able to determine if it is just a single rank that is at fault?

More specifically:

We are extremely interested in getting to the bottom of this.

Regards, Freddie.

Hi @subutai,

Are you able to share the small mixed-element case? I can try running it on a cluster with a Lustre filesystem I have access to so I can reproduce the error and then help debug the issue.

Thanks,
Toby

Hi @subutai,

I’ve just reproduced the error on another mesh. I’ll look into it and see if I can work out what the issue is.

Toby

We’ve pushed what we believe to be a fix for the issue into the release candidate branch. Feedback would be appreciated.

Regards, Freddie.

My test case worked after the fix (without async-timeout=0).

Thanks!