OSError: Unable to open file, failed import

Hi,

I have been having some trouble running PyFR on a cluster.

When I run the following series of commands in my bash shell

conda init bash
source .bashrc
conda activate pyFR
ml gnu8 metis
cd PyFR-Test-Cases/2d-euler-vortex
pyfr import 2d-euler-vortex.msh 2d-euler-vortex.pyfrm
pyfr partition 2 2d-euler-vortex.pyfrm .
mpiexec -n 2 pyfr run -b openmp -p 2d-euler-vortex.pyfrm 2d-euler-vortex.ini

everything works just fine and the simulation ends correctly.

When I run the same workflow via an sbatch submission I get these error lines:

  fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 1760, sblock->base_addr = 0, stored_eof = 45284)

The step which fails when running in sbatch is

pyfr import 2d-euler-vortex.msh 2d-euler-vortex.pyfrm
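For context, my submission script looks roughly like this (the SBATCH directive values are placeholders; the actual account, partition and walltime settings are specific to my cluster):

```shell
#!/bin/bash
#SBATCH --job-name=euler-vortex
#SBATCH --ntasks=2
# (account/partition/walltime directives omitted; they are cluster-specific)

source ~/.bashrc
conda activate pyFR
ml gnu8 metis

cd PyFR-Test-Cases/2d-euler-vortex
pyfr import 2d-euler-vortex.msh 2d-euler-vortex.pyfrm
pyfr partition 2 2d-euler-vortex.pyfrm .
mpiexec -n 2 pyfr run -b openmp -p 2d-euler-vortex.pyfrm 2d-euler-vortex.ini
```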

Moreover, I have an issue with the CUDA backend, and I'm not sure if it is related only to the CUDA version (11.1, which is lower than the required >= 11.4 and which I cannot update at the moment):

func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /lib64/libcuda.so: undefined symbol: cuDeviceGetUuid_v2

Any idea on how to solve?

For the file issue can you please try running h5dump on the file in question (from sbatch) and seeing if it works?
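For example, dumping just the header is enough to tell whether the HDF5 superblock is intact (the file name here matches your case; -H avoids printing the data itself):

```shell
h5dump -H 2d-euler-vortex.pyfrm
```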

The version requirements for various components are not chosen arbitrarily. We need 11.4 because some functions used by PyFR were only added in 11.4.
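As a quick sanity check (a sketch, not part of PyFR), you can probe whether your driver library exports the symbol from the traceback; the path is the one your error message shows:

```shell
# Report whether a shared library exports a given symbol.
# Prints "present" if found, "missing" if the symbol (or the
# library itself, or nm) is not available.
check_cuda_symbol() {
    if [ -e "$1" ] && nm -D "$1" 2>/dev/null | grep -q "$2"; then
        echo present
    else
        echo missing
    fi
}

check_cuda_symbol /lib64/libcuda.so cuDeviceGetUuid_v2
```

If this prints "missing", your driver is too old for the CUDA >= 11.4 functions PyFR uses.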

Regards, Freddie.

Running from sbatch

h5dump -d 2d-euler-vortex.msh -o dataset.txt

and I get

h5dump error: unable to open file "dataset.txt"

Regarding the CUDA error: do you think it is related to the CUDA version, or could it be caused by something else?

This suggests that the error is file system related, and not to do with PyFR. You will need to follow up with your system administrator to debug what might be the cause of this.

If you are using a CUDA version below 11.4 you will see errors.

Regards, Freddie.

I’m not sure about that. If I run from sbatch:

srun -n 1 test.sh

it works just fine (with the openmp backend), but I’m limited to 1 process (see Writing MPS to .h5 file in a multi-core process - ITensor Support Q&A), so it should not be file-system related.

Can you confirm if you are getting the error when running PyFR or when trying to import a mesh? The latter should only be run on a single rank.

Regards, Freddie.

The error only appears when I try to import the mesh with n > 1.

In case this could solve the problem, how can I switch from -n 1 for the mesh import to, say, -n 100 when running the simulation?

Every command except for running a simulation should be performed on a single rank. Import the mesh, partition it, and then submit your batch job on as many ranks as you have used to partition the domain and have this execute pyfr run ....
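As a sketch (the rank count of 100 is illustrative), the split between preprocessing and the batch job would look like this:

```shell
# On a single rank (login node or a one-task job) -- run once:
pyfr import 2d-euler-vortex.msh 2d-euler-vortex.pyfrm
pyfr partition 100 2d-euler-vortex.pyfrm .

# The batch job then only runs the solver, on matching ranks:
#   #SBATCH --ntasks=100
#   mpiexec -n 100 pyfr run -b openmp -p 2d-euler-vortex.pyfrm 2d-euler-vortex.ini
```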

Regards, Freddie.


I’ll try and let you know

Regards, Frank