The file system is Lustre, as explained below under the title “Lustre-RZG”:
I have just tried adjusting the stripe count and stripe size, as in the example at the bottom of the page. I have no idea what the ideal values should be.
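For anyone following along, this is roughly what the adjustment looks like. It is only a sketch: it assumes the standard Lustre `lfs setstripe` options `-c` (stripe count) and `-S` (stripe size), the helper name `build_setstripe_cmd` and the directory path are made up for illustration, and the values 4 and 4M are placeholders, not recommendations.

```python
import shlex


def build_setstripe_cmd(directory, stripe_count, stripe_size):
    """Build the argv list for `lfs setstripe` on a directory.

    Hypothetical helper for illustration only. It assumes the standard
    Lustre options: -c for stripe count (-1 means all OSTs, 0 means the
    file system default) and -S for stripe size (for example "4M").
    """
    if stripe_count < -1:
        raise ValueError("stripe_count must be -1 or greater")
    return [
        "lfs", "setstripe",
        "-c", str(stripe_count),
        "-S", str(stripe_size),
        directory,
    ]


# Placeholder values; the right ones depend on the file system and file sizes.
cmd = build_setstripe_cmd("/path/to/output_dir", 4, "4M")
print(shlex.join(cmd))
```

The printed command can be run on a login node before submitting the job. The layout applies to files created in that directory afterwards, and `lfs getstripe` on the directory should show what was actually set.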
Next time, maybe I can try another available file system, namely “ceph-ssd”.
Thanks!
Kenan
Update: on another file system (ceph-ssd), the job failed while writing the t=0 solution output:
Traceback (most recent call last):
File "/mnt/ceph-hdd/cold/nii00228/PyFR/ENV/pyfr_v2.1.0dev/bin/pyfr", line 8, in <module>
sys.exit(main())
^^^^^^
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 242, in main
[1766413715.665341] [ggpu126:3851517:0] cuda_copy_md.c:379 UCX ERROR attempt to allocate cuda memory without active context
[1766413715.665347] [ggpu126:3851517:0] uct_mem.c:158 UCX ERROR failed to allocate 536870912 bytes using md cuda_cpy for ucp_rndv_frags: No such device
[1766413715.665350] [ggpu126:3851517:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=ucp_rndv_frags) chunk: Out of memory
[1766413715.665530] [ggpu126:3851519:0] cuda_copy_md.c:687 UCX ERROR cuMemGetAddressRange(0x14e0bb9df600) error: invalid device context
[1766413715.665541] [ggpu126:3851519:0] ucp_mm.c:574 UCX ERROR uct_md_mem_query(dmabuf address 0x14e0bb9df600 length 121200) failed: Address not valid
[1766413715.665646] [ggpu120:3309122:0] cuda_copy_md.c:687 UCX ERROR cuMemGetAddressRange(0x1489f2962a00) error: invalid device context
[1766413715.665662] [ggpu120:3309122:0] ucp_mm.c:574 UCX ERROR uct_md_mem_query(dmabuf address 0x1489f2962a00 length 102960) failed: Address not valid
[1766413715.666129] [ggpu126:3851518:0] cuda_copy_md.c:379 UCX ERROR attempt to allocate cuda memory without active context
[1766413715.666138] [ggpu126:3851518:0] uct_mem.c:158 UCX ERROR failed to allocate 536870912 bytes using md cuda_cpy for ucp_rndv_frags: No such device
[1766413715.666141] [ggpu126:3851518:0] mpool.c:269 UCX ERROR Failed to allocate memory pool (name=ucp_rndv_frags) chunk: Out of memory
...
args.process(args)
args.process(args)
args.process(args)
args.process(args)
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 548, in process_run
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 548, in process_run
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 548, in process_run
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 548, in process_run
_process_common(args, None, Inifile.load(args.cfg))
_process_common(args, None, Inifile.load(args.cfg))
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 544, in _process_common
_process_common(args, None, Inifile.load(args.cfg))
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 544, in _process_common
_process_common(args, None, Inifile.load(args.cfg))
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 544, in _process_common
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 544, in _process_common
solver.run()
solver.run()
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/base.py", line 170, in run
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/base.py", line 170, in run
args.process(args)
solver.run()
solver.run()
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/base.py", line 170, in run
args.process(args)
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/base.py", line 170, in run
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 548, in process_run
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 548, in process_run
args.process(args)
_process_common(args, None, Inifile.load(args.cfg))
_process_common(args, None, Inifile.load(args.cfg))
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 544, in _process_common
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 548, in process_run
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 544, in _process_common
solver.run()
solver.run()
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/base.py", line 170, in run
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/base.py", line 170, in run
_process_common(args, None, Inifile.load(args.cfg))
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/__main__.py", line 544, in _process_common
solver.run()
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/base.py", line 170, in run
self.advance_to(t)
self.advance_to(t)
self.advance_to(t)
File "/mnt/ceph-hdd/cold/nii00228/PyFR/PyFR_2.1.0dev/pyfr/integrators/std/controllers.py", line 181, in advance_to
The run also created output files with crazy names like pyfr-646052d5-a0f0-4637-a96d-6b794b041f7a.pyfrs