CUDAInvalidDevice error with Slurm srun

Hi, I’m getting the following error when running simulations via Slurm, but I cannot figure out how to troubleshoot it. Any ideas?

srun: error: largemem-4-1: task 2: Exited with exit code 1
Traceback (most recent call last):
  File "/miniconda3/envs/pyFR/bin/pyfr", line 8, in <module>
    sys.exit(main())
  File "/miniconda3/envs/pyFR/lib/python3.10/site-packages/pyfr/__main__.py", line 118, in main
    args.process(args)
  File "/miniconda3/envs/pyFR/lib/python3.10/site-packages/pyfr/__main__.py", line 251, in process_run
    _process_common(
  File "/miniconda3/envs/pyFR/lib/python3.10/site-packages/pyfr/__main__.py", line 230, in _process_common
    backend = get_backend(args.backend, cfg)
  File "/miniconda3/envs/pyFR/lib/python3.10/site-packages/pyfr/backends/__init__.py", line 12, in get_backend
    return subclass_where(BaseBackend, name=name.lower())(cfg)
  File "/miniconda3/envs/pyFR/lib/python3.10/site-packages/pyfr/backends/cuda/base.py", line 41, in __init__
    self.cuda.set_device(get_local_rank())
  File "/miniconda3/envs/pyFR/lib/python3.10/site-packages/pyfr/backends/cuda/driver.py", line 481, in set_device
    self.lib.cuDeviceGet(dev, devid)
  File "/miniconda3/envs/pyFR/lib/python3.10/site-packages/pyfr/ctypesutil.py", line 33, in _errcheck
    raise self._statuses[status]
pyfr.backends.cuda.driver.CUDAInvalidDevice
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 14

Moreover, when I use the plugin to time-average quantities, the simulation gets very, very slow. Is this normal?

The time-averaging setup I’m using is:

; TIME AVERAGE QUANTITIES
[soln-plugin-tavg]
nsteps = 10        ; accumulate averaged quantities every nsteps
mode = continuous  ; time average over all previous time written every dt-out
dt-out = 1.66e-3   ; write time averaged file every dt-out chosen
basedir = time_avg_quantities
basename = All-tavg-{t:1f}

avg-p = p          ; pressure average
avg-rho = rho      ; density average
avg-u = u          ; u-component velocity average
avg-v = v          ; v-component velocity average
avg-w = w          ; w-component velocity average

avg-vel = sqrt(u*u + v*v + w*w)

avg-uu = u*u
avg-vv = v*v
avg-ww = w*w

avg-upup = uu - u*u
avg-vpvp = vv - v*v
avg-wpwp = ww - w*w

avg-upvp = uv - u*v
avg-upwp = uw - u*w
avg-vpwp = vw - v*w

avg-urms = sqrt(uu - u*u + vv - v*v + ww - w*w)

Moreover, I’m only interested in an average over all of the time steps simulated; is there an easy way to achieve this in PyFR? I do not care about time averages over intermediate windows.

@fdw might have more insight on the CUDA-related error.

The issue with the time averager is that you are outputting a file every 1.66e-3, which, depending on your time step, may be very frequent. Just to reiterate the docs: nsteps is the frequency at which the accumulators for the average are updated, and dt-out is the frequency at which average files are written. In most cases you do not want to write out the average data very frequently.
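For example, assuming (purely for illustration) a time step of 1e-5, dt-out = 1.66e-3 means an averaged file is written roughly every 166 steps, and each of those writes has to flush the full set of averaged fields to disk, which can easily dominate the runtime.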

It is highly likely you are launching more ranks per node than there are GPUs.
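As the traceback shows, each rank selects the CUDA device given by its node-local rank (set_device(get_local_rank())), so you should launch no more tasks per node than there are GPUs per node. As a minimal sketch, assuming a cluster with four GPUs per node and using placeholder file names, the submission script would look something like:

#!/bin/bash
#SBATCH --nodes=2               # hypothetical node count
#SBATCH --ntasks-per-node=4     # one MPI rank per GPU on the node
#SBATCH --gres=gpu:4            # request as many GPUs per node as ranks

# With more ranks per node than GPUs, the extra ranks ask cuDeviceGet for a
# device index that does not exist, which raises CUDAInvalidDevice.
srun pyfr run -b cuda mesh.pyfrm config.ini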

Regards, Freddie.


Regarding time averaging, a better setup, considering a total simulation time of t = 10 s, would be something like:

[soln-plugin-tavg]
nsteps = 100       ; accumulate averaged quantities every nsteps
mode = continuous  ; time average over all previous time written every dt-out
dt-out = 1         ; write time averaged file every dt-out chosen
basedir = time_avg_quantities
basename = All-tavg-{t:1f}

avg-p = p          ; pressure average
avg-rho = rho      ; density average
avg-u = u          ; u-component velocity average
avg-v = v          ; v-component velocity average
avg-w = w          ; w-component velocity average

avg-vel = sqrt(u*u + v*v + w*w)

avg-uu = u*u
avg-vv = v*v
avg-ww = w*w

avg-upup = uu - u*u
avg-vpvp = vv - v*v
avg-wpwp = ww - w*w

avg-upvp = uv - u*v
avg-upwp = uw - u*w
avg-vpwp = vw - v*w

avg-urms = sqrt(uu - u*u + vv - v*v + ww - w*w)

By the way, just as general advice: do you suggest doing the averaging in PyFR directly or in a post-processing phase (e.g. in ParaView)? Are there any pros and cons one should account for?

It depends what you mean by averaging. Obtaining average quantities in ParaView will be expensive as you’ll need to output the entire flowfield with some frequency, convert all of these flowfields to .vtu files, and then open them all up in ParaView to obtain the average. In this regard the plugin built into PyFR is much more efficient.
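As a rough sketch of what the ParaView route entails (file names here are placeholders), every snapshot written during the run has to be converted before any averaging can happen:

# One pyfr export (and one sizeable .vtu file) per written snapshot,
# all of which must then be loaded into ParaView to compute the average.
for f in soln-*.pyfrs; do
    pyfr export mesh.pyfrm "$f" "${f%.pyfrs}.vtu"
done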

Regards, Freddie.
