I ran the TGV case from the thread "TGV Performance Numbers - General - PyFR" with the following config file:

```
[backend]
precision = double
rank-allocator = linear
[backend-cuda]
device-id = local-rank
[constants]
gamma = 1.4
mu = 6.25e-4
Pr = 0.71
Ps = 111.607
[solver]
system = navier-stokes
order = 3
anti-alias = none
viscosity-correction = none
shock-capturing = none
[solver-time-integrator]
scheme = rk4
controller = none
tstart = 0
tend = 1.0
dt = 1.0e-4
# atol = 0.000001
# rtol = 0.000001
# safety-fact = 0.9
# min-fact = 0.3
# max-fact = 2.5
[solver-interfaces]
riemann-solver = rusanov
# These ldg settings will hit the interface communication a little harder
ldg-beta = 0.5
ldg-tau = 0.1
[solver-interfaces-quad]
flux-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre
[solver-elements-hex]
soln-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre
[soln-plugin-nancheck]
# nsteps = 10
nsteps = 1000
[soln-plugin-integrate]
nsteps = 2000
#nsteps = 50000
file = integral_fp64.csv
header = true
vor1 = (grad_w_y - grad_v_z)
vor2 = (grad_u_z - grad_w_x)
vor3 = (grad_v_x - grad_u_y)
int-E = rho*(u*u + v*v + w*w)
int-enst = rho*(%(vor1)s*%(vor1)s + %(vor2)s*%(vor2)s + %(vor3)s*%(vor3)s)
[soln-plugin-writer]
dt-out = 0.01
# depending on the environment you might want to change this
basedir = .
basename = nse_fp64_tgv_3d_p3-{t:.2f}
[soln-ics]
rho = 1
u = sin(x)*cos(y)*cos(z)
v = -cos(x)*sin(y)*cos(z)
w = 0
p = Ps + (1.0/16.0)*(cos(2*x) + cos(2*y))*(cos(2*z + 2))
```

I ran PyFR 1.14.0 under Slurm. With 2 GPUs, the wall time is:

```
[root@snode06 slurm-pyfr-bench-2c-long]# h5dump -d stats nse_fp64_tgv_3d_p3-1.00.pyfrs
HDF5 "nse_fp64_tgv_3d_p3-1.00.pyfrs" {
DATASET "stats" {
DATATYPE H5T_STRING {
STRSIZE 218;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "[data]
fields = rho,rhou,rhov,rhow,E
prefix = soln
[backend]
rank-allocation = 0,1
[solver-time-integrator]
tcurr = 1.0
wall-time = 1309.563977241516
nsteps = 10046
nacptsteps = 10046
nrjctsteps = 0
nfevals = 40184
"
}
}
}
```

But with 1 GPU, the wall time is:

```
[root@snode06 lilu]# h5dump -d stats nse_fp64_tgv_3d_p3-1.00.pyfrs
HDF5 "nse_fp64_tgv_3d_p3-1.00.pyfrs" {
DATASET "stats" {
DATATYPE H5T_STRING {
STRSIZE 216;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "[data]
fields = rho,rhou,rhov,rhow,E
prefix = soln
[backend]
rank-allocation = 0
[solver-time-integrator]
tcurr = 1.0
wall-time = 416.4806327819824
nsteps = 10046
nacptsteps = 10046
nrjctsteps = 0
nfevals = 40184
"
}
}
}
```

It seems that using more GPUs makes the run even slower. Why?
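To put a number on the difference, here is a minimal sketch that compares the two runs. It assumes the `[solver-time-integrator]` section of each `stats` string has been pasted in by hand (the names `STATS_1GPU`, `STATS_2GPU` and `read_timing` are just for illustration); reading the values straight from the `.pyfrs` files would need an HDF5 library and is not shown.

```python
from configparser import ConfigParser

# Stats text copied by hand from the two h5dump outputs above,
# trimmed to the section that holds the timing information.
STATS_2GPU = """
[solver-time-integrator]
tcurr = 1.0
wall-time = 1309.563977241516
nsteps = 10046
"""

STATS_1GPU = """
[solver-time-integrator]
tcurr = 1.0
wall-time = 416.4806327819824
nsteps = 10046
"""


def read_timing(stats_text):
    """Return (wall_time_seconds, nsteps) from an INI-style stats string."""
    cfg = ConfigParser()
    cfg.read_string(stats_text)
    sect = cfg["solver-time-integrator"]
    return sect.getfloat("wall-time"), sect.getint("nsteps")


t1, n1 = read_timing(STATS_1GPU)
t2, n2 = read_timing(STATS_2GPU)

# How many times longer the 2-GPU run took than the 1-GPU run.
slowdown = t2 / t1

# Strong-scaling efficiency on 2 GPUs: 1.0 would mean the wall time halved.
efficiency = t1 / (2 * t2)

print(f"1 GPU : {t1 / n1:.4f} s per step")
print(f"2 GPUs: {t2 / n2:.4f} s per step")
print(f"slowdown = {slowdown:.2f}x, parallel efficiency = {efficiency:.1%}")
```

Both runs take the same number of steps (10046), so the 2-GPU run is roughly 3.1 times slower per step rather than doing more work.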