Poor TGV performance with GPU and MPI

When I run the TGV case from the TGV Performance Numbers - General - PyFR thread with the following config file:

[backend]
precision = double
rank-allocator = linear

[backend-cuda]
device-id = local-rank

[constants]
gamma = 1.4
mu = 6.25e-4
Pr = 0.71
Ps = 111.607

[solver]
system = navier-stokes
order = 3
anti-alias = none
viscosity-correction = none
shock-capturing = none

[solver-time-integrator]
scheme = rk4
controller = none
tstart = 0
tend = 1.0
dt = 1.0e-4
# atol = 0.000001
# rtol = 0.000001
# safety-fact = 0.9
# min-fact = 0.3
# max-fact = 2.5


[solver-interfaces]
riemann-solver = rusanov
# These ldg settings will hit the interface communication a little harder
ldg-beta = 0.5
ldg-tau = 0.1

[solver-interfaces-quad]
flux-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre

[solver-elements-hex]
soln-pts = gauss-legendre
quad-deg = 6
quad-pts = gauss-legendre

[soln-plugin-nancheck]
# nsteps = 10
nsteps = 1000

[soln-plugin-integrate]
nsteps = 2000
#nsteps = 50000
file = integral_fp64.csv
header = true
vor1 = (grad_w_y - grad_v_z)
vor2 = (grad_u_z - grad_w_x)
vor3 = (grad_v_x - grad_u_y)

int-E = rho*(u*u + v*v + w*w)
int-enst = rho*(%(vor1)s*%(vor1)s + %(vor2)s*%(vor2)s + %(vor3)s*%(vor3)s)

[soln-plugin-writer]
dt-out = 0.01
# depending on the environment you might want to change this
basedir = .
basename = nse_fp64_tgv_3d_p3-{t:.2f}

[soln-ics]
rho = 1
u = sin(x)*cos(y)*cos(z)
v = -cos(x)*sin(y)*cos(z)
w = 0
p = Ps + (1.0/16.0)*(cos(2*x) + cos(2*y))*(cos(2*z) + 2)

I run PyFR 1.14.0 under Slurm with 2 GPUs, and the wall time is

[root@snode06 slurm-pyfr-bench-2c-long]# h5dump -d stats nse_fp64_tgv_3d_p3-1.00.pyfrs
HDF5 "nse_fp64_tgv_3d_p3-1.00.pyfrs" {
DATASET "stats" {
   DATATYPE  H5T_STRING {
      STRSIZE 218;
      STRPAD H5T_STR_NULLPAD;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SCALAR
   DATA {
   (0): "[data]
           fields = rho,rhou,rhov,rhow,E
           prefix = soln

           [backend]
           rank-allocation = 0,1

           [solver-time-integrator]
           tcurr = 1.0
           wall-time = 1309.563977241516
           nsteps = 10046
           nacptsteps = 10046
           nrjctsteps = 0
           nfevals = 40184

           "
   }
}
}


but with 1 GPU the wall time is

[root@snode06 lilu]# h5dump -d stats nse_fp64_tgv_3d_p3-1.00.pyfrs
HDF5 "nse_fp64_tgv_3d_p3-1.00.pyfrs" {
DATASET "stats" {
   DATATYPE  H5T_STRING {
      STRSIZE 216;
      STRPAD H5T_STR_NULLPAD;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SCALAR
   DATA {
   (0): "[data]
           fields = rho,rhou,rhov,rhow,E
           prefix = soln

           [backend]
           rank-allocation = 0

           [solver-time-integrator]
           tcurr = 1.0
           wall-time = 416.4806327819824
           nsteps = 10046
           nacptsteps = 10046
           nrjctsteps = 0
           nfevals = 40184

           "
   }
}
}
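
(For reference, the same figure can also be pulled out of the stats dataset programmatically rather than via h5dump; a minimal sketch, assuming h5py is installed and using the file name above:)

# Minimal sketch: read the wall-time entry from a PyFR .pyfrs stats dataset
import h5py

with h5py.File("nse_fp64_tgv_3d_p3-1.00.pyfrs", "r") as f:
    stats = f["stats"][()].decode().rstrip("\x00")  # scalar string dataset

# The stats string is INI-formatted; pick out the wall-time line
for line in stats.splitlines():
    if line.strip().startswith("wall-time"):
        print(float(line.split("=", 1)[1]))
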
It seems that using more GPUs makes it even slower. Why?

Try completely removing the nancheck and integrate plugins.

What GPUs are you running on and what size mesh are you using?

I run on 2 A100s, and the mesh is

[root@snode06 tgv_out]# h5ls nse_fp64_tgv_3d_p3-0.02.pyfrs
config                   Dataset {SCALAR}
config-0                 Dataset {SCALAR}
mesh_uuid                Dataset {SCALAR}
soln_hex_p0              Dataset {64, 5, 32000}
soln_hex_p1              Dataset {64, 5, 32000}
stats                    Dataset {SCALAR}
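
For context, a rough back-of-the-envelope estimate of the per-GPU workload implied by those dataset shapes, i.e. (solution points per hex, field variables, hexes per rank), as a sketch:

# Rough per-rank workload implied by the dataset shapes above (a sketch)
npts, nvars, nelems = 64, 5, 32000      # p=3 hex solution points, conserved vars, hexes per rank
dofs = npts * nvars * nelems            # ~10.2 million DoFs per rank
print(f"{dofs:,} DoFs, ~{dofs * 8 / 1e6:.0f} MB of FP64 state per A100")

That is a fairly modest load per A100, which is presumably why a larger mesh is suggested below; at this size, fixed per-step costs such as kernel launches, the MPI halo exchange, the plugins and file output take a comparatively large share of the wall time.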

Can you try increasing the mesh size? You can try this tool if you aren’t familiar with gmsh: GitHub - WillTrojak/basic_gmsh: Python scripts to make basic gmsh mesh files
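
The repository above has ready-made scripts; purely as an illustration, a larger structured hex box can also be generated directly with the gmsh Python API, along the lines of the sketch below (note it does not add the boundary physical groups or periodic surface pairings that the PyFR TGV case needs before pyfr import, so it is not a drop-in replacement).

# Illustrative sketch: a finer structured hex box via the gmsh Python API
import math
import gmsh

gmsh.initialize()
gmsh.model.add("tgv_box")

L = 2 * math.pi                                    # TGV domain edge length
gmsh.model.occ.addBox(-L / 2, -L / 2, -L / 2, L, L, L)
gmsh.model.occ.synchronize()

n = 65                                             # nodes per edge -> 64^3 hexes
for dim, tag in gmsh.model.getEntities(1):
    gmsh.model.mesh.setTransfiniteCurve(tag, n)
for dim, tag in gmsh.model.getEntities(2):
    gmsh.model.mesh.setTransfiniteSurface(tag)
    gmsh.model.mesh.setRecombine(2, tag)
gmsh.model.mesh.setTransfiniteVolume(1)

grp = gmsh.model.addPhysicalGroup(3, [1])          # volume group for the import
gmsh.model.setPhysicalName(3, grp, "fluid")

gmsh.model.mesh.generate(3)
gmsh.write("tgv_hex_64.msh")
gmsh.finalize()
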

Please rerun with just a single file output as opposed to ~100.

Regards, Freddie.

Yep, I didn’t notice that. Writing that many files will be ruining your performance.