Scaling studies for isentropic vortex

Hi,

I've posted my first attempt at benchmarking PyFR with the CUDA backend.

I'd like to work more on it and extend it to cover the remaining cases in your
JCP paper.

The actual study is in isentropic_vortex.sh, and I apologise for all the
cluster-specific stuff in there. There's some LSF data manager material which
may be confusing if you haven't seen it before.

I am testing the code on Power8 nodes with four GPUs each, arranged as two
GPUs per socket. At the moment the results look fairly random and I don't see
either weak or strong scaling. This may be down to our cluster or to my
installation, but I thought I'd share the files and ask if you have any
feedback.

The only thing I was able to verify is that the installation spawns processes
across the GPUs on different nodes.

Let me know if you have any ideas. Also, please let me know if this is the
right way of having this discussion.

Best wishes,
Robert

Hi Robert,

The vortex test case is 2D and runs on a grid with relatively few elements, so just running the case as-is you are already close to the strong scaling limit. The working set here is almost small enough to fit into cache!

As such the case should not be used for benchmarking. For that you will want to consider a 3D Navier-Stokes case, which is also somewhat more realistic than the 2D Euler equations.
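
As a very rough illustration of the working-set gap (the element counts below are my own and purely illustrative, not from the paper):

    # DoF count: (p + 1)**ndims solution points per element times the
    # number of conservative variables (4 for 2D Euler, 5 for 3D N-S)
    def dofs(nelems, p, ndims):
        nvars = 4 if ndims == 2 else 5
        return nelems * (p + 1)**ndims * nvars

    print(dofs(1600, 3, ndims=2))    # vortex-sized 2D quad grid: ~0.1M DoF
    print(dofs(100000, 4, ndims=3))  # modest 3D hex grid: ~62.5M DoF

At ~0.1M DoF a single modern GPU is nowhere near saturated, which is why the scaling numbers look random.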

Regards, Freddie.

Great... Yes, I didn't estimate the degrees of freedom for this; I was trying
to be too quick. I've uploaded the sd7003 case. I added residual printing and
can already see a 2x speedup going from one node to two. I am running a full
test now.

I have several related questions and comments:
1) What is [backend] rank-allocator = linear? Does this not conflict with MPI
    options, e.g. --rank-by from Open MPI or the binding policy of MVAPICH?
    This is significant for me as I have two GPUs per socket and 64 hardware
    threads per socket; I don't want four processes to run on the first socket
    alone.

    I print my bindings in MVAPICH and they look OK, but I want to double
    check that Python is not doing something else under the hood.

2) What is the rough DoF estimate for the strong scaling limit you observed
    with PyFR?

3) At the moment I am setting four MPI processes per node as I've got four
    GPUs, but I assume there's nothing to stop me from using more. Has anyone
    looked at the optimal ratio of MPI processes to GPUs?

Thanks,
Robert

Hi Robert,

Great... Yes, I didn't estimate the degrees of freedom for this; I was trying
to be too quick. I've uploaded the sd7003 case. I added residual printing and
can already see a 2x speedup going from one node to two. I am running a full
test now.

I have several related questions and comments:
1) What is [backend] rank-allocator = linear? Does this not conflict with MPI
    options, e.g. --rank-by from Open MPI or the binding policy of MVAPICH?
    This is significant for me as I have two GPUs per socket and 64 hardware
    threads per socket; I don't want four processes to run on the first socket
    alone.

So the rank allocator decides how partitions in the mesh are mapped onto MPI ranks. The linear allocator is exactly what you would expect: the first MPI rank gets the first partition, and so on. There is also a random allocator. Having four processes on one socket is probably okay; I doubt you would notice much of a difference compared with an even split. When running with the CUDA backend PyFR is entirely single-threaded and offloads all relevant computation to the GPU. We also work hard to mask the latencies of host-to-device transfers, so even sub-optimal assignments usually work out.
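
For reference, this all lives in the .ini file; a minimal sketch (device-id = local-rank is the usual choice for multiple GPUs per node, but do check the option against your PyFR version):

    [backend]
    precision = double
    rank-allocator = linear

    [backend-cuda]
    ; pick the GPU by node-local MPI rank, so four ranks on a node
    ; get four distinct devices
    device-id = local-rank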

    I print my bindings in MVAPICH and they look OK, but I want to double
    check that Python is not doing something else under the hood.
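
If you want to verify this from inside Python itself, a quick check along these lines (Linux-only; mpi4py is already a PyFR dependency) will print what each rank is actually bound to:

    import os
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    # sched_getaffinity reports the set of CPUs this process may run on
    print('rank %d on %s: CPUs %s' % (comm.rank, MPI.Get_processor_name(),
                                      sorted(os.sched_getaffinity(0))))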

2) What is the rough DoF estimate for the strong scaling limit you observed
    with PyFR?

That is a good question and one which I do not have much of a feel for. I would say that you want on the order of a thousand *elements* per GPU; below that you may begin to see your strong scaling curve flatten out.

3) At the moment I am setting four MPI processes per node as I've got four
    GPUs, but I assume there's nothing to stop me from using more. Has anyone
    looked at the optimal ratio of MPI processes to GPUs?

One MPI rank per GPU will be optimal. Anything else will just introduce extra overheads.
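
For example, on two of your four-GPU nodes, something along these lines keeps one rank per device (the filenames are placeholders and the per-node flag depends on your launcher):

    mpirun -np 8 -ppn 4 pyfr run -b cuda sd7003.pyfrm sd7003.ini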

Regards, Freddie.

Hi Robert,

2) What is the rough DoF estimate for the strong scaling limit you observed
   with PyFR?

That is a good question and one which I do not have much of a feel for. I would say that you want on the order of a thousand *elements* per GPU; below that you may begin to see your strong scaling curve flatten out.

Based on our Gordon Bell results on K20X, things start to tail off for 3D compressible Navier-Stokes when you get down to ~1 element per CUDA core, where an element here is a P4 hexahedron with 5 x 5 x 5 solution points x 5 field variables = 625 DoFs.
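
Putting rough numbers on that (a K20X has 2688 CUDA cores):

    cores = 2688                 # CUDA cores on a K20X
    dof_per_elem = 5**3 * 5      # P4 hex: 125 solution points x 5 variables
    print(cores * dof_per_elem)  # ~1.7M DoF per GPU before tail-off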

Cheers

Peter

Dr Peter Vincent MSci ARCS DIC PhD
Reader in Aeronautics and EPSRC Fellow
Department of Aeronautics
Imperial College London
South Kensington
London
SW7 2AZ
UK

web: www.imperial.ac.uk/aeronautics/research/vincentlab
twitter: @Vincent_Lab