Example Datasets/configs for PyFR?

Hello everyone,

I am Pablo, and I will be participating in the Student Cluster Competition as part of the EAFIT/PURDUE team at ISC15.

One of the applications we have to run for the competition is PyFR, so I was wondering if you have any example datasets, besides the two that come with PyFR itself, so we can test performance on different hardware options.

In addition, I was wondering if you have any advice regarding the use of CPUs, GPUs, or (if possible) a combination of both to get better performance with PyFR.

Thanks in advance for your help.

One of the applications we have to run for the competition is PyFR, so I was wondering if you have any example datasets, besides the two that come with PyFR itself, so we can test performance on different hardware options.

I would certainly recommend avoiding the two provided test cases for benchmarking purposes. Both are two-dimensional and contain relatively few elements, so run-times are dominated by overhead from the Python layer, inefficiencies in BLAS, and memory bandwidth.

3D test cases are far more suitable and realistic. I am sure someone on
the mailing list will post some.

In addition, I was wondering if you have any advice regarding the use of CPUs, GPUs, or (if possible) a combination of both to get better performance with PyFR.

The CUDA backend should perform reasonably well out of the box. Very
little configuration should be required on the software side of things.

The OpenCL backend is a little more complicated. The underlying clBLAS library supports auto-tuning, which can result in improved performance on some devices. There is documentation on the clBLAS GitHub page on how to go about tuning it.
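It can also pay to make sure the backend is targeting the intended device; PyFR's OpenCL backend accepts platform and device selection options in the configuration file, along these lines (option names per the PyFR documentation; check them against your version):

[backend-opencl]
platform-id = 0
device-type = gpu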

Finally, there is the C/OpenMP backend. There are a variety of options here. Firstly, you will want to run with one MPI rank per NUMA zone (usually a socket), so use pyfr-mesh to partition the mesh into as many pieces as there are sockets. Be sure that the MPI/OpenMP libraries are getting along and correctly binding threads and processes to cores and processors.
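For instance, on a two-socket node this might look as follows (a sketch only: the binding flag shown is Open MPI's, other MPI libraries have equivalents, and the thread count is illustrative):

pyfr-mesh partition 2 mymesh.pyfrm .
OMP_NUM_THREADS=8 mpirun -np 2 --bind-to socket pyfr-sim -b openmp -p run mymesh.pyfrm mycfg.ini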

PyFR is capable of using either a serial or parallel CBLAS library. If a serial BLAS library is specified then PyFR will perform its own parallelisation using OpenMP. This almost always outperforms the multi-threading done by the BLAS libraries themselves.

In my experience ATLAS (serial) tends to perform best, followed by MKL
(serial) and OpenBLAS (serial). The trick to using MKL is to specify it as:

[backend-openmp]
cblas-st = libmkl_rt.so ; if letting PyFR do the threading
cblas-mt = libmkl_rt.so ; if letting MKL do the threading

To stop MKL from performing its own threading (allowing PyFR to do the threading) you can export `MKL_NUM_THREADS=1`.

The choice of compiler for the OpenMP backend does not normally make a huge amount of difference, although it can be specified via

cc = my_compiler_command
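This option lives in the same section as the BLAS settings, e.g. (compiler name purely illustrative):

[backend-openmp]
cc = gcc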

PyFR is capable of running heterogeneously. However, the domain
decomposition must be done manually. The idea is to run one MPI rank
per GPU/OpenCL device/NUMA zone. Start by partitioning the mesh with a
suitable set of weighting factors (these need to be determined
empirically). So if we had a single CUDA GPU and two CPUs we might do:

pyfr-mesh partition 10:2:2 mymesh.pyfrm

with the first partition being given a weight of 10, and the remaining two a weight of 2; elements are apportioned roughly in proportion to these weights, so the GPU rank receives about 10/14 ≈ 71% of the elements and each CPU rank about 14%. To run the simulation do:

mpirun -np 3 ./launcher.sh mymesh.pyfrm mycfg.ini

where launcher.sh is something along the lines of:

#!/bin/bash

# Pick a backend based on the node-local MPI rank. The MV2_* variable
# is MVAPICH2-specific and is available before MPI_Init is called.
case ${MV2_COMM_WORLD_LOCAL_RANK} in
    "0" )
        BACKEND="cuda" ;;
    "1" )
        BACKEND="openmp" ;;
    "2" )
        BACKEND="openmp" ;;
esac

# Rank 0 drives the GPU; ranks 1 and 2 use the OpenMP backend.
pyfr-sim -b${BACKEND} -p run $1 $2

The above uses the ${MV2_COMM_WORLD_LOCAL_RANK} environment variable to get our node-local MPI rank before MPI_Init has been called. This is MVAPICH2-specific, although Open MPI provides a similar variable. The first rank then picks up the CUDA backend while the remaining two get the OpenMP backend.
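For reference, Open MPI exposes the analogous OMPI_COMM_WORLD_LOCAL_RANK variable, so an Open MPI version of launcher.sh need only change the case line:

case ${OMPI_COMM_WORLD_LOCAL_RANK} in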

Hope this helps.

Regards, Freddie.

Hi Pablo,

Many thanks for your interest in PyFR.

Freddie - thanks for the details below.

We will have a think and post a sensible 3D test case for benchmarking in the next few days.

Cheers

Peter

Dr Peter Vincent MSci ARCS DIC PhD
Senior Lecturer and EPSRC Early Career Fellow
Department of Aeronautics
Imperial College London
South Kensington
London
SW7 2AZ
UK

web: www.imperial.ac.uk/aeronautics/research/vincentlab
twitter: @Vincent_Lab

Thank you very much.

This is the information I was looking for.

Hi Pablo,

I have talked with Peter about a test case and I am attaching a simple 3D example.

Please find hex and tet Gmsh meshes and an attached .ini file. Please note that the largest mesh requires 19 GB of memory.

Hex mesh: cube_hex.zip - Google Drive
Tet mesh: cube_tet.zip - Google Drive

As with the examples included in PyFR, you need to convert and partition the mesh, then run it.
Convert:
pyfr-mesh convert cube_XXX_YY.msh cube_XXX_YY.pyfrm

Partition (n pieces):
pyfr-mesh partition n cube_XXX_YY.pyfrm .

Run:
mpirun -np n pyfr-sim -p run cube_XXX_YY.pyfrm config.ini

You can increase or reduce the simulation time by modifying the [soln-output] section in the config.ini file.

[soln-output]
times = range(0.0, end_time, dump)

In the given file, end_time and dump are 0.1 and 2, respectively, so the simulation will run until t = 0.1 and dump the solution twice.
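For example, to run until t = 0.5 and write out five equally spaced solutions instead (values purely illustrative):

[soln-output]
times = range(0.0, 0.5, 5)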

Regards,

Jin Seok

config.ini (1.6 KB)

Thank you very much.

It is really helpful for me.

I was able to run the coarsest hex case (17280 hex elements) with a combination of OpenMP and CUDA backends (5 MPI ranks via Open MPI) on my MacBook Pro with single precision (~610 MB of memory required) in about 6 minutes, 50 seconds. Very cool guys! Now about those curvilinear elements… :wink:

Best Regards,

Zach Davis

Hi Zach,

I was able to run the coarsest hex case (17280 hex elements) with a
combination of OpenMP and CUDA backends (5 MPI ranks via Open MPI) on my
MacBook Pro with single precision (~610 MB of memory required) in about
6 minutes, 50 seconds. Very cool guys! Now about those curvilinear
elements… :wink:

It is quite neat. The only tricky bit is figuring out the partition weighting factors, which is currently a royal pain. Hopefully future versions of PyFR will (during the first couple of minutes) have a go at automatically figuring out the optimal weights for the current set-up. All of the building blocks are there; just the plumbing is needed to make it all work.

Regards, Freddie.

Hi Freddie,

I was trying to run the same case with the OpenCL backend (I know, I know, it's a bit more tenuous) and the simulation progress reached 98% before there was a pyopencl error which killed the MPI rank I had allocated for the GPU:

(venv) [zdavis@Rahvin cubes]$ mpirun -np 5 ./launcher.sh cube_hex24.pyfrm cube.ini
 99.8% [==============================> ] 0.10/0.10 ela: 00:07:04 rem: 00:00:00
Traceback (most recent call last):
  File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr", line 38, in <module>
    main()
  File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr", line 32, in main
    args.process(args)
  File "/Users/zdavis/Applications/PyFR/pyfr/scripts/sim.py", line 87, in process_run
    args, read_pyfr_data(args.mesh), None, Inifile.load(args.cfg)
  File "/Users/zdavis/Applications/PyFR/venv/lib/python3.4/site-packages/mpmath/ctx_mp.py", line 1301, in g
    return f(*args, **kwargs)
  File "/Users/zdavis/Applications/PyFR/pyfr/scripts/sim.py", line 59, in _process_common
    solver.run()
  File "/Users/zdavis/Applications/PyFR/pyfr/integrators/base.py", line 112, in run
    solns = self.advance_to(t)
  File "/Users/zdavis/Applications/PyFR/pyfr/integrators/controllers.py", line 79, in advance_to
    idxcurr = self.step(self.tcurr, dt)
  File "/Users/zdavis/Applications/PyFR/pyfr/integrators/steppers.py", line 154, in step
    rhs(t + dt/2.0, r2, r2)
  File "/Users/zdavis/Applications/PyFR/pyfr/solvers/baseadvecdiff/system.py", line 57, in rhs
    runall([q1])
  File "/Users/zdavis/Applications/PyFR/pyfr/backends/base/backend.py", line 183, in runall
    self.queue_cls.runall(sequence)
  File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/types.py", line 114, in runall
    q._exec_nowait()
  File "/Users/zdavis/Applications/PyFR/pyfr/backends/base/types.py", line 303, in _exec_nowait
    self._exec_item(*self._items.popleft())
  File "/Users/zdavis/Applications/PyFR/pyfr/backends/base/types.py", line 288, in _exec_item
    item.run(self, *args, **kwargs)
  File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/provider.py", line 38, in run
    fun(queue.cl_queue_comp, (dims[-1],), None, *narglst)
  File "/Users/zdavis/Applications/PyFR/venv/lib/python3.4/site-packages/pyopencl/__init__.py", line 509, in kernel_call
    self.set_args(*args)
  File "/Users/zdavis/Applications/PyFR/venv/lib/python3.4/site-packages/pyopencl/__init__.py", line 549, in kernel_set_args
    self.set_arg(i, pack(arg_type_char, arg))
struct.error: required argument is not a float

Hi Zach,

I was trying to run the same case with the OpenCL backend (I know, I know, it's a bit more tenuous) and the simulation progress reached 98% before there was a pyopencl error which killed the MPI rank I had allocated for the GPU:

[same command and traceback as quoted above, ending in "struct.error: required argument is not a float", followed by:]

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

/Users/zdavis/Applications/PyFR/venv/lib/python3.4/site-packages/pytools/prefork.py:74:
UserWarning: Prefork server exiting upon apparent death of parent
  warn("%s exiting upon apparent death of %s" % (who, partner))

Any ideas where this might be coming from? Is it within PyFR, or
rather something wrong with the pyopencl package?

I have my suspicions. Around here:

https://github.com/vincentlab/PyFR/blob/develop/pyfr/backends/opencl/provider.py#L38

can you do:

try:
    fun(queue.cl_queue_comp, (dims[-1],), None, *narglst)
except:
    print([a.__class__ for a in narglst])
    raise

The plan is to output the types of all of the variables passed; from there we can figure out which "float" is not actually a float.

Regards, Freddie.

Hi Freddie,

Here’s what we have:

(venv) [zdavis@Rahvin cubes]$ mpirun -np 5 ./launcher.sh cube_hex24.pyfrm cube.ini
 99.8% [==============================> ] 0.10/0.10 ela: 00:06:55 rem: 00:00:00
[<class 'int'>, <class 'int'>, <class 'memoryview'>, <class 'pyopencl._cl.Buffer'>, <class 'int'>, <class 'pyopencl._cl.Buffer'>, <class 'int'>, <class 'pyopencl._cl.Buffer'>, <class 'int'>]

[traceback identical to the one above, now passing through provider.py line 39, and again ending in:]

struct.error: required argument is not a float

Just as I suspected: for the very last time step the type of the scalar variable t becomes a NumPy float64 as opposed to a Python float. When this happens we enter a world of pain, as NumPy data types do not get along with our argument-passing regime.
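To see the failure mode in isolation (a minimal sketch, not PyFR's actual code; the variable names are made up, and the arithmetic mirrors the rhs(t + dt/2.0, ...) call in the traceback):

import struct
import numpy as np

dt = np.float64(0.05)  # a NumPy scalar, e.g. originating from a NumPy array
t = 0.0 + dt / 2.0     # arithmetic silently promotes t to numpy.float64
print(type(t))         # <class 'numpy.float64'>, not <class 'float'>

# Coercing to a plain Python float before the value reaches the
# argument-packing layer avoids this class of problem.
struct.pack('f', float(t))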

I'll push a patch fixing this later.

Regards, Freddie.

Hi Zach,

As a follow-up I’ve submitted a pull request which resolves this issue:

https://github.com/vincentlab/PyFR/pull/40

Let me know if it works.

Regards, Freddie.


Hi Freddie,

I can verify that the changes you made in your latest commit for PR-40 resolve the issue I was running into. I can also build the documentation without having to apply the 2to3 patch. Thanks for addressing these so quickly.

I have noticed when doing these mixed-backend simulations that it can be a bit of a challenge to determine the number of parts to distribute across the GPU and CPUs in order to ensure that the GPU is fully utilized. Have you guys discovered any best practices or guidelines? I imagine it changes with the underlying hardware, the backends being used, and the elements in the model, so it isn't too hard to imagine that it's likely a bit difficult to nail down. Otherwise, looks awesome!

Best Regards,

Zach

Hi Zach,

It depends on almost everything! The plan is to have PyFR periodically look at which MPI ranks are holding up the simulation and to remove elements from these ranks (by repartitioning with different weights). After a couple of iterations we should be close to the optimum.
