Questions about OpenMP backend performance

Hi everyone

I am using Ubuntu 12.04 on a server with 64 AMD cores and an NVIDIA GT 620 (96 CUDA cores). I am a little confused about the performance of the OpenMP backend.
Here are some test results from my server; the test case is cube_tet24 from Jin Seok Park's post [Google Groups].

1. OPENCL, 64*CPU, ela:00:38:52
2. OPENCL, 96*GPU, mem object allocation failure
3. CUDA, 96*GPU, out of memory
4. OPENMP, OMP_NUM_THREADS=32, set cblas-type=serial, rem: 02:30:00
5. OPENMP, OMP_NUM_THREADS=32, set cblas-type=parallel, rem: 14:31:07
6. MPI+OPENMP, OMP_NUM_THREADS=1, serial, 32 partitions, rem: 00:50:00  

  1. CUDA is not usable because of the memory limit. Is it possible to circumvent this problem? I have 256 GB of RAM on the CPU side.
  2. How should I interpret the OpenMP results? What is the difference between parallel and serial?
  3. I thought MPI was preferable on a cluster rather than on a single server. Why does MPI+OpenMP seem faster than using OpenMP alone?
  4. Why does OpenCL seem faster than the other available configurations?

Hi,

1. CUDA is not usable because of the memory limit. Is it possible to
circumvent this problem? I have 256 GB of RAM on the CPU side.

No. Generally this is not a problem in the sense that for real-world
simulations you'll almost always be compute-bound, as opposed to
memory-bound. As a point of reference, if you fully load up an NVIDIA
K40c (12 GiB of memory) with a simulation, then to get any reasonable
statistics out of it you will probably need to run the simulation for
three weeks or more.

2. How should I interpret the OpenMP results? What is the difference
between parallel and serial?

The OpenMP results depend heavily on the configuration of your system
and what BLAS library you're using. A key point is that OpenMP only
performs well inside of a single NUMA zone.

For instance, if you have 64 AMD cores in a single system then you
probably have four sockets, each with a 16-core CPU. Each of these CPUs
will have two NUMA zones, for a total of eight NUMA zones. Therefore,
the optimal configuration is to partition the mesh into eight pieces and
run each piece with eight threads (one per core in its zone). Care is
necessary to ensure that these threads are 'pinned' to the cores of the
correct zone. Getting this right when using a combination of MPI +
OpenMP on a single system can sometimes be painful.
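To make the pinning idea concrete, here is a minimal sketch of what it can look like at the process level on Linux. The eight-zone, eight-cores-per-zone layout and the core numbering are assumptions about your topology; check the real layout with a tool such as numactl or lstopo before relying on them.

```python
import os

# Sketch only: pin the current process (and any OpenMP threads it later
# spawns) to the cores of a single NUMA zone. Assumes zones 0-7 each own
# the contiguous cores n*8 .. n*8+7, which you should verify first.
zone = 0
cores_per_zone = 8
cores = range(zone * cores_per_zone, (zone + 1) * cores_per_zone)

os.sched_setaffinity(0, cores)          # 0 means the calling process
print(sorted(os.sched_getaffinity(0)))  # confirm the new affinity mask
```

In practice an MPI launcher's own binding options achieve the same thing per rank, which is usually more convenient than doing it by hand.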

The parallel vs serial distinction depends on whether the BLAS library
you are using is multi-threaded or not. If it is multi-threaded then
you'll want to set this to be parallel, otherwise serial. The
recommendation is to use a single-threaded BLAS library (ATLAS works
best, followed by MKL, and then OpenBLAS) and let PyFR do the
parallelism as opposed to the BLAS library itself.
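If you are unsure whether the BLAS library you point PyFR at is a threaded build, one illustrative way to check is sketched below. It is not part of PyFR: it relies on the third-party threadpoolctl package, which recognises OpenBLAS, MKL and BLIS but not ATLAS, and the library path shown is only an example.

```python
# Illustrative check only: load a candidate BLAS library and report
# whether it is threaded and how many threads it will use by default.
# Requires 'pip install threadpoolctl'; adjust the path to your library.
import ctypes
from threadpoolctl import threadpool_info

ctypes.CDLL("/usr/lib/libopenblas.so")  # example path only

for pool in threadpool_info():
    if pool.get("user_api") == "blas":
        print(pool["internal_api"], "with", pool["num_threads"], "threads")
```

If this reports more than one thread, either set cblas-type to parallel or force the library down to a single thread and use serial.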

3. I thought MPI was preferable on a cluster rather than on a single
server. Why does MPI+OpenMP seem faster than using OpenMP alone?

Practically, a system with eight NUMA zones is basically eight separate
systems with cache coherency, so running one MPI rank per zone simply
treats the hardware as what it really is.

4. Why does OpenCL seem faster than the other available configurations?

It is problem- and system-specific. In my experience, when tuned
correctly the OpenMP backend should be able to outperform the OpenCL
backend at higher polynomial orders. However, it does require more work
to configure.

Regards, Freddie.

Hi, Freddie

I still have some problems with the backends.

  1. According to the output of clinfo (in the attachment), I have two OpenCL platforms. In the NVIDIA GT 620 GPU platform, clinfo.out indicates I have 96 cores and ~256 MB of RAM, but the output of nvidia-smi (below) shows 1024 MB of RAM. Is this a feature or a bug? The same phenomenon occurs on the CPU platform: clinfo.out indicates 64 cores and ~63 GB of RAM, but I actually have 256 GB of RAM.

—output of nvidia-smi—

Mon Jun 29 12:45:56 2015

[attachment: clinfo.out (6.21 KB)]

Hi,

1. According to the output of clinfo (in the attachment), I have two
OpenCL platforms. In the NVIDIA GT 620 GPU platform, clinfo.out
indicates I have 96 cores and ~256 MB of RAM, but the output of
nvidia-smi shows 1024 MB of RAM. Is this a feature or a bug? The same
phenomenon occurs on the CPU platform: clinfo.out indicates 64 cores
and ~63 GB of RAM, but I actually have 256 GB of RAM.

OpenCL limits the size of any single allocation to 1/4 of the total
memory of the device; the ~256 MB figure reported by clinfo for your
1024 MB card is this maximum single allocation, not the total device
memory. Some OpenCL implementations have environment variables which
allow a single allocation to go above this. It should not be an issue
for PyFR, which performs multiple allocations, none of which should be
greater than 1/4 of the device memory.
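If you want to see both numbers for yourself, a short sketch with pyopencl (which PyFR's OpenCL backend uses, so it should already be installed) will print the total device memory alongside the largest single allocation OpenCL will permit:

```python
# Sketch: for every OpenCL device, print the total device memory and the
# maximum size of a single allocation (typically 1/4 of the total).
import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(dev.name)
        print("  global memory :", dev.global_mem_size // 1024**2, "MiB")
        print("  max allocation:", dev.max_mem_alloc_size // 1024**2, "MiB")
```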

The result indicates the program is not clever enough to use OpenMP on
each NUMA zone, right? How can I improve this? The OpenMP-related
environment variable relevant to this problem is GOMP_CPU_AFFINITY. Is
it possible to use it to bind OpenMP to one NUMA zone?

As I indicated in my previous e-mail, this is known to be a pain. With
Intel MPI and ICC everything should 'just work'. More recent versions
of Open MPI (1.8 and later) also make this relatively painless. In
these scenarios it is important to use a single-threaded BLAS library
and to let PyFR handle the parallelism with OpenMP. Otherwise
everything becomes twice as complicated.
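To answer the specific question: yes, GOMP_CPU_AFFINITY can keep a rank's OpenMP threads inside one NUMA zone, provided each MPI rank receives its own core range before it starts. A hypothetical helper along these lines (again assuming eight zones of eight contiguous cores, which is an assumption about your topology) shows what the per-rank values would look like:

```python
# Hypothetical sketch: one GOMP_CPU_AFFINITY value per MPI rank, keeping
# rank n's OpenMP threads on the cores of NUMA zone n. Assumes 8 zones
# of 8 contiguous cores (0-63); check the real topology with numactl.
zones = 8
cores_per_zone = 8

for rank in range(zones):
    lo = rank * cores_per_zone
    hi = lo + cores_per_zone - 1
    print("rank {}: GOMP_CPU_AFFINITY={}-{}".format(rank, lo, hi))
```

Each value then needs to end up in the environment of the corresponding rank (for example via a small wrapper script passed to mpirun) before PyFR is launched.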

Regards, Freddie.