OpenCL Runtime Error

Sunday, 2 November 2014

Hi All,

I’m back with another question. I was attempting to compare and contrast the advantages and disadvantages of the various backends available. However, I seem to be encountering a runtime error with clBLAS, which I was hoping someone might be able to help interpret. I’m running PyFR v0.2.3. I’ve included the invocation and corresponding stack trace for your review. Any input you might be able to provide as to the root cause would be greatly appreciated.

pyfr-sim -n 100 -p -b opencl run couette_flow_2d.pyfrm couette_flow_2d.ini

Traceback (most recent call last):
File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 112, in <module>
main()
File "/usr/local/lib/python2.7/site-packages/mpmath/ctx_mp.py", line 1301, in g
return f(*args, **kwargs)
File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 108, in main
solver.run()
File "/Users/zdavis/Applications/PyFR/pyfr/integrators/base.py", line 114, in run
solns = self.advance_to(t)
File "/Users/zdavis/Applications/PyFR/pyfr/integrators/controllers.py", line 79, in advance_to
idxcurr = self.step(self.tcurr, dt)
File "/Users/zdavis/Applications/PyFR/pyfr/integrators/steppers.py", line 107, in step
rhs(t, r0, r1)
File "/Users/zdavis/Applications/PyFR/pyfr/solvers/baseadvecdiff/system.py", line 17, in rhs
runall([q1])
File "/Users/zdavis/Applications/PyFR/pyfr/backends/base/backend.py", line 187, in runall
self.queue_cls.runall(sequence)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/types.py", line 118, in runall
q._exec_nowait()
File "/Users/zdavis/Applications/PyFR/pyfr/backends/base/types.py", line 308, in _exec_nowait
self._exec_item(*self._items.popleft())
File "/Users/zdavis/Applications/PyFR/pyfr/backends/base/types.py", line 293, in _exec_item
item.run(self, *args, **kwargs)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/clblas.py", line 96, in run
1, qptr, 0, None, None)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/clblas.py", line 59, in _errcheck
raise RuntimeError('clBLAS')
RuntimeError: clBLAS

Best Regards,

Zach Davis

Hi Zach,

I'm back with another question. I was attempting to compare and
contrast the advantages and disadvantages of the various backends
available. However, I seem to be encountering a runtime error with
clBLAS, which I was hoping someone might be able to help interpret.
I'm running PyFR v0.2.3. I've included the invocation and
corresponding stack trace for your review. Any input you might be able
to provide as to the root cause would be greatly appreciated.

I've not yet got around to enumerating the various clBLAS status codes.
For the moment can you add "print status" before L59 of
pyfr/backends/opencl/clblas.py:

<https://github.com/vincentlab/PyFR/blob/develop/pyfr/backends/opencl/clblas.py#L59>

which will output the (non-zero) return status. Then compare this value
to those in the clblasStatus enumeration in clBLAS.h:

  <https://github.com/clMathLibraries/clBLAS/blob/master/src/clBLAS.h#L123>

and we'll know what's wrong.
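
For reference, the modified function would end up looking something
like the following (a minimal sketch; the exact signature of _errcheck
in clblas.py may differ slightly):

  # Temporary diagnostic; remove once the offending status is known
  def _errcheck(self, status, fn, args):
      print status  # raw clblasStatus value, cf. clBLAS.h
      if status != 0:
          raise RuntimeError('clBLAS')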

Regards, Freddie.

Sunday, 2 November 2014

Hi Freddie,

I added a print status statement just above the conditional statement within the _errcheck function of the clblas.py module. I invoked PyFR as before; however, I’m not seeing that a status is being printed out that would correspond to anything within the clBLAS header file as you describe. It appears that PyFR is having an issue loading clBLAS altogether. The latest stack trace follows below:

pyfr-sim -n 100 -b opencl -p run couette_flow_2d.pyfrm couette_flow_2d.ini

Traceback (most recent call last):
File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 112, in <module>
main()
File "/usr/local/lib/python2.7/site-packages/mpmath/ctx_mp.py", line 1301, in g
return f(*args, **kwargs)
File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 82, in main
backend = get_backend(args.backend, cfg)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/__init__.py", line 11, in get_backend
return subclass_where(BaseBackend, name=name.lower())(cfg)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/base.py", line 75, in __init__
self._providers = [k(self) for k in kprovs]
File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/clblas.py", line 66, in __init__
self._wrappers = ClBLASWrappers()
File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/clblas.py", line 14, in __init__
lib = load_library('clBLAS')
File "/Users/zdavis/Applications/PyFR/pyfr/ctypesutil.py", line 23, in load_library
raise OSError('Unable to load {0}'.format(name))
OSError: Unable to load clBLAS

Exception AttributeError: "'OpenCLClBLASKernels' object has no attribute '_wrappers'" in <bound method OpenCLClBLASKernels.__del__ of <pyfr.backends.opencl.clblas.OpenCLClBLASKernels object at 0x10721d590>> ignored

Hi Zach,

I added a print status statement just above the conditional
statement within the _errcheck function of the clblas.py module. I
invoked PyFR as before; however, I'm not seeing that a status is
being printed out that would correspond to anything within the clBLAS
header file as you describe. It appears that PyFR is having an issue
loading clBLAS altogether. The latest stack trace follows below:

[snip]

I have the clBLAS library and header files installed at
/usr/local/lib64 and /usr/local/include, respectively. I have this
library path added to my DYLD_LIBRARY_PATH. It appears that pyopencl
v2014.1 installed via pip correctly. Is there perhaps another
environment variable that I need to set in order for PyFR to
recognize the OpenCL and clBLAS library installations?

It is likely that if you're getting this far pyopencl is being imported
without issue. However, interestingly, the previous stack trace implied
that clBLAS was being loaded successfully. But, for whatever reason, it
is now no longer being found.

Tracking down libraries is a right pain. In PyFR this comes down to a
call of the load_library function in pyfr/ctypesutil.py with the
argument "clBLAS".

On a Mac load_library will translate this to "libclBLAS.dylib". Next we
pass this to CDLL. Hence, you can reproduce the issue at the Python
prompt by doing:

  from ctypes import CDLL
  CDLL('libclBLAS.dylib')

if this works then all is well -- the linker/ctypes can find and load
the library without issue. If this fails PyFR will try some absolute
library paths. On a Mac, by default, we will look in /opt/local/lib
(this covers the common case of macports) and so try:

  CDLL('/opt/local/lib/libclBLAS.dylib')

and see if that gets us anywhere. You can add search directories for
PyFR to try via the PYFR_LIBRARY_PATH environmental variable. So if,
for whatever reason, ctypes is not looking in /usr/local/lib64 you can do:

  export PYFR_LIBRARY_PATH="/usr/local/lib64"

and we will try:

  CDLL('/usr/local/lib64/libclBLAS.dylib')

which should work.
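
For reference, the overall lookup logic is roughly as follows (a
simplified sketch, not the exact code in pyfr/ctypesutil.py):

  import os
  from ctypes import CDLL

  def load_library(name):
      # OS X naming convention; on Linux this would be lib{0}.so
      lname = 'lib{0}.dylib'.format(name)

      # First let the dynamic linker search its default paths
      try:
          return CDLL(lname)
      except OSError:
          pass

      # Then try PYFR_LIBRARY_PATH entries plus common fallbacks
      dirs = os.environ.get('PYFR_LIBRARY_PATH', '').split(':')
      for d in [d for d in dirs if d] + ['/opt/local/lib']:
          try:
              return CDLL(os.path.join(d, lname))
          except OSError:
              pass

      raise OSError('Unable to load {0}'.format(name))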

Regards, Freddie.

Sunday, 2 November 2014

Hi Freddie,

Setting the PYFR_LIBRARY_PATH environment variable now gets me to the point where clBLAS is found. I now get a status code of -54. Stack trace follows:

pyfr-sim -n 100 -b opencl -p run couette_flow_2d.pyfrm couette_flow_2d.ini

0
-54
Traceback (most recent call last):
File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 112, in <module>
main()
File "/usr/local/lib/python2.7/site-packages/mpmath/ctx_mp.py", line 1301, in g
return f(*args, **kwargs)
File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 108, in main
solver.run()
File "/Users/zdavis/Applications/PyFR/pyfr/integrators/base.py", line 114, in run
solns = self.advance_to(t)
File "/Users/zdavis/Applications/PyFR/pyfr/integrators/controllers.py", line 79, in advance_to
idxcurr = self.step(self.tcurr, dt)
File "/Users/zdavis/Applications/PyFR/pyfr/integrators/steppers.py", line 107, in step
rhs(t, r0, r1)
File "/Users/zdavis/Applications/PyFR/pyfr/solvers/baseadvecdiff/system.py", line 17, in rhs
runall([q1])
File "/Users/zdavis/Applications/PyFR/pyfr/backends/base/backend.py", line 187, in runall
self.queue_cls.runall(sequence)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/types.py", line 118, in runall
q._exec_nowait()
File "/Users/zdavis/Applications/PyFR/pyfr/backends/base/types.py", line 308, in _exec_nowait
self._exec_item(*self._items.popleft())
File "/Users/zdavis/Applications/PyFR/pyfr/backends/base/types.py", line 293, in _exec_item
item.run(self, *args, **kwargs)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/clblas.py", line 97, in run
1, qptr, 0, None, None)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/opencl/clblas.py", line 60, in _errcheck
raise RuntimeError('clBLAS')
RuntimeError: clBLAS

Best Regards,

Zach

Hi Zach,

Hi Freddie,

Setting the PYFR_LIBRARY_PATH environment variable now gets me to the
point where clBLAS is found. I now get a status code of -54. Stack
trace follows:

Now we're getting somewhere! That error corresponds to -54 =
CL_INVALID_WORK_GROUP_SIZE. It is almost certainly an issue with clBLAS
rather than PyFR (we do not control the workgroup size in clBLAS functions).

However, there is something you can try. In the "staging" directory
where clBLAS was built there is a flaky utility called "tune". Running:

  export CLBLAS_STORAGE_PATH=`pwd`
  ./tune --gemm --float --double

should try and auto-tune clBLAS on your platform. This should include
finding kernels and work group sizes that work on your hardware. The
results from auto-tuning vary considerably across platforms: some
benefit very little while others show substantial improvements. If this
does not help then you'll probably want to file a bug report; although
the clBLAS project is not as active as one would hope.

Longer term I am looking to support ViennaCL as an alternative
matrix-multiplication provider for the PyFR OpenCL backend.
Unfortunately, the current API exposed by ViennaCL is not quite flexible
enough to make this possible.

It is highly disappointing that the wider OpenCL community has not
gotten behind clBLAS. As a consequence it only really works well on AMD
GPUs -- and even then it isn't great. Indeed, as a broader point, I find
it unlikely that OpenCL will achieve acceptance within the scientific
community until BLAS-like functionality is integrated as an optional
part of the standard. Writing decent and portable level 3 BLAS kernels
is just too difficult for one project.

Regards, Freddie.

Sunday, 2 November 2014

Hi Freddie,

I’m giving your tuning idea a go. It appears to be a somewhat slow and compute intensive process. One question I have related to this tune executable is whether this process operates on the installation or on the files compiled in the build directory (i.e. do I need to run make install again after this has completed)?

Another question I have is what to provide for the openmp backend on systems running OS X. I set the cblas-mt parameter to /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework in one of the example case input files; however, PyFR gives a RuntimeError: Unable to load cblas. It was so much easier with CUDA! Thanks for your time today.

Best Regards,

Zach

Hi Zach,

I'm giving your tuning idea a go. It appears to be a somewhat slow and
compute intensive process. One question I have related to this tune
executable is whether this process operates on the installation or on
the files compiled in the build directory (i.e. do I need to run make
install again after this has completed)?

You should not need to run make install again. The tune utility will
create a .kdb file for each GPU on the system in the directory specified
by CLBLAS_STORAGE_PATH. This variable must also be set in the shell
where you run PyFR. Currently, as far as I can tell, it only tunes GPUs
and not CPUs.

Another question I have is what to provide for the openmp backend on
systems running OS X. I set the cblas-mt parameter to
/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework
in one of the example case input files; however, PyFR gives a
RuntimeError: Unable to load cblas. It was so much easier with CUDA!
Thanks for your time today.

There are a couple of gotchas when it comes to running PyFR on OS X.
The first is that frameworks are actually directories, not shared
libraries. So what you want is actually:

/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib

In terms of BLAS libraries I would recommend using a single-threaded
BLAS library where possible and let PyFR handle the multi-threading;
such libraries are specified as cblas-st. Both ATLAS and OpenBLAS can
be built as shared libraries without thread support. Intel's MKL
library can be told to only use a single thread by setting an
environmental variable (MKL_NUM_THREADS=1). I am unsure if something
similar exists for Accelerate.
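
So, as an illustrative sketch, the configuration entry would be along
the lines of the following (using the Accelerate path from above; as
noted, a genuinely single-threaded library is preferable for cblas-st):

  [backend-openmp]
  cblas-st = /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib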

The next problem is that clang on the Mac does not support OpenMP. The
kernels generated by PyFR will therefore fail to compile. Furthermore,
clang currently does not do a particularly good job at optimising
floating point code when compared with GCC/ICC. It should therefore be
avoided.

A simple solution to this is to install a copy of GCC on your Mac.
Unfortunately, many of the builds of GCC for OS X are `buggy':

calcium:Programming freddie$ cat test.c
#include <stdio.h>

int main()
{
    printf("Hello %f\n", 3.14);
    return 0;
}
calcium:Programming freddie$ gcc-mp-4.8 -Ofast -march=native test.c
/var/folders/rs/zwdffscn1qlgntxyby6vct800000gn/T//ccyZ8fE9.s:13:no such
instruction: `vmovsd LC0(%rip), %xmm0'

where we can see that GCC 4.8 on my Mac can not compile a simple Hello
World type application successfully. The underlying reason for this is
the fact that the assembler GCC uses does not understand AVX
instructions (like 'vmovsd') but for whatever reason GCC tries to emit
them anyway.

It is therefore necessary to first get a working build of GCC (or hack
it to disable the emission of AVX instructions). Alternatively, if you
have a license for ICC this should work out of the box without issue.

Finally, when running PyFR using the OpenMP backend the recommended
environmental variables (at least when not using MPI) are:

  export OMP_PROC_BIND=true
  export OMP_NUM_THREADS=n

where n is the number of real cores on the system. If you do all of
these things performance in excess of 50% of peak FLOP/S is possible;
the backend really does perform well!

Regards, Freddie.

Monday, 3 November 2014

Hi Freddie,

I have another follow-up question for you. I was able to confirm that using the openmp backend, OS X was unable to compile the requisite kernel due to lack of clang OpenMP support as you mentioned. I found a post on Stack Overflow referencing partial OpenMP support in Xcode 6 (clang 3.5) on OS X which can be activated using the -Xclang -fopenmp=libiomp5 options (http://stackoverflow.com/questions/26159225/openmp-support-in-xcode-6-clang-3-5). The responder, Alexey Bataev, is doing this development work at Intel I believe. Would passing these options to clang++ enable the kernel to compile on OS X, or are we still left waiting for the OpenMP implementation to be fully supported?

Best Regards

Zach

Hi Zach,

I have another follow-up question for you. I was able to confirm
that using the openmp backend, OS X was unable to compile the
requisite kernel due to lack of clang OpenMP support as you
mentioned. I found a post on Stack Overflow referencing partial
OpenMP support in Xcode 6 (clang 3.5) on OS X which can be activated
using the -Xclang -fopenmp=libiomp5 options
(http://stackoverflow.com/questions/26159225/openmp-support-in-xcode-6-clang-3-5).
The responder, Alexey Bataev, is doing this development work at Intel
I believe. Would passing these options to clang++ enable the kernel
to compile on OS X, or are we still left waiting for the OpenMP
implementation to be fully supported?

I am unsure, but would be very interested to know if it fixes things or
not. The command line arguments used by PyFR can be tweaked by
modifying the list on:

<https://github.com/VincentLab/PyFR/blob/develop/pyfr/backends/openmp/compiler.py#L69>

and possibly also the command used to link the resulting object code
together on line 79.
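
As a rough sketch, the tweak would be something along these lines (the
variable names and surrounding flags here are illustrative, not the
exact contents of compiler.py):

  # Hypothetical: splice the Xcode 6 OpenMP flags into the compile
  # command in place of the usual -fopenmp
  cmd = [cc, '-std=c99', '-O3', '-fPIC', '-shared',
         '-Xclang', '-fopenmp=libiomp5',
         srcname, '-o', libname]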

Let me know if this gets things working or not.

Regards, Freddie.

Hi Freddie,

It appears that didn’t work—PyFR complains about being unable to find the OpenMP header file. Looking at the compiler.py file in ${PYFR_ROOT}/pyfr/backends/openmp/ on line 55 you are using the value of cc to get the path of the compiler to be used. Unfortunately, on OS X this is a symbolic link to clang.

Is there an environment variable that PyFR supports that will allow you to change which C compiler is used? The shell environment variable CC appears to be ignored, so I was hoping there might be an alternative way to explicitly specify the compiler PyFR uses. I have gcc-4.9 (4.9.2), and have built an OpenMP compatible version of clang (which I’ve named clang-omp) to test; however, I can’t figure out how to direct PyFR to use either of those rather than the cc symbolic link (which points to the clang bundled with Apple’s Xcode command line tools).

Note, you also outlined an example of getting the copy of gcc-4.8 installed on your Mac to compile a simple Hello World example. I believe if you replaced the -march=native option with something like -msse4.2 or -mtune=native, then the code snippet compiles without the error. Although I’m not certain that is relevant to what you were pointing out.

Best Regards,

Zach

The compiler used by PyFR can be changed in the configuration file. For
example on my Linux system I have:

[backend-openmp]
cc = gcc-4.8.3

if you are going to be experimenting you might want to put

[backend-openmp]
cc = ${CC}

and then you can simply export CC in your shell to be your desired
compiler. We do not currently support expansions such as:

[backend-openmp]
cc = gcc -fsomething

the 'cc' field must be an executable. Similarly, we do not -- currently
-- permit one to append arguments to the compiler invocation. (Although
this can be trivially accomplished with a one-line shell script should
any user require this feature.)
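
For example, a hypothetical wrapper saved somewhere on your PATH, made
executable and named in the 'cc' field:

  #!/bin/sh
  exec gcc-4.9 "$@" -fsomething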

With regards to GCC on the Mac, yes it is -march=native that is causing
the trouble. I would, however, rather that compilers not try and emit
assembly instructions which they know can not be assembled on the
current system!

Regards, Freddie.

Monday, 3 November 2014

Hi Freddie,

I’m still looking into the OpenCL backend, but I think I was finally able to get the OpenMP backend up and running under OS X. I’ve collected a few relative performance benchmark results that perhaps some in the community might be interested in. Ideally, I would like to apply this test to see similar results for both CUDA and OpenCL backends.

My test system was a modest 2.4 GHz (i7-3630QM) quad-core Intel Core i7 Ivy Bridge processor with 16GB 1600 MHz DDR3 RAM running OS X v10.10. I modified the compiler flags in ${PYFR_ROOT}/pyfr/backends/openmp/compiler.py, replacing the -march=native option with -mtune=native. This change isn’t necessary for the clang-omp compiler, but was necessary for the gcc-4.9 compiler I tried. To keep things consistent, I left that option changed across compilers. I took the fastest run in my test matrix (i.e. the very last case) and re-ran the test while reverting the -mtune=native option back to -march=native and observed no change in runtime.

I ran the couette_flow_2d example case using a single partition initiating pyfr-sim as follows:

pyfr-sim -p -n 100 -b openmp run couette_flow_2d.pyfrm couette_flow_2d.ini

The results for the openmp backend tests follow:



Backend                         | Compiler  | Environment       | Time
--------------------------------|-----------|-------------------|--------
cblas-mt = Accelerate Framework | gcc-4.9   | OMP_NUM_THREADS=4 | 07m 48s
cblas-st = Accelerate Framework | gcc-4.9   | OMP_NUM_THREADS=4 | 10m 52s
cblas-mt = Accelerate Framework | gcc-4.9   | OMP_NUM_THREADS=8 | 12m 20s
cblas-st = OpenBLAS 0.2.12      | gcc-4.9   | OMP_NUM_THREADS=4 | 11m 01s
cblas-mt = OpenBLAS 0.2.12      | gcc-4.9   | OMP_NUM_THREADS=4 | 07m 46s
cblas-mt = Accelerate Framework | clang-omp | OMP_NUM_THREADS=4 | 04m 24s
cblas-st = Accelerate Framework | clang-omp | OMP_NUM_THREADS=4 | 04m 23s
cblas-mt = OpenBLAS 0.2.12      | clang-omp | OMP_NUM_THREADS=4 | 04m 12s
cblas-st = OpenBLAS 0.2.12      | clang-omp | OMP_NUM_THREADS=4 | 04m 10s

The third case run shows that hyperthreading is a no-no, as I’m sure you’re already aware. I was actually surprised that Apple’s Accelerate Framework was less performant than OpenBLAS, and I’ve convinced myself that gcc-4.9 (v4.9.2) is garbage. To install an OpenMP version of clang I used Homebrew and this brew recipe (https://github.com/Homebrew/homebrew/pull/33278). I also had to compile and install Intel’s OpenMP Runtime Library (https://www.openmprtl.org/download#stable-releases). I downloaded the version listed at the top of the table (Version 20140926), unpacked it, and invoked make with make compiler=clang. Next, I moved the *.dylib and *.h files to their respective lib and include directories under /usr/local. Lastly, I set C_INCLUDE_PATH and CPLUS_INCLUDE_PATH to include /usr/local/include and DYLD_LIBRARY_PATH to include /usr/local/lib.

Now something has recently changed with either pycuda under OS X or PyFR, because initiating a similar test using the cuda backend results in the following traceback:

pyfr-sim -p -n 100 -b cuda run couette_flow_2d.pyfrm couette_flow_2d.ini

Traceback (most recent call last):
File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 112, in <module>
main()
File "/usr/local/lib/python2.7/site-packages/mpmath/ctx_mp.py", line 1301, in g
return f(*args, **kwargs)
File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 82, in main
backend = get_backend(args.backend, cfg)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/__init__.py", line 11, in get_backend
return subclass_where(BaseBackend, name=name.lower())(cfg)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/cuda/base.py", line 33, in __init__
from pycuda.autoinit import context
File "/usr/local/lib/python2.7/site-packages/pycuda/autoinit.py", line 4, in <module>
cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no device

I remember when first installing and running PyFR (~v0.2) this worked just fine using the default backend. I’m curious what has changed.

Best Regards,

Zach

Hi Zach,

Many thanks for all this, and your continued interest in the project!

A couple of general points:

1.) If you are interested in comparisons between different backends, you may want to check this out: http://arxiv.org/abs/1409.0405

2.) When looking at absolute (and even relative) performance of different backends, the very small 2D example test cases are somewhat pathological; the matrices that end up being repeatedly multiplied together are not very big, and hence are unlikely to get a good fraction of peak out of dgemm.

3.) Regarding failure of the CUDA backend. What GPU and version of CUDA do you have on the Mac?

Cheers

Peter

Hi Zach,

I’m still looking into the OpenCL backend, but I think I was finally
able to get the OpenMP backend up and running under OS X. I’ve collected
a few *relative* performance benchmark results that perhaps some in the
community might be interested in. Ideally, I would like to apply this
test to see similar results for both CUDA and OpenCL backends.

My test system was a modest 2.4 GHz (i7-3630QM) quad-core Intel Core i7
Ivy Bridge processor with 16GB 1600 MHz DDR3 RAM running OS X v10.10. I
modified the compiler flags in
${PYFR_ROOT}/pyfr/backends/openmp/compiler.py, replacing the
-march=native option with -mtune=native. This change isn’t necessary for
the clang-omp compiler, but was necessary for the gcc-4.9 compiler I
tried. To keep things consistent, I left that option changed across
compilers. I took the fastest run in my test matrix (i.e. the very last
case) and re-ran the test while reverting the -mtune=native option back
to -march=native and observed no change in runtime.

I ran the couette_flow_2d example case using a single partition
initiating pyfr-sim as follows:

pyfr-sim -p -n 100 -b openmp run couette_flow_2d.pyfrm couette_flow_2d.ini

As Peter noted, the Couette flow case isn't great for benchmarking
purposes. It has relatively few elements, all of which are 2D. As a
consequence overheads (from Python, starting/joining threads, BLAS) are
all very high relative to the runtime. Further, on many systems the
entire problem is able to fit within the last level cache of the CPU.
This can distort the numbers somewhat as kernels which are usually
memory bandwidth bound suddenly become FLOP bound.

As a starting point, ~2500 third order hexahedral elements are usually
sufficient to amortise away any overheads, and this is reasonably
realistic in terms of the loading per CPU/GPU for a real-world simulation.

The results for the openmp backend tests follow:

[snip]

The third case run shows that hyperthreading is a no-no, as I’m sure
you’re already aware. I was actually surprised that Apple’s Accelerate
Framework was less performant than OpenBLAS, and I’ve convinced myself
that gcc-4.9 (v4.9.2) is garbage. To install an OpenMP version of clang
I used Homebrew and this brew recipe
<https://github.com/Homebrew/homebrew/pull/33278>. I also had to compile
and install Intel’s OpenMP Runtime Library
<https://www.openmprtl.org/download#stable-releases>. I downloaded the
version listed at the top of the table (Version 20140926), unpacked it,
and invoked make with make compiler=clang. Next, I moved the
*.dylib and *.h files to their respective lib and include directories
under /usr/local. Lastly, I set C_INCLUDE_PATH and CPLUS_INCLUDE_PATH
to include /usr/local/include and DYLD_LIBRARY_PATH to include
/usr/local/lib.

An important thing to consider when switching between cblas-st and
cblas-mt is what the underlying BLAS library is configured to do.
Passing a multi-threaded BLAS library to cblas-st will result in an
over-subscription of cores (the N BLAS calls will all themselves try to
launch N threads).

With OpenBLAS it is relatively simple to disable threading when the
library is compiled. If compiling multi-threaded OpenBLAS there is a
choice between using OpenMP and its own threading code.
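
If memory serves, the relevant OpenBLAS build-time options are along
the following lines (check Makefile.rule for your version):

  make USE_THREAD=0               # single-threaded build, for cblas-st
  make USE_THREAD=1 USE_OPENMP=1  # OpenMP-threaded build, for cblas-mt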

I am unsure if it is possible to get Accelerate not to multi-thread on
its own.

With regards to the performance of GCC 4.9, it would be interesting to
see if that carries forward when only a single thread is used (thus
eliminating OpenMP library overheads and just comparing the quality of
the produced code).

Now something has recently changed with either pycuda under OS X or
PyFR, because initiating a similar test using the cuda backend results
in the following traceback:

pyfr-sim -p -n 100 -b cuda run couette_flow_2d.pyfrm couette_flow_2d.ini

Traceback (most recent call last): File
"/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 112, in
<module> main() File
"/usr/local/lib/python2.7/site-packages/mpmath/ctx_mp.py", line 1301, in
g return f(*args, **kwargs) File
"/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 82, in
main backend = get_backend(args.backend, cfg) File
"/Users/zdavis/Applications/PyFR/pyfr/backends/__init__.py", line 11, in
get_backend return subclass_where(BaseBackend, name=name.lower())(cfg)
File "/Users/zdavis/Applications/PyFR/pyfr/backends/cuda/base.py", line
33, in __init__ from pycuda.autoinit import context File
"/usr/local/lib/python2.7/site-packages/pycuda/autoinit.py", line 4, in
<module> cuda.init() pycuda._driver.RuntimeError: cuInit failed: no device

I remember when first installing and running PyFR (~v0.2) this worked
just fine using the default backend. I’m curious what has changed.

At a Python prompt try:

  import pycuda.autoinit

and let us know what the outcome is. I believe some people are having
issues with CUDA/PyCUDA on Mac OS 10.10.

Regards, Freddie.

Tuesday, 4 November 2014

Hi Freddie & Peter,

Thanks for your input. It’s uncovering nuances like these that makes PyFR all the more familiar, so thanks for highlighting where my experiment may have gone astray. Peter, I took a look at the paper you provided when it was first announced, so I’m familiar with what sort of comparable performance to expect—I guess I was hoping to realize these same trends myself as a means to get more acquainted with PyFR. Your collective input and experience have helped in that regard.

With regards to pycuda on OS X 10.10, importing the pycuda.autoinit module gives me the following stack trace:

Python 2.7.8 (default, Oct 17 2014, 18:21:39)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda.autoinit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pycuda/autoinit.py", line 4, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no device

This is with a meager NVIDIA GT 650M (1024 MB VRAM) card, which is why I resorted to using the simple 2D examples. I’ve got a Tesla K5000 in a workstation right next to me, so perhaps I’ll use it for some more testing, though I was specifically interested in setting up and running PyFR under OS X, which I realize is probably a very niche use case. I was particularly interested in the OpenMP backend, because I can imagine someone may have a model that wouldn’t fit on the available memory of the GPUs we provide currently, so better understanding the performance trade-off of running on a GPU cluster as opposed to a more traditional cluster of CPUs was worthwhile to me.

Best Regards,

Zach Davis

Hi Zach,

Thanks for your input. It’s uncovering nuances like these that makes
PyFR all the more familiar, so thanks for highlighting where my
experiment may have gone astray. Peter, I took a look at the paper you
provided when it was first announced, so I’m familiar with what sort of
comparable performance to expect—I guess I was hoping to realize these
same trends myself as a means to get more acquainted with PyFR. Your
collective input and experience have helped in that regard.

With regards to pycuda on OS X 10.10, importing the pycuda.autoinit
module gives me the following stack trace:

Python 2.7.8 (default, Oct 17 2014, 18:21:39)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

import pycuda.autoinit

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pycuda/autoinit.py", line
4, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no device

This suggests that either PyCUDA or CUDA is not set up correctly. If
regular CUDA applications written in C/C++ work without issue then you
should try recompiling PyCUDA.

This is with a meager NVIDIA GT 650M (1024 MB VRAM) card, which is why
I resorted to using the simple 2D examples. I’ve got a Tesla K5000 in a
workstation right next to me, so perhaps I’ll use it for some more
testing, though I was specifically interested in setting up and running
PyFR under OS X, which I realize is probably a very niche use case. I
was particularly interested in the OpenMP backend, because I can imagine
someone may have a model that wouldn’t fit on the available memory of
the GPUs we provide currently, so better understanding the performance
trade-off of running on a GPU cluster as opposed to a more traditional
cluster of CPUs was worthwhile to me.

If you switch from double to single precision the card should be quite
capable. The 1024 MiB of memory behaves more like 2048 MiB and the
FLOP/S increase by a factor of ~24 or so. I expect this is enough to
run some interesting 3D cases (even if they're not scale resolving).
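
If I recall the configuration layout correctly this is just the
precision option in the [backend] section of your .ini file:

  [backend]
  precision = single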

Regards, Freddie.

Hi Freddie,

You’re right—user error. The CUDA installation I had was 6.0.48. Clicking the “Check For Updates” button within the CUDA System Preferences pane reported that no updates were available. Looking at NVIDIA’s website, I noticed version 6.5 had been released. Installing it and re-installing pycuda resolved the issue. I’ll work on a larger model to make comparisons while investigating the OpenCL issue further. Thanks again for all of your expertise and feedback. I appreciate all the great work you and your team are doing.

Best Regards,

Zach