RocBLASInvalidValue

Hi,

I am trying to run the test case using PyFR 2.0.3 with ROCm 6.0.3, but I get the following error:

Traceback (most recent call last):
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/backends/hip/rocblas.py", line 121, in mul
    algo, dt = self._mul_cache[ckey]
               ~~~~~~~~~~~~~~~^^^^^^
KeyError: (<class 'numpy.float64'>, 1.0, 0.0, 768, 16, 16, 768, 64, 768)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/bin/pyfr", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/__main__.py", line 124, in main
    args.process(args)
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/__main__.py", line 258, in process_run
    _process_common(
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/__main__.py", line 243, in _process_common
    solver = get_solver(backend, rallocs, mesh, soln, cfg)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/solvers/__init__.py", line 14, in get_solver
    return get_integrator(backend, systemcls, rallocs, mesh, initsoln, cfg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/integrators/__init__.py", line 34, in get_integrator
    return integrator(backend, systemcls, rallocs, mesh, initsoln, cfg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/integrators/dual/phys/controllers.py", line 6, in __init__
    super().__init__(*args, **kwargs)
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/integrators/dual/phys/steppers.py", line 14, in __init__
    super().__init__(*args, **kwargs)
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/integrators/dual/phys/base.py", line 23, in __init__
    self.pseudointegrator.commit()
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/integrators/dual/pseudo/multip.py", line 127, in commit
    s.system.commit()
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/solvers/base/system.py", line 67, in commit
    self._gen_kernels(self.nregs, self.ele_map.values(), self._int_inters,
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/solvers/base/system.py", line 202, in _gen_kernels
    kern = kgetter(i)
           ^^^^^^^^^^
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/solvers/baseadvec/elements.py", line 74, in <lambda>
    kernels['disu'] = lambda uin: self._be.kernel(
                                  ^^^^^^^^^^^^^^^^
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/backends/base/backend.py", line 196, in kernel
    kern = kern_meth(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/backends/hip/rocblas.py", line 143, in mul
    dt = self._benchmark(gemm)
         ^^^^^^^^^^^^^^^^^^^^^
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/backends/hip/provider.py", line 47, in _benchmark
    kfunc(stream)
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/backends/hip/rocblas.py", line 114, in gemm
    w.rocblas_gemm_ex(
  File "/pfs/lustrep2/projappl/project_465000921/zhenyang/ENV_rocm6/pyfr203_rocm6_lintrip/lib/python3.11/site-packages/pyfr/ctypesutil.py", line 37, in _errcheck
    raise self._statuses[status]
pyfr.backends.hip.rocblas.RocBLASInvalidValue
srun: error: nid007964: task 0: Exited with exit code 1
srun: Terminating StepId=8009219.0

The cluster was updated recently, so I wonder whether the problem lies with the cluster.

Regards,
Zhenyang

This is quite likely. Is there any chance you can swap out rocBLAS versions?

Regards, Freddie.

Hi,

I double-checked with the cluster admin: the rocBLAS version is 4.0.0. I also quote part of his reply below:

solution_index – [in] [int32_t] if algo is rocblas_gemm_algo_solution_index,
this controls which solution is used. When algo is not
rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default
solution is used. This parameter was unused in previous releases and instead
always used the default solution

 

Now, if I edit the code and change the value of the "algo" parameter, i.e.,
change this piece of code:

w.rocblas_gemm_ex(
    h, opA, opB, m, n, k, byref(alpha_ct), A, rtype, A.leaddim, B,
    rtype, B.leaddim, byref(beta_ct), C, rtype, C.leaddim, C,
    rtype, C.leaddim, rtype, w.GEMM_ALGO_SOLUTION_INDEX, algo, 0
)

to this

w.rocblas_gemm_ex(
    h, opA, opB, m, n, k, byref(alpha_ct), A, rtype, A.leaddim, B,
    rtype, B.leaddim, byref(beta_ct), C, rtype, C.leaddim, C,
    rtype, C.leaddim, rtype, w.GEMM_ALGO_SOLUTION_INDEX, 0, 0
)

PyFR runs.

Could you help check this?

Regards,
Zhenyang

This change effectively breaks the auto-tuning in our rocBLAS integration, which asks for a bunch of GEMM algorithms and then uses the best one. It forces us to always use the default algorithm, which often performs poorly.
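
To illustrate what is lost, here is a minimal sketch of this kind of auto-tuning. It is not the PyFR implementation: the names autotune, benchmark, fake_timings and the candidate indices are hypothetical, and the timings are made up so that the example runs without a GPU.

```python
def autotune(key, candidates, benchmark, cache):
    """Pick the fastest candidate for `key`, benchmarking only on a cache miss.

    key        -- hashable description of the GEMM (dtype, sizes, strides, ...)
    candidates -- iterable of solution indices to try (hypothetical values)
    benchmark  -- callable mapping a candidate to its measured run time
    cache      -- dict mapping key -> (best candidate, best time)
    """
    try:
        return cache[key]
    except KeyError:
        pass

    # Time every candidate and keep the one with the smallest run time
    timings = [(benchmark(c), c) for c in candidates]
    if not timings:
        raise ValueError('no candidate algorithms to benchmark')

    dt, best = min(timings)
    cache[key] = (best, dt)
    return cache[key]


# Made-up timings (seconds) standing in for real GEMM benchmarks
fake_timings = {0: 5.0, 7: 1.2, 9: 2.5}
cache = {}
key = ('float64', 768, 16, 16)

tuned = autotune(key, [0, 7, 9], fake_timings.get, cache)

# Restricting the candidates to index 0 mimics always using the default
default_only = autotune(key, [0], fake_timings.get, {})
```

With these made-up numbers the tuned choice is candidate 7, whereas restricting the search to index 0 leaves only the default, which here is the slowest of the three.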

The error is likely in rocBLAS (after all, it is telling us what the valid values of algo should be and then complaining when we try one of them).

Regards, Freddie.
