OpenMP gradient fusion

Hi,

what is the reason the OpenMP backend doesn’t do gradient fusion? Is it purely an optimization, or is there anything else one should be aware of?

Best

The OpenMP backend has cache blocking, which is a vastly more effective technique for reducing memory bandwidth requirements. Unfortunately, GPUs do not have big enough caches for blocking to make sense, so there we apply gradient fusion to save a little bandwidth. It does slightly increase memory consumption, however.
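To make this concrete, here is a minimal NumPy sketch (illustrative only, not PyFR code) of why blocking saves bandwidth: running the stages block-by-block keeps the intermediate cache resident instead of round-tripping it through main memory.

```python
import numpy as np

# Two stages that must run back-to-back over a large array:
#   tmp = f(u); out = g(tmp)

def unblocked(u, f, g):
    # The full-size intermediate tmp is written to main memory
    # and then immediately read back in by g
    tmp = f(u)
    return g(tmp)

def cache_blocked(u, f, g, nblk=4096):
    # Process in blocks small enough to fit in cache: each block of
    # tmp is consumed by g before it is evicted, so the full-size
    # intermediate never touches main memory
    out = np.empty_like(u)
    for i in range(0, u.size, nblk):
        out[i:i + nblk] = g(f(u[i:i + nblk]))
    return out

u = np.random.rand(1 << 20)
f = lambda x: 2*x + 1   # stand-ins for real kernels
g = lambda x: x*x
assert np.allclose(unblocked(u, f, g), cache_blocked(u, f, g))
```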

Regards, Freddie.

Thanks!
So if I force it to do gradient fusion I will still get correct results, just less efficiently. Can you confirm?

Best

Yes, the cache blocking code specifically applies an alternative grouping when gradient fusion is enabled to ensure correct results.

Regards, Freddie.

Thanks, where does this happen?
I suppose it is in the ._group method of the graph, but the substitution mechanism is a bit unclear to me. Could you elaborate more on that?

Best,

The relevant check is here:

Regards, Freddie.

Thanks,

There, for example, could you elaborate more on the logic?
There is a list of tuples; the first element of each tuple, e.g. ks[0], refers to a scheduled kernel on the backend, but what is the second element? What is the role of 'out', 'gradu', etc.? And how does the i-th tuple relate to the next ones?

Each list corresponds to a single backend matrix. The line:

[(ks[3], 'f'), (ks[8], 'b')]

says that argument f of kernel ks[3] is the same as argument b of ks[8] and that we wish to eliminate intermediate writes to that array. (Or, in physical terms: don’t bother writing the entire flux out to memory only to immediately read it back in to compute its divergence; rather, evaluate part of the flux into a temporary array and then immediately compute its divergence.)

As you can see, the reuse chains can be quite extensive, and this can save massive amounts of memory and memory bandwidth.
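As a rough sketch of how to read such a chain (the names below are illustrative stand-ins, not PyFR’s actual data structures): every (kernel, argument) pair in a list binds to the same backend matrix, so consecutive pairs form producer/consumer links whose intermediate writes can be redirected to a small temporary.

```python
# Illustrative stand-ins, not PyFR's actual data structures
chain = [('ks[3]', 'f'), ('ks[8]', 'b')]

# Walk consecutive pairs: each link says that the producer's output
# argument aliases the consumer's input argument, so the array in
# between can be replaced by a cache-resident temporary
for (pk, pa), (ck, ca) in zip(chain, chain[1:]):
    print(f"argument '{pa}' of {pk} aliases argument '{ca}' of {ck}")
```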

Regards, Freddie.


Thanks a lot, it is much clearer now!
Two more things.
First: what does the argument 'b' refer to in the matmul?

And second:

[(ks[0], 'out'), (ks[1], 'out'), (ks[3], 'gradu'), (ks[4], 'b')],

When there are multiple tuples in the chain, is the chaining continuous? E.g. is 'out' from ks[0] the same as 'out' from ks[1], which is then the same as 'gradu' from ks[3] and then 'b' from ks[4]?

Best

For the multiplication it is:

so b is the input.
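In NumPy terms (an assumed analogy rather than the actual PyFR kernel signature) the roles would be:

```python
import numpy as np

A = np.random.rand(8, 8)      # constant operator matrix
b = np.random.rand(8, 1024)   # input state: the 'b' argument
out = A @ b                   # result: the 'out' argument
```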

Yes, the chain is continuous. As I said, you can get a lot of reuse with cache blocking (which is why it is so effective at reducing bandwidth requirements).
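To spell that out with the example above (illustrative names again): every entry in the list binds to one and the same backend matrix, not merely to its neighbour.

```python
chain = [('ks[0]', 'out'), ('ks[1]', 'out'),
         ('ks[3]', 'gradu'), ('ks[4]', 'b')]

# One shared backend matrix for the entire chain
shared = 'backend_matrix_0'
bindings = {entry: shared for entry in chain}

# out of ks[0] == out of ks[1] == gradu of ks[3] == b of ks[4]
assert len(set(bindings.values())) == 1
```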

Regards, Freddie.