OpenMP gradient fusion

Hi,

what is the reason the OpenMP backend doesn’t do gradient fusion? Is it purely an optimization, or is there anything else one should be aware of?

Best

The OpenMP backend has cache blocking, which is a vastly more effective technique for reducing memory bandwidth requirements. Unfortunately, GPUs do not have big enough caches for blocking to make sense, so there we apply gradient fusion to save a little bandwidth. It does slightly increase memory consumption, however.
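To make this concrete, here is a minimal NumPy sketch (illustrative only, not PyFR code) of why blocking saves bandwidth: running the stages block-by-block keeps the intermediate cache resident instead of round-tripping it through main memory.

```python
import numpy as np

# Two stages that must run back-to-back over a large array:
#   tmp = f(u); out = g(tmp)

def unblocked(u, f, g):
    # The full-size intermediate tmp is written to main memory
    # and then immediately read back in by g
    tmp = f(u)
    return g(tmp)

def cache_blocked(u, f, g, nblk=4096):
    # Process in blocks small enough to fit in cache: each block of
    # tmp is consumed by g before it is evicted, so the full-size
    # intermediate never touches main memory
    out = np.empty_like(u)
    for i in range(0, u.size, nblk):
        out[i:i + nblk] = g(f(u[i:i + nblk]))
    return out

u = np.random.rand(1 << 20)
f = lambda x: 2*x + 1   # stand-ins for real kernels
g = lambda x: x*x
assert np.allclose(unblocked(u, f, g), cache_blocked(u, f, g))
```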

Regards, Freddie.

Thanks!
So if I force it to do gradient fusion I will still get correct results, just less efficiently. Can you confirm?

Best

Yes, the cache blocking code specifically applies an alternative grouping when gradient fusion is enabled to ensure correct results.

Regards, Freddie.

Thanks, where does this happen?
I suppose it is in the ._group method of the graph, but the substitution mechanism is a bit unclear to me. Could you elaborate more on that?

Best,

The relevant check is here:

Regards, Freddie.

Thanks,

There, for example, could you elaborate more on the logic?
There is a list of tuples; the first element of each tuple, e.g. ks[0], refers to a scheduled kernel on the backend, but what is the second element? What is the role of 'out', 'gradu', etc.? And how does the i-th tuple relate to the next ones?

Each list corresponds to a single backend matrix. The line:

[(ks[3], 'f'), (ks[8], 'b')]

says that argument f of kernel ks[3] is the same as argument b of ks[8] and that we wish to eliminate intermediate writes to that array. (Or, in physical terms: don’t bother writing the entire flux out to memory only to immediately read it back in to compute its divergence; rather, evaluate part of the flux into a temporary array and then immediately compute its divergence.)

As you can see, the reuse chains can be quite extensive, and this can save massive amounts of memory and memory bandwidth.
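As a rough sketch of how to read such a chain (the names below are illustrative stand-ins, not PyFR’s actual data structures): every (kernel, argument) pair in a list binds to the same backend matrix, so consecutive pairs form producer/consumer links whose intermediate writes can be redirected to a small temporary.

```python
# Illustrative stand-ins, not PyFR's actual data structures
chain = [('ks[3]', 'f'), ('ks[8]', 'b')]

# Walk consecutive pairs: each link says that the producer's output
# argument aliases the consumer's input argument, so the array in
# between can be replaced by a cache-resident temporary
for (pk, pa), (ck, ca) in zip(chain, chain[1:]):
    print(f"argument '{pa}' of {pk} aliases argument '{ca}' of {ck}")
```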

Regards, Freddie.


Thanks a lot, it is much clearer now!
Two more things.
First: what does the argument 'b' refer to in the matmul?

And second:

[(ks[0], 'out'), (ks[1], 'out'), (ks[3], 'gradu'), (ks[4], 'b')],

When there are multiple tuples in the chain, is the chaining continuous? E.g. is 'out' from ks[0] the same as 'out' from ks[1], which is then the same as 'gradu' from ks[3] and then 'b' from ks[4]?

Best

For the multiplication it is:

so b is the input.
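In NumPy terms (an assumed analogy rather than the actual PyFR kernel signature) the roles would be:

```python
import numpy as np

A = np.random.rand(8, 8)      # constant operator matrix
b = np.random.rand(8, 1024)   # input state: the 'b' argument
out = A @ b                   # result: the 'out' argument
```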

Yes, the chain is continuous. As I said, you can get a lot of reuse with cache blocking (which is why it is so effective at reducing bandwidth requirements).
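To spell that out with the example above (illustrative names again): every entry in the list binds to one and the same backend matrix, not merely to its neighbour.

```python
chain = [('ks[0]', 'out'), ('ks[1]', 'out'),
         ('ks[3]', 'gradu'), ('ks[4]', 'b')]

# One shared backend matrix for the entire chain
shared = 'backend_matrix_0'
bindings = {entry: shared for entry in chain}

# out of ks[0] == out of ks[1] == gradu of ks[3] == b of ks[4]
assert len(set(bindings.values())) == 1
```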

Regards, Freddie.