a clarification regarding the element index in the reduction kernel.
Specifically for OpenMP backend I can access the flattened index associated to a variable at a given quadrature point in a given element as: idx = _y*BLK_SZ*${ncola} + ib*BLK_SZ*${ncola}*nrow + X_IDX_AOSOA(${i}, ${ncola});but I have the need of accessing a value of an array that has just nrow = 1 and ncola = 1 (1 variable).
Is therefore correct to extract the associated element index as: idx = ib*BLK_SZ*${ncola}*nrow + X_IDX_AOSOA(0, ${ncola});
and for CUDA backend I have a flattened index as: ixdtype_t idx = j*ldim + SOA_IX(i, blockIdx.y, gridDim.y);
hence the element index should be
ixdtype_t i = ixdtype_t(blockIdx.x)*blockDim.x + tid;
I do not believe the reduction code has ever been tested on a matrix with a 2D ioshape. It is possible that some indices (sizes) are being picked up incorrectly. Does anything change if you go for an ioshape of (1, 1, 52547)?
The reduction code is reimplemented by each backend on its own. Thus, the implementations are all slightly different and may behave differently with non-standard (i.e., 2D) matrices. If you use matrices with a 3D ioshape (which is the common case) everything should work.