Good to hear!
The other way to do this is just do what you were doing but copy things around to put all the multiplication n’ such into big matvecs:
row_vector[ikr] row_kernel = kernel';
matrix[ikr, nr - kl1] Xtmp;
for(j in 1:nc) {
for(i in 1:(nr - kl1)) {
Xtmp[,i] = X[i:(i + kl1), j]
}
Y[, j] = row_kernel * Xtmp;
}
Dunno if it’d perform any better/worse. Good luck!