Discrete convolution by direct summation - can this be made more efficient?

Good to hear!

The other way to do this is just do what you were doing but copy things around to put all the multiplication n’ such into big matvecs:

row_vector[ikr] row_kernel = kernel';
matrix[ikr, nr - kl1] Xtmp;
for(j in 1:nc) {
  for(i in 1:(nr - kl1)) {
    Xtmp[,i] = X[i:(i + kl1), j]
  }

  Y[, j] = row_kernel * Xtmp;
}

Dunno if it’d perform any better/worse. Good luck!