Discrete convolution by direct summation - can this be made more efficient?

I have a function that does essentially the same thing as R’s filter() with method=“convolution” (I think there’s a very similar function in scipy). Here’s the function:

functions {
  matrix convolve(matrix X, vector kernel) {
    int nr = rows(X);
    int nc = cols(X);
    int kl = rows(kernel);
    int kl1 = kl-1;
    matrix[nr-kl1, nc] Y;
    for (j in 1:nc) {
      for (i in 1:(nr-kl1)) {
        Y[i, j] = dot_product(X[i:(i+kl1), j], kernel);
      }
    }
    return Y;
  }
}
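For comparison with the scipy remark above, here is a sketch of the same "valid"-mode sliding dot product in numpy (illustrative only; note that, like the Stan code, it does not flip the kernel, so strictly it is cross-correlation, which coincides with convolution for the symmetric kernel used below):

```python
import numpy as np

def convolve_cols(X, kernel):
    # Column-wise sliding dot product, mirroring the Stan function:
    # Y[i, j] = dot(X[i:i+kl, j], kernel); output has rows(X) - kl + 1 rows.
    return np.column_stack(
        [np.correlate(X[:, j], kernel, mode="valid") for j in range(X.shape[1])]
    )

# Same shapes as the benchmark below (illustrative data, not the real model input).
rng = np.random.default_rng(1234)
X = rng.normal(size=(3000, 40))
kernel = np.concatenate([np.arange(1.0, 7.0), np.arange(5.0, 0.0, -1.0)])
kernel /= kernel.sum()
Y = convolve_cols(X, kernel)  # shape (2990, 40)
```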

And here is a comparison to filter():

> require(rstan)
> require(rbenchmark)
> 
> kernel <- c(1:6,5:1)
> kernel <- kernel/sum(kernel)
> set.seed(1234)
> X <- matrix(rnorm(3000*40),3000, 40)
> npvd <- stanc("npvd.stan")

starting worker pid=6184 on localhost:11959 at 18:14:19.897
> expose_stan_functions(npvd)
> 
> benchmark(Y1 <- convolve(X, kernel), Y2 <- filter(X, kernel, method="convolution"),
+           replications=1000)
                                             test replications
1                       Y1 <- convolve(X, kernel)         1000
2 Y2 <- filter(X, kernel, method = "convolution")         1000
  elapsed relative user.self sys.self user.child sys.child
1   5.990    1.099     5.951    0.032          0         0
2   5.451    1.000     5.427    0.013          0         0
> all.equal(Y1, Y2[6:2995,])
[1] TRUE
>

So this isn’t bad – the Stan version essentially ties R’s filter(), which appears to wrap some R bookkeeping code around C calls. Is there an obvious way to make it better?

The Stan code for the entire model is below. Basically it does linear regression on a set of predictors convolved with a simplex kernel. The predictors started out as eigenvectors from a principal components analysis and are unit scaled. Yes, the weird scaling of the observed data was intentional.

The model works very well and is producing results I believe. I’d be happy if it were faster. There may be other ways besides the convolution to do that. For example I did some sort of cross-validation exercise on the principal components that told me I should use 42 of them. That might be overkill. The kernel could use a tighter prior (I think it should have centrally concentrated mass) – I don’t know if that would help with speed though.

One other question: declaring a variable and assigning to it in a single statement works (I’m using rstan 2.14.1 at the moment). I haven’t seen any example code that does this, and I don’t recall an explicit statement in the language manual that it’s OK. Is this just a matter of programming style? A new feature? A deprecated feature?

Code follows:

functions {
  matrix convolve(matrix X, vector kernel) {
    int nr = rows(X);
    int nc = cols(X);
    int kl = rows(kernel);
    int kl1 = kl-1;
    matrix[nr-kl1, nc] Y;
    for (j in 1:nc) {
      for (i in 1:(nr-kl1)) {
        Y[i, j] = dot_product(X[i:(i+kl1), j], kernel);
      }
    }
    return Y;
  }
}
data {
  int<lower=1> nf;  //number of flux values
  int<lower=1> kl; //kernel length
  int<lower=1> nc;
  vector[nf] gflux;
  vector<lower=0>[nf] g_std;
  matrix[nf+kl-1, nc] X;
}
transformed data {
  vector[nf] y;
  vector[nf] y_std;
  real mu_g = mean(gflux);
  y = (gflux-mu_g)/mu_g;
  y_std = g_std/mu_g;
}
parameters {
  simplex[kl] kernel;
  vector[nc] beta;
  real alpha;
}
model {
  matrix[nf, nc] X_c;
  
  X_c = convolve(X, kernel);
  
  alpha ~ normal(0., 5.);
  beta ~ normal(0., 50.);
  y ~ normal(alpha + X_c * beta, y_std);
}
generated quantities {
  vector[nf] gfit;
  vector[nf] gflux_rep;
  vector[nf] log_lik;
  
  gfit = mu_g + mu_g*(alpha + convolve(X, kernel) * beta);
  for (i in 1:nf) {
    gflux_rep[i] = normal_rng(gfit[i], g_std[i]);
    log_lik[i] = normal_lpdf(gflux[i] | gfit[i], g_std[i]);
  }
}

It doesn’t look like there’s an FFT yet (https://github.com/stan-dev/math/issues/20), so your options are probably just to rewrite things using matrix-vector operations as much as possible. I think you can change:

for (j in 1:nc) {
  for (i in 1:(nr-kl1)) {
    Y[i, j] = dot_product(X[i:(i+kl1), j], kernel);
  }
}

into something like:

row_vector[kl] row_kernel = kernel';
for (i in 1:(nr-kl1)) {
  Y[i,] = row_kernel * X[i:(i+kl1),];
}

(Notice the transpose on the kernel.) Check this in R first to make sure I didn’t goof on the translation…
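One way to sanity-check the translation outside Stan is a quick numpy mock-up of both loops (illustrative shapes and data, not the real model):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
kernel = np.arange(1.0, 12.0)
kernel /= kernel.sum()
kl = len(kernel)
nr, nc = X.shape

# Original version: one dot product per output element.
Y1 = np.empty((nr - kl + 1, nc))
for j in range(nc):
    for i in range(nr - kl + 1):
        Y1[i, j] = X[i:i + kl, j] @ kernel

# Suggested rewrite: one row-vector-times-matrix product per output row.
Y2 = np.empty_like(Y1)
for i in range(nr - kl + 1):
    Y2[i, :] = kernel @ X[i:i + kl, :]

assert np.allclose(Y1, Y2)
```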

That worked! Thanks!

I get a better than factor of two improvement in execution time of the function with your suggested modification and the full model run on real data is ~30-40% faster.

Benchmark with new function temporarily added to code as “convolve2”:

> require(rstan)
> require(rbenchmark)
> 
> kernel <- c(1:6,5:1)
> kernel <- kernel/sum(kernel)
> set.seed(1234)
> X <- matrix(rnorm(3000*40),3000, 40)
> npvd <- stanc("npvd.stan")

starting worker pid=7975 on localhost:11959 at 08:55:48.189
> expose_stan_functions(npvd)
> 
> benchmark(Y1 <- convolve(X, kernel), 
+           Y2 <- convolve2(X, kernel),
+           Y3 <- filter(X, kernel, method="convolution"),
+           replications=1000)
                                             test replications
1                       Y1 <- convolve(X, kernel)         1000
2                      Y2 <- convolve2(X, kernel)         1000
3 Y3 <- filter(X, kernel, method = "convolution")         1000
  elapsed relative user.self sys.self user.child sys.child
1   5.943    2.617     5.886    0.046          0         0
2   2.271    1.000     2.251    0.017          0         0
3   5.206    2.292     5.183    0.013          0         0
> all.equal(Y1, Y3[6:2995,])
[1] TRUE
> all.equal(Y1, Y2)
[1] TRUE

Revised function:

  matrix convolve(matrix X, vector kernel) {
    int nr = rows(X);
    int nc = cols(X);
    int kl = rows(kernel);
    int kl1 = kl-1;
    row_vector[kl] kernelp = kernel';
    matrix[nr-kl1, nc] Y;
    for (i in 1:(nr-kl1)) {
        Y[i, ] = kernelp * X[i:(i+kl1), ];
    }
    return Y;
  }

Good to hear!

The other way to do this is just do what you were doing but copy things around to put all the multiplication n’ such into big matvecs:

row_vector[kl] row_kernel = kernel';
matrix[kl, nr - kl1] Xtmp;
for (j in 1:nc) {
  for (i in 1:(nr - kl1)) {
    Xtmp[, i] = X[i:(i + kl1), j];
  }

  Y[, j] = row_kernel * Xtmp;
}

Dunno if it’d perform any better/worse. Good luck!
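The "copy windows into a big matrix, then do one big product" idea is essentially im2col; a numpy sketch of the same restructuring (illustrative only, not Stan code):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
kernel = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
kernel /= kernel.sum()
kl = len(kernel)

# windows[i, j, :] is X[i:i+kl, j]; one product replaces both loops.
windows = sliding_window_view(X, kl, axis=0)  # shape (nr - kl + 1, nc, kl)
Y = windows @ kernel                          # shape (nr - kl + 1, nc)

# Reference: direct per-element dot products, as in the original function.
Yref = np.array([[X[i:i + kl, j] @ kernel for j in range(X.shape[1])]
                 for i in range(X.shape[0] - kl + 1)])
assert np.allclose(Y, Yref)
```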

Thanks again. It took me a while to figure out what you were suggesting. That code throws an error because the parser sees a row vector being assigned to a vector. This works though:

Y[, j] = (row_kernel * Xtmp)';

This is slower by ~60-80% than your first suggestion and just ties my first naive attempt, but it was worth a try!
