Elementwise multiplications with row sums appear to be slow

Hi all,

I am working on age dependent branching processes where we have recently cast the resultant renewal equations into a simple numerical solving algorithm relying on element wise multiplications of three matrices and a row sum. These matrices are not huge, maybe n=500 is the largest case.

After timing my code, the loop doing the heavy lifting is as follows. Essentially, you have three matrices, A, B and C, and you populate the B matrix iteratively in a loop. For now assume A and C as fixed data. A barebones code block is as follows:

(note a) I do some matrix indexing already to help speed things up and not waste computation b) ignore for now that i starts from 3)

transformed parameters {
    matrix[N,N] B = rep_matrix(0,N,N);
    vector[N] convolution=rep_vector(0,N);
    vector[N] E_inc=rep_vector(0,N);
    for(i in 3:N){	
      convolution = (revmat(C[1:N,1:(i-1)]).*A[1:N,1:(i-1)].*B[1:N,1:(i-1)])*rep_vector(1.0,(i-1));
      B[1:N,i] = rep_vector(1,N) + convolution;
    }
    E_inc = diagonal(B);
} 

There is a function i have coded called revmat which simply reverses the columns.

  matrix revmat(matrix X){
    int n = rows(X);
    int n2 = cols(X);
    matrix[n,n2] s;
    for(i in 1:n2){
      s[1:n,i] = X[1:n,n2-i+1];
    }
    return s;
  }

This loop, even for modest N=100 is impractically slow and would take weeks to generate even small samples. Its also pretty slow ‘optimizing’ too, but that just about works in a sensible time. My questions are:

  1. Is there an obvious thing I am doing wrong that is slowing things down? After combing this forum I have seen best practice is to vectorise and thats what i’ve tried to do.

  2. Mathematically, I could cast the problem as three nested loops instead of as a matrix operation - would this work better in stan?

  3. For reference, in base R the above computation for N=500, runs in 0.25-0.3 seconds.

Ive always thought modern computers excel at element wise multiplications and row sums, so the slowness was a surprise to me, but this probably exposes my general ignorance than anything meaningful.

I very much appreciate any suggestions

Sam

p.s the paper behind what Im trying to do is here [2107.05579] Unifying the effective reproduction number, incidence, and prevalence under a stochastic age-dependent branching process where I use stan in optimizing mode.

Try to turn the revmat function use a double loop and most importantly try to use the specialized functions for reductions from stan (like dot_product, crossprod, and all these).

Make all the rep_* things in transformed data

transformed data {
    matrix[N,N]  zero_mat = rep_matrix(0,N,N);
    vector[N] zero_vec = rep_vector(0,N);
   vector[N] one_vec = rep_vector(1, N);
 }
...
transformed parameters {
    matrix[N,N] B = zero_mat;
    vector[N] convolution =  zero_vec;
    vector[N] E_inc = zero_vec;
    for(i in 3:N){	
      convolution = (revmat(C[1:N,1:(i-1)]) .* A[1:N,1:(i-1)] .* B[1:N,1:(i-1)]) * one_vec[1:i - 1];
      B[1:N,i] = one_vec + convolution;
    }
    E_inc = diagonal(B);
} 

Next, the multiplication of column i can be cached so that you only do it once. You can get rid of some of the assignments of zeros (unless you use B later, b/c it will have NAs in the first 2 columns other than the diagonal).

transformed data {
   vector[N] one_vec = rep_vector(1, N);
 }
...
transformed parameters {
    matrix[N, N] B;
    vector[N] E_inc;
   matrix[N,N] cache;
  
   E_inc[1:2] = [0.0, 0.0]';
   cache[, 1:2] = C[1:N,1:2]) .* A[1:N,1:2] .* B[1:N,1:2];
   for(i in 3:N){	
      cache[, i] = C[1:N, i ] .* A[1:N, i] .* B[1:N, i];
      B[1:N, i] = one_vec + revmat(cache[ , 1:i - 1]) * one_vec[1:i - 1];
    }
    E_inc[3:N] = diagonal(B[3:N, 3:N]);
} 

See if that helps. Though be cautious as I haven’t tested it.

Thank you for your time @spinkney .The transformed data thing is really sensible, i suppose this works better for memory management?

Sadly however I’m still seeing:

1000 transitions using 10 leapfrog steps per transition would take 1194.05 seconds (with or without a likelihood) - a single computation for the same data takes 0.05 seconds in base R.

For N=100, Perhaps I am being ignorant (do let me know if this is case) and what I am asking is more computation than I realise, but for 100 x 100 matrices with row sums I would think it faster than this?

Regarding the Caching, I had thought to do that, but the revmat is only on C, and so each iteration requires recalculation as the C is different. I will think more on this though.

Thank you also for your time @wds15 - so your suggestion would be to have two loops with the inbuilt dotproduct? This is what we have done in the past for simpler convolutions, however here I am not sure it will be faster. Ill try.

Ah yea, too bad. Probably best using @wds15 suggestions then.

What’s the purpose of multiplying by a vector of ones?

Just to perform a row sum. I could write a function to do this and have, but found that a matrix vector multiplication and a looped row sum had almost the same overhead

Ah, gotcha.

And double-checking my understanding: with your code, B[1:N, 1:2] will always be zeroes, and B[1:N, 3] will always be ones, right?

No sorry, I should have been more clear there. The A matrix will be populated with positive real values. The C matrix is populated with positive real values bounded between 0 and 1 (its a probability). The B matrix in a nutshell does a discrete convolution for N shifts. The matrix computation is supposed to be a more computationally efficient way to numerically solve the following renewal equation via Riemann sums

b(t,l) = h(t,l) + \int_{u=0}^t a(l+u)b(t-u,l + u)c(u) du

A few other details - all matrices are lower triangular hence the indexing 1:(i-1) and not 1:N. There are some annoying details I won’t bore you with

when i==3, the first iteration of the loop, B is still full of zeroes, so won’t the element-wise product

C[1:N,1:(i-1)]) .* A[1:N,1:(i-1)] .* B[1:N,1:(i-1)]

necessarily then be zero?

For loops work really well in Stan. Moving whatever you can into transformed data is very important. Finally, when there is a Stan function for any reduction, then by all means, you should use it. Also have a look at profiling your code to now exactly where things take time.

Sorry again, for brevity I ignored the initial conditions h as they wont affect the computation speed

    B[1:N,1] = rep_vector(1,N);
    B[1:N,2] = rep_vector(1,N)+(C[1:N,1].*A[1:N,1].*B[1:N,1]);

Without this initial condition yes there is only one solution, the trivial one of zero

Thanks @wds15 ill try to change everything to a loop and see how that works

Gotcha.

I’m not 100% confident in this, but I think it’d help to pre-compute the A.*C products before the loop and index into that

Well, definitely don’t bother trying to turn matrix .* matrix to a loop, @wds15 was merely advising that you shouldn’t (as many R users do) fear that loops are dramatically inefficient in Stan.

I do really like this idea. Apologies to ask to further pick your brain, what would be the best structure to store these precomputed matrices - a multi dimensional array?

Regarding the loop, I mean I could do the direct Riemann sum

    b[1,1:N]=rep_row_vector(1,N);
    for(i in 2:N){
      for(l in 1:(N-i+1)){
        convolution = 0;
        for(j in 1:(i-1)){
          convolution = convolution +   a[l+j-1]*b[i-j,l+j]*c[j];
        }
        b[i,l] = 1 + convolution;
      }
    }

But initial indications suggest this is not much faster with a likelihood. I suppose maybe this is my answer, there are more computations that i think

Use profiling while testing any attempts at compute optimization.

Oops, my bad; in my translation of your code to R so I could play, I put the revmat on the full product, but I see it’s just on C which means my idea that there was redundant computations that could be pre-computed is incorrect.

But they’re parameters in your real use-case, yes?

depends on the modelling context and what we know, normally A would be a random walk process or spline basis, and C would be fixed before hand. So maybe disregard that assumption - i did not want to wade into this yet given even with them fixed things were intractable slow