Repeated matrix operations -- big speedup / memory use reduction through independent subsets, other ideas?

The ctsem software (hierarchical state-space modelling in continuous / discrete time, with the model coded in Stan) requires evaluating two matrix exponentials (along with a number of other, less costly, matrix operations) for each timepoint of data: one for the state matrix (governing how the system behaves with respect to itself) and one for the Jacobian of the state matrix. This allows an extended Kalman filtering approach to integrate out the latent states.

Last week I was asked about a large problem where the system was running out of memory. I ended up addressing it by modifying ctsem to pre-compute independent subsets of the state and Jacobian matrices, then loop over these subsets and compose the full matrices. This gave about a 300% improvement in both speed and memory usage, which is wonderful. I guess the same approach could be applied to other matrix operations, so I thought mentioning it might be helpful, though I don’t know whether it would have the same impact elsewhere. Since such an improvement was tucked away for relatively little implementation effort, I’m wondering if anyone has other ideas I might be missing out on :)
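The identity being exploited here is that the exponential of a block-diagonal matrix is just the blockwise exponentials assembled, so several small matrix exponentials can replace one big one. A minimal Python illustration (scipy.linalg.expm standing in for Stan's matrix_exp; the matrices are made up for the example):

```python
import numpy as np
from scipy.linalg import expm

# Two independent blocks: the full system matrix is block diagonal.
A = np.array([[0.5, 0.1],
              [0.2, 0.3]])
B = np.array([[0.4]])
full = np.block([[A, np.zeros((2, 1))],
                 [np.zeros((1, 2)), B]])

# Exponentiate each block separately and assemble the result;
# this matches the exponential of the full matrix exactly.
blockwise = np.zeros_like(full)
blockwise[:2, :2] = expm(A)
blockwise[2:, 2:] = expm(B)

assert np.allclose(expm(full), blockwise)
```

Since matrix exponential cost grows roughly cubically with dimension, splitting an n-dimensional system into independent blocks can save substantial time and memory.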

Here is the subset matrix exponential function for the interested. The subsets integer array is of size n by nrows, where n is the number of subsets and nrows is the number of rows of the square matrix m. The first elements of each row subsets[i,] give the indices belonging to that subset; the remaining elements should be zero.

  matrix expmSubsets(matrix m, int[,] subsets){
    int nr = rows(m);
    matrix[nr, nr] e = rep_matrix(0, nr, nr);
    for (si in 1:size(subsets)) {
      int n = 0;
      for (j in 1:nr) {
        if (subsets[si, j] != 0) n += 1;
      }
      if (n > 1) {
        e[subsets[si, 1:n], subsets[si, 1:n]] = matrix_exp(m[subsets[si, 1:n], subsets[si, 1:n]]);
      } else if (n == 1) {
        e[subsets[si, 1], subsets[si, 1]] = exp(m[subsets[si, 1], subsets[si, 1]]);
      }
    }
    return e;
  }
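In case a runnable reference outside Stan is useful, here is a rough Python sketch of the same logic (hypothetical function name; scipy.linalg.expm stands in for matrix_exp, and rows are padded with -1 rather than 0 because Python indexing is 0-based):

```python
import numpy as np
from scipy.linalg import expm

def expm_subsets(m, subsets):
    """Blockwise matrix exponential over independent index subsets.

    Each row of `subsets` lists the (possibly non-contiguous) indices
    of one independent block, padded with -1 past the end.
    """
    nr = m.shape[0]
    e = np.zeros((nr, nr))
    for row in subsets:
        idx = [j for j in row if j >= 0]
        if len(idx) > 1:
            e[np.ix_(idx, idx)] = expm(m[np.ix_(idx, idx)])
        elif len(idx) == 1:
            e[idx[0], idx[0]] = np.exp(m[idx[0], idx[0]])
    return e

# Indices 0 and 2 interact; index 1 is independent of both.
m = np.array([[0.5, 0.0, 0.1],
              [0.0, 0.3, 0.0],
              [0.2, 0.0, 0.4]])
subsets = [[0, 2, -1], [1, -1, -1]]
assert np.allclose(expm_subsets(m, subsets), expm(m))
```

The final assertion checks the key property: as long as the subsets really are mutually independent (all cross-subset entries of m are zero), the blockwise result matches the full matrix exponential even when the indices are non-contiguous.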

Hi, @Charles_Driver. This is a typical kind of caching speedup where you can precompute a matrix of values and reuse it. This can be super helpful any time the exact same set of predictors gets reused for multiple items in a model.

Were you asking about speeding up the code you attached? It can take some squinting at the algebra and structure of what’s going on to find all the places where answers can be cached, which is hard for someone on the outside to do because it entails understanding what the code is doing.

There are more suggestions in the efficiency chapter of the User’s Guide. One thing to watch out for is conditionals—they’re very expensive compared to arithmetic. So rather than

for (j in 1:nr)
  if (subsets[si, j] != 0)
    n += 1;

I would recommend

int N = rows(m);
int n = 0;
for (j in 1:N)
  n += subsets[si, j] != 0;

But you might be able to get away with precomputing a bunch of these indexes in the transformed data block if the subsets argument never changes here.
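In Python terms, that per-call counting loop collapses to a one-time precomputation, which is the role a transformed data block would play here (illustrative sketch, using Stan-style rows padded with zeros):

```python
# Stan-style subsets: rows padded with zeros, 1-based indices.
subsets = [[1, 3, 0],
           [2, 0, 0]]

# Count the nonzero entries per row once, up front, rather than on
# every function call; this is what transformed data would store.
n_per_subset = [sum(1 for j in row if j != 0) for row in subsets]
assert n_per_subset == [2, 1]
```

Because subsets is data, these counts never change across iterations, so computing them once removes both the inner loop and its conditionals from the sampling hot path.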

Very cool! Agree with Bob about precomputing the n in transformed data. Are the slices in subsets contiguous? If so, then instead of storing each cell you need via subsets[si,1:n], you could just store the beginning and end of each slice in an array[sub_size, 2]. Then in the loop, when n > 1, you can do:

  /* Make subsets in transformed data:
   * first value in each row is the starting index for the subset,
   * second value in each row is the end index of the subset. */
  matrix expmSubsets(matrix m, int[,] subsets) {
    int nr = rows(m);
    matrix[nr, nr] exp_mat = rep_matrix(0, nr, nr);
    for (si in 1:size(subsets)) {
      if (subsets[si, 2] > subsets[si, 1]) {
        exp_mat[subsets[si, 1]:subsets[si, 2], subsets[si, 1]:subsets[si, 2]] =
          matrix_exp(m[subsets[si, 1]:subsets[si, 2], subsets[si, 1]:subsets[si, 2]]);
      } else if (subsets[si, 2] == subsets[si, 1]) {
        exp_mat[subsets[si, 1], subsets[si, 1]] = exp(m[subsets[si, 1], subsets[si, 1]]);
      }
    }
    return exp_mat;
  }

Subsetting by an array like e[subsets[si,1:n], subsets[si,1:n]] forces a copy, since the array values could be non-contiguous. But if you have contiguous values, then doing the from:to just makes a view into that slice of the matrix. It won’t be a huge speedup, but it should be a bit faster.
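NumPy has the same view-vs-copy distinction, which makes the difference easy to check (this is an illustration of the general principle, not of Stan internals):

```python
import numpy as np

m = np.arange(16, dtype=float).reshape(4, 4)

# Contiguous from:to slicing yields a view into the original storage.
view = m[1:3, 1:3]
assert np.shares_memory(view, m)

# Indexing with arrays of (possibly non-contiguous) indices forces a copy.
idx = [0, 2]
copy = m[np.ix_(idx, idx)]
assert not np.shares_memory(copy, m)
```

The copy itself is cheap relative to the matrix exponential, so the win from contiguous slicing is real but modest, as noted above.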

Also, if you must do one or the other, then you can get rid of the else if.

Thanks both, I didn’t know that about conditionals. The subsets are not contiguous, but given that this operation needs to be done twice for every observation and seems to account for about 70% of the overall iteration cost, I might see how much of a headache it is to rearrange the matrix internally so the sets are contiguous. @Bob_Carpenter, ideas based on the code are welcome, but it was more a request for general thoughts on this kind of repeated, somewhat sparse, matrix operation, which I appreciate may not be easy to give – this change was just so substantial and hadn’t occurred to me before, and I don’t see anything like it in the efficiency chapter either.
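If the rearrangement turns out to be worthwhile, the underlying identity is that a symmetric permutation commutes with the matrix exponential: expm(P m Pᵀ) = P expm(m) Pᵀ. So one can permute the matrix so each subset is a contiguous block, exponentiate the blocks with cheap contiguous slices, and permute back. A minimal Python sketch (illustrative values, scipy.linalg.expm standing in for matrix_exp):

```python
import numpy as np
from scipy.linalg import expm

# Non-contiguous independent subsets {0, 2} and {1, 3}.
m = np.array([[0.5, 0.0, 0.1, 0.0],
              [0.0, 0.3, 0.0, 0.2],
              [0.4, 0.0, 0.6, 0.0],
              [0.0, 0.1, 0.0, 0.7]])

# Permute rows and columns so each subset becomes a contiguous block.
perm = np.array([0, 2, 1, 3])
mp = m[np.ix_(perm, perm)]

# Exponentiate the now-contiguous blocks, then undo the permutation:
# expm(P m P^T) = P expm(m) P^T, so permuting back recovers expm(m).
ep = np.zeros_like(mp)
ep[:2, :2] = expm(mp[:2, :2])
ep[2:, 2:] = expm(mp[2:, 2:])
inv = np.argsort(perm)
e = ep[np.ix_(inv, inv)]

assert np.allclose(e, expm(m))
```

Since the subset structure is fixed data, the permutation and its inverse could be computed once up front, leaving only contiguous block slicing inside the per-timepoint loop.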

The only general advice is to cache things rather than recomputing if they involve autodiff variables. This can get almost arbitrarily hard, as you can see in something like an HMM where you have a fairly complex memoization scheme to reduce an exponential problem to a linear one.
