Tip for speeding up models with repeated observations

Just thought I’d share a performance tip that can yield 2-5x speedups when you have a model with repeated observations for each unique combination of predictors.

Attached is a demo of a hierarchical model with a binomial outcome and two 2-level predictors, one “within-subjects” and one “between-subjects”; the code generalizes to more predictors or more levels per predictor, but I thought I’d keep it simple for this example. The speedup should also work for non-hierarchical models; I just happened to have a demo of the hierarchical version mostly coded already.

The “slower” model is essentially the one used in the manual for the 8-schools data in section 9.13 (apart from variable names and the binomial outcome), while the “faster” model uses slightly different data variables and replaces the line:

y ~ bernoulli_logit( rows_dot_product( s_vals[s] , w ) ) ;

with:

{
  // mu: condition-by-subject values for the latent mean
  matrix[rows_w,n_s] mu ;
  for(i in 1:n_s){
    for(j in 1:rows_w){
      mu[j,i] = dot_product(s_vals[i],w[j]) ;
    }
  }
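  // note: to_vector(mu) flattens the matrix column-major, so each
  // observation's obs_index_in_flat_mu is (subject - 1) * rows_w + row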
  y ~ bernoulli_logit( to_vector(mu)[obs_index_in_flat_mu] ) ;
}

So, the trick is to pre-compute in R the unique rows of the within-subject design matrix, so that you can then have Stan compute only the unique entries in the condition-by-subject matrix (mu). You also pre-compute (again, in R) each observation’s index into a flattened version of mu, so you can finish with a single vectorized call to bernoulli_logit().
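
For concreteness, the faster model’s data block might look something like the following sketch, reusing the names from the snippet above (n_w and n_y are names I’m assuming for the dimensions not shown in the excerpt):

data {
  int<lower=1> n_s ;      // number of subjects
  int<lower=1> n_w ;      // number of within-subject predictors (columns of w)
  int<lower=1> rows_w ;   // number of unique rows in the within-subject design matrix
  matrix[rows_w,n_w] w ;  // the unique design-matrix rows
  int<lower=1> n_y ;      // total number of observations
  int<lower=0,upper=1> y[n_y] ;  // Bernoulli outcomes
  // index of each observation in to_vector(mu), pre-computed in R
  int<lower=1,upper=rows_w*n_s> obs_index_in_flat_mu[n_y] ;
}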

Hopefully the R code is well-commented enough to be clear, but let me know if you have any questions.

p.s. There’s actually still some redundant computation in the between-subjects design matrix that you could apply this trick to as well, but I usually have much less redundancy there than in the within-subject matrix, so I haven’t gotten around to making that part more efficient yet.

binomial_mixed_demo.R (6.1 KB)
generate_data.R (2.0 KB)
binomial_mixed_slower.stan (1.9 KB)
binomial_mixed_faster.stan (2.1 KB)

Thanks for sharing. I discuss some simple cases in the manual’s chapter on efficiency under the heading “sufficient statistics” (there, just collapsing repeated observations into counts).

You can also do this in transformed data within Stan. Not as convenient, but it leaves the interface to the Stan program cleaner.
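
For example, if the data included each observation’s subject id and design-matrix row id, the index could be built in transformed data like this (a sketch; s and w_row are assumed names for those two id variables):

transformed data {
  int obs_index_in_flat_mu[n_y] ;
  for (n in 1:n_y) {
    // to_vector() flattens column-major, so cell (w_row, s) of mu
    // lands at position (s - 1) * rows_w + w_row in the flat vector
    obs_index_in_flat_mu[n] = (s[n] - 1) * rows_w + w_row[n] ;
  }
}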

As you have it, you could compute the vector directly and avoid the copying, like this:

vector[rows_w * n_s] mu_vec;
int pos = 1;
for (i in 1:n_s)
  for (j in 1:rows_w) {
    mu_vec[pos] = dot_product(s_vals[i], w[j]);
    pos += 1;
  }
y ~ bernoulli_logit(mu_vec[obs_index_in_flat_mu]);

It’ll be even faster if you reformulate this as matrix multiplication.

You can then also collect sufficient statistics on the y side. That is, for each distinct value of obs_index_in_flat_mu, count how many observations share it and how many of those have y equal to 1; given just those counts, the likelihood can be binomial_logit. That’s the speedup I talked about, as I wasn’t assuming there’d be much overlap in the predictors.
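
In code, the collapse might look like this sketch (n_cells, trials, successes, and cell_index are assumed names; the counts would be pre-computed alongside obs_index_in_flat_mu):

// data: one entry per observed cell of mu
int<lower=1> n_cells ;                               // at most rows_w * n_s
int<lower=0> trials[n_cells] ;                       // observations per cell
int<lower=0> successes[n_cells] ;                    // count of y == 1 per cell
int<lower=1,upper=rows_w*n_s> cell_index[n_cells] ;  // each cell's position in to_vector(mu)

// model: replaces the vectorized bernoulli_logit statement
successes ~ binomial_logit( trials, to_vector(mu)[cell_index] ) ;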

Ah, yes, that’s an additional speedup that can be added for a binomial case like this (and one I keep forgetting about!). The speedup from avoiding redundant computations amidst repeated observations applies more generally to any outcome type.

Neat! I’ll try these additional tweaks and report back on any further speedups.

Can you elaborate on what you mean here?

Yes. This operation

matrix[rows_w,n_s] mu ;
for(i in 1:n_s){
  for(j in 1:rows_w){
    mu[j,i] = dot_product(s_vals[i],w[j]) ;
  }
}

should be reformulated as the matrix multiply w * s_vals, which will require some rearranging of the data types for s_vals or w, or a copy into two matrices.

Ah, yes, defining/computing s_vals as:

// s_vals: within-subject coefficient values, one column per subject
matrix[n_w, n_s] s_vals ;
s_vals = transpose(b * coef + transpose(diag_pre_multiply(sds,cor_helper) * multi_normal_helper) );

one can then simply do:

y ~ bernoulli_logit( to_vector(w*s_vals)[obs_index_in_flat_mu] ) ;

in the model block. This achieves about a 10% speedup over the non-matrix-multiplication version.
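
Folding in the sufficient-statistics collapse suggested above would then be a one-line change, reusing the assumed trials / successes / cell_index names from that sketch:

successes ~ binomial_logit( trials, to_vector(w*s_vals)[cell_index] ) ;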