Speed of matrix multiplication versus multiplication loops

cjs_data.csv (6.3 KB)

Hello! I’ve been working in Stan for a few years now (mainly building mark-recapture models), and while I love this program, there are many things I do not fully understand, particularly about speed and optimization. As I’ve started to build larger and more complex models, run times are getting excessive (e.g., weeks), and I’m wondering whether there is somewhere in my code where I am unknowingly being inefficient.

Specifically, since a lot of my code involves matrix math, I wanted to know the most efficient way to do this. There seems to be a lot of discussion about how it is more efficient to use matrices and vectorization; however, I find matrix multiplication to be much slower in Stan than just running loops (see the simple example below). Is there something I am doing inefficiently? The dataset I am actually fitting is much larger than this toy example (my actual model has 25 states, 42 occasions, and additional state uncertainty; I always run it using multiplication loops because the matrix math is too slow), so any speed-up tips would be really useful. Also, would using sparse matrices be helpful for this type of model?

The simple model code I have provided is a Cormack-Jolly-Seber (CJS) model to estimate survival and capture probability for animals. Animals in the study area are vulnerable to capture during multiple, discrete capture occasions. When an animal is captured, it is given an individually identifiable mark so that it can be monitored through time. In Stan, we use the marginalized likelihood framework described in Yackulic et al. (2020), and my example code follows this framework. I am using rstan v2.21.2.
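For context, both examples below implement the same forward recursion over the latent alive/dead state (notation matching the code): writing $pz_t(j)$ for the probability of being in state $j$ at occasion $t$ together with the observed history up to $t$,

$$pz_t(j) = \sum_{k=1}^{2} pz_{t-1}(k)\,\phi_{t-1}(k,j)\,pmat_{t-1}(j, y_t), \qquad llik_i = n_i \log \sum_{j=1}^{2} pz_{Nocc}(j),$$

where $\phi_{t-1}(k,j)$ is the probability of moving from state $k$ to state $j$, $pmat_{t-1}(j, y_t)$ is the probability of observation $y_t$ given state $j$, and $n_i$ (= sumFR[i]) is the number of individuals sharing capture history $i$.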

Thank you so much! I’ve attached the simulated data I used for this example as a .csv file (first 10 columns = sumCH, 11th column = sumFR, 12th column = sumf), with NsumCH = dim(sumCH)[1] and Nocc = 10.

Maria

Yackulic, Charles B., et al. “A need for speed in Bayesian population models: a practical guide to marginalizing and recovering discrete latent states.” Ecological Applications 30.5 (2020): e02112.

/////////////////////////////////////////////////////////////////////////////////////////////////////
// Example 1: matrix multiplication using loops

data{

  int NsumCH; //number of unique capture histories
  int Nocc; //number of occasions
  int sumCH[NsumCH,Nocc]; //capture history matrix (1= captured, 2= unobserved)
  int sumFR[NsumCH]; //number of individuals with particular capture history 
  int sumf[NsumCH]; //first capture occasion for each capture history

}

parameters{
  real mu_ls;                    // mean survival (logit scale)
  real<lower=0, upper=5> sd_ls;  // sd survival (logit scale)
  vector[Nocc-1] z_ls;           // temporal RE on survival (logit scale)
  real mu_lp;                    // mean capture probability (logit scale)
  real<lower=0, upper=5> sd_lp;  // sd capture probability (logit scale)
  vector[Nocc-1] z_lp;           // temporal RE on capture probability (logit scale)
}


transformed parameters{

   vector[Nocc-1] s;         // survival (probability scale)
   vector[Nocc-1] p;         // capture probability (probability scale)
   vector[2] phi[2,Nocc-1];  // transition matrix (from-state, time)
   vector[2] pmat[2,Nocc-1]; // observation probability matrix (state, time)
   vector[2] temp;
   vector[2] pz[Nocc];       // forward (marginal) state probabilities
   vector[NsumCH] llik;      // log likelihood contribution of each capture history

   for (t in 1:(Nocc-1)) {
      s[t] = inv_logit(mu_ls + sd_ls * z_ls[t]);
      p[t] = inv_logit(mu_lp + sd_lp * z_lp[t]);

      phi[1,t,1] = s[t];     // alive -> alive
      phi[1,t,2] = 1 - s[t]; // alive -> dead
      phi[2,t,1] = 0;        // dead -> alive
      phi[2,t,2] = 1;        // dead -> dead

      pmat[1,t,1] = p[t];     // captured | alive
      pmat[1,t,2] = 1 - p[t]; // not captured | alive
      pmat[2,t,1] = 0;        // captured | dead
      pmat[2,t,2] = 1;        // not captured | dead
   }

   for (i in 1:NsumCH) {
      pz[sumf[i],1] = 1; // conditioning on first capture: alive with certainty
      pz[sumf[i],2] = 0;
      for (t in (sumf[i]+1):Nocc) {
         for (j in 1:2) {
            for (k in 1:2) {
               temp[k] = pz[t-1,k] * phi[k,t-1,j] * pmat[j,t-1,sumCH[i,t]];
            }
            pz[t,j] = sum(temp);
         }
      }
      llik[i] = sumFR[i] * log(sum(pz[Nocc,1:2]));
   }

}

model{

   mu_ls ~ normal(0, 2);
   mu_lp ~ normal(0, 2);
   z_ls ~ normal(0, 1);
   z_lp ~ normal(0, 1);

   target += sum(llik);

}

The looped version took ~90 seconds.

/////////////////////////////////////////////////////////////////////////////////////////////////////
// Example 2: matrix multiplication
// Note: I rearranged some of the indices here so that I could use
// vectors/matrices for multiplication. The data block is identical to example 1.

parameters{
  real mu_ls;
  real<lower=0, upper=5> sd_ls;
  vector[Nocc-1] z_ls;
  real mu_lp;
  real<lower=0, upper=5> sd_lp;
  vector[Nocc-1] z_lp;
}


transformed parameters{

   vector[Nocc-1] s;
   vector[Nocc-1] p;
   matrix[2,2] phi[Nocc-1];  // transition matrix (stored transposed relative to example 1)
   vector[2] pmat[2,Nocc-1]; // pmat[obs, t] = observation probabilities by state
   vector[NsumCH] llik;
   vector[2] pz[Nocc];

   for (t in 1:(Nocc-1)) {
      s[t] = inv_logit(mu_ls + sd_ls * z_ls[t]);
      p[t] = inv_logit(mu_lp + sd_lp * z_lp[t]);

      phi[t,1,1] = s[t];     // alive -> alive
      phi[t,2,1] = 1 - s[t]; // alive -> dead
      phi[t,1,2] = 0;        // dead -> alive
      phi[t,2,2] = 1;        // dead -> dead

      pmat[1,t,1] = p[t];     // captured | alive
      pmat[2,t,1] = 1 - p[t]; // not captured | alive
      pmat[1,t,2] = 0;        // captured | dead
      pmat[2,t,2] = 1;        // not captured | dead
   }

   for (i in 1:NsumCH) {
      pz[sumf[i],1] = 1;
      pz[sumf[i],2] = 0;
      for (t in (sumf[i]+1):Nocc) {
         pz[t] = (phi[t-1] * pz[t-1]) .* pmat[sumCH[i,t], t-1];
      }
      llik[i] = sumFR[i] * log(sum(pz[Nocc]));
   }

}

model{

   mu_ls ~ normal(0, 2);
   mu_lp ~ normal(0, 2);
   z_ls ~ normal(0, 1);
   z_lp ~ normal(0, 1);

   target += sum(llik);

}

The matrix multiplication version took ~200 seconds, quite a bit longer.



Maybe start by using the profiler built into Stan?
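For example, with Stan 2.26 or later (available through cmdstanr) you can wrap the suspect section in a profile block and then read the timings with fit$profiles(). A minimal sketch against example 1 (the label "pz-forward" is just a name I picked):

transformed parameters{
   // ... same declarations and phi/pmat setup as in example 1 ...

   profile("pz-forward") {  // time and autodiff cost reported under this label
      for (i in 1:NsumCH) {
         pz[sumf[i],1] = 1;
         pz[sumf[i],2] = 0;
         for (t in (sumf[i]+1):Nocc) {
            for (j in 1:2) {
               for (k in 1:2) {
                  temp[k] = pz[t-1,k] * phi[k,t-1,j] * pmat[j,t-1,sumCH[i,t]];
               }
               pz[t,j] = sum(temp);
            }
         }
         llik[i] = sumFR[i] * log(sum(pz[Nocc,1:2]));
      }
   }
}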

Thanks! I used the profiler in cmdstanr on my ‘bigger’ code, and it did identify the pz calculation as the slowest part of the code (even when I use loops in lieu of matrix math). Does anyone have a solution for making this part faster? The profiler tool is useful; I’m looking forward to using it more!

I won’t be able to look into your model in detail, but I can offer a few comments:

Much of the intuition here comes from the fact that dense matrix operations are highly optimized on most processors. That said, optimal performance may be realized only for matrices that are small enough (to fit into processor registers or local caches) or large enough (to amortize any loop overhead).

Stan’s autodiff implementations for its linear algebra functions were relatively inefficient up until the most recent versions of Stan (which haven’t been exposed in RStan yet). In particular, they didn’t utilize those hyper-efficient matrix operations when calculating gradients. That could complicate the comparison if you’re using an older version of Stan.

Matrix operations may not be superior if the matrices are sparse. In that case, a loop that avoids unnecessary operations can be much faster than a dense matrix operation that wastes a lot of time multiplying and adding zeros.
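In your 2-state CJS example, for instance, the dead state is absorbing and never observed, so most entries of phi and pmat are 0 or 1 and the forward step collapses to a couple of scalar updates. A sketch based on my reading of your example 1 coding (1 = captured, 2 = unobserved), so double-check it against your data:

// replaces the j/k loops inside the individual loop of example 1
for (t in (sumf[i]+1):Nocc) {
   if (sumCH[i,t] == 1) {  // captured: only possible if alive
      pz[t,1] = pz[t-1,1] * s[t-1] * p[t-1];
      pz[t,2] = 0;
   } else {                // not captured
      pz[t,1] = pz[t-1,1] * s[t-1] * (1 - p[t-1]);
      pz[t,2] = pz[t-1,1] * (1 - s[t-1]) + pz[t-1,2];
   }
}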

It looks like you’re using the element-wise product .*, which may not be considered a matrix operation. The real efficiency gains come from matrix-vector products; you should be able to combine the calculation of pz[t] and the element-wise product into a single matrix-vector (or maybe matrix-matrix) product, which could be more efficient (see the sketch below).
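Concretely, (phi[t-1] * pz[t-1]) .* pmat[o, t-1] equals diag(pmat[o, t-1]) * phi[t-1] * pz[t-1], so you could precompute one combined matrix per observation type and occasion and make each forward step a single matrix-vector product. A sketch against your example 2 (phiP is a name I made up):

// declared alongside the other transformed parameters
matrix[2,2] phiP[2, Nocc-1];  // combined observation-x-transition matrices

// built once per leapfrog step, after phi and pmat are filled in
for (t in 1:(Nocc-1))
   for (o in 1:2)  // o = observation type (1 = captured, 2 = unobserved)
      phiP[o,t] = diag_pre_multiply(pmat[o,t], phi[t]);

for (i in 1:NsumCH) {
   pz[sumf[i],1] = 1;
   pz[sumf[i],2] = 0;
   for (t in (sumf[i]+1):Nocc)
      pz[t] = phiP[sumCH[i,t], t-1] * pz[t-1];  // one matrix-vector product
   llik[i] = sumFR[i] * log(sum(pz[Nocc]));
}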

Since you mentioned using cmdstanr, you may also get substantial speedups by enabling more compiler optimizations than the default; see Options for improving Stan sampling speed – The Stan Blog.
