How to improve model speed as the number of datapoints increases

Hello, first time poster here. I am analyzing data from a psychology study. I have a model that works well on a 100-participant study which I conducted in person. I hosted an online version and was able to get 500 participants (great for power!), but I’ve had a lot of difficulty fitting the model to the larger dataset. For troubleshooting here I can only provide partial data (50 participants) because the data is unpublished, so my main problem (scaling) is not fully reproducible; my apologies.

I am using Stan on a computing cluster via CmdStan, so I have to set a timeout when I launch a sampling run. When I run the model on the smaller dataset it converges and takes about 15 hours to fit. When I ran the model on the larger dataset it made it 20% of the way through in 40 hours, so I projected that it would take roughly 200 hours to fit. As you can imagine, that’s a challenging turnaround time, so I am looking for advice on how to make the model more efficient. At this point I’m fairly sure that my priors are in the right range and that the model could fit if it had enough time.

data {
	int NS;                               // number of subjects
	int MT;                               // max number of trials per subject
	int NT[NS];                           // number of trials for each subject
	int choose_stay[NS,MT];               // 1 = stayed at the tree, 0 = left
	matrix[NS,MT] total_reward;  
	matrix[NS,MT] total_harvest_periods;  
	matrix[NS,MT] number_trees; 
	matrix[NS,MT] expected_reward;        // expected apples on the current trial
	matrix[NS,MT] is_1back;               // block-type indicators (1 = trial is in that block type)
	matrix[NS,MT] is_3back;  
	matrix[NS,MT] is_small;  
	matrix[NS,MT] is_large;  
}

parameters {
  vector[5] beta_ms; // group betas
  vector[5] beta_s[NS]; //per subject betas
  vector<lower=0>[5] tau;	// betas covariance scale
  corr_matrix[5] omega;   // betas covariance correlation 
  // beta_ms[1] = inv_temp 
  // beta_ms[2] = cost_1back
  // beta_ms[3] = cost_3back
  // beta_ms[4] = cost_small
  // beta_ms[5] = cost_large
}

transformed parameters {
	matrix[5,5] sigma;
	sigma = quad_form_diag(omega,tau); // covariance matrix
}

model {
  omega ~ lkj_corr(1); // prior on correlation
	tau ~ normal(30,30); // prior on covariance scale
	beta_ms ~ normal(50, 50); // prior on group betas
	for (s in 1:NS) {
	  beta_s[s] ~ multi_normal(beta_ms, sigma);
		for (t in 1:NT[s]) {
      choose_stay[s,t] ~ bernoulli_logit((beta_s[s][1]/100)*(expected_reward[s,t] -
                                  is_1back[s,t]*((total_reward[s,t] - beta_s[s][2]*number_trees[s,t])/total_harvest_periods[s,t]) -
                                  is_3back[s,t]*((total_reward[s,t] - (beta_s[s][2]+(beta_s[s][3]/10))*number_trees[s,t])/total_harvest_periods[s,t]) -
                                  is_small[s,t]*((total_reward[s,t] - beta_s[s][4]*number_trees[s,t])/total_harvest_periods[s,t]) -
                                  is_large[s,t]*((total_reward[s,t] - (beta_s[s][4]+(beta_s[s][5]/10))*number_trees[s,t])/total_harvest_periods[s,t])));
		}
	}
}

I am attaching the subset of the data that I can share as a .csv, the Python file I use to prepare the data and call the Stan model, and the .stan model file itself. Thanks for any ideas you have!

fit_stan_model.stan (1.8 KB)

fit_stan_50subs_data.csv (2.8 MB)

fit_stan_cmdstan_troubleshooting.py (3.3 KB)


First, I strongly suspect that the non-centered parameterization will sample faster given the binomial outcome. Here’s the relevant section of the Stan User Guide for how to achieve that.
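
For reference, the non-centered version of the subject-level prior would look roughly like this (an untested sketch of the User’s Guide pattern, not a drop-in replacement for your model; note that beta_s becomes a matrix with one column per subject):

parameters {
	vector[5] beta_ms;              // group betas
	matrix[5,NS] z;                 // standard-normal deviates per subject
	vector<lower=0>[5] tau;         // betas covariance scale
	cholesky_factor_corr[5] omega;  // Cholesky factor of the correlation
}
transformed parameters {
	// implies column s of beta_s ~ multi_normal(beta_ms, diag(tau)*R*diag(tau))
	matrix[5,NS] beta_s = rep_matrix(beta_ms, NS) + diag_pre_multiply(tau, omega) * z;
}
model {
	to_vector(z) ~ std_normal();
	omega ~ lkj_corr_cholesky(1);
	tau ~ normal(30,30);
	beta_ms ~ normal(50,50);
	// ... likelihood as before, now reading subject s's betas from col(beta_s, s) ...
}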

I also suspect that your data can be restructured to long format, which would enable vectorization of a bunch of your computations via rows_dot_product() as well as vectorizing the final likelihood. See here for a work-in-progress repo where I’m posting highly-optimized hierarchical models; I think your model would be a deviation from the “hwb” variant (hierarchical, within-subjects manipulations only, bernoulli outcome).
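
The general rows_dot_product() pattern, for a predictor that is linear in the betas, looks something like this (sketch only; your predictor multiplies everything by the inverse temperature, so it won’t drop in as-is):

data {
	int K;                               // total number of rows across all subjects
	int P;                               // number of predictors
	int NS;
	matrix[K,P] X;                       // one design row per trial
	int<lower=1,upper=NS> which_S[K];    // subject of each row
	int<lower=0,upper=1> choose_stay[K];
}
parameters {
	matrix[NS,P] beta_s;                 // (hierarchical priors omitted for brevity)
}
model {
	// pairs each row of X with that row's subject's betas, vectorizing the likelihood
	choose_stay ~ bernoulli_logit(rows_dot_product(X, beta_s[which_S]));
}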

I take that back about non-centered likely sampling faster; looking at your data it seems that there are ~700 observations for each of your 500 participants, so the likelihood will dominate the posterior geometry and centered is likely best (though I think multi_normal_cholesky() is faster than your multi_normal()). I also notice that you don’t have much in the way of redundancy in the data across rows, so you can’t use the sufficient-statistics trick to speed things up either. Would you mind describing the data/experiment a bit?

Thanks for these ideas! The data is from a ‘patch foraging’ game and works as follows: participants are ‘harvesting’ in a virtual ‘orchard’; on every trial they can harvest a ‘tree’ for ‘apples’ or they can move on to another tree. If they stay at a tree then choose_stay == 1, and if they leave, choose_stay == 0. An important column is ‘expected_reward’, which is how many apples they are expecting on a trial. My model posits that they compare the ‘expected_reward’ to the average reward rate of the block. The average reward rate is defined by terms like the one you see here:

(total_reward[s,t] - beta_s[s][2]*number_trees[s,t])/total_harvest_periods[s,t]

There are 4 block types, and which betas I use depends on the block type. The block types are manipulations of which task participants had to complete when they moved between trees. Each block type has an indicator (e.g. is_1back[s,t]) that is 1 when that block type is present on a given trial and 0 when it is not, so in my model formula a lot of terms zero out. I have wondered if that could be made more efficient.
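
Written out as an equation, the model in my Stan code above is

$$
\Pr(\texttt{choose\_stay}_{s,t} = 1) = \operatorname{logit}^{-1}\!\left(\frac{\beta_{s,1}}{100}\left(\texttt{expected\_reward}_{s,t} - \frac{\texttt{total\_reward}_{s,t} - c_{s,t}\,\texttt{number\_trees}_{s,t}}{\texttt{total\_harvest\_periods}_{s,t}}\right)\right),
$$

where the per-tree cost $c_{s,t}$ is $\beta_{s,2}$ in 1-back blocks, $\beta_{s,2} + \beta_{s,3}/10$ in 3-back blocks, $\beta_{s,4}$ in small blocks, and $\beta_{s,4} + \beta_{s,5}/10$ in large blocks.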


Give this a try:

//aria:compile=0
data {
	int NS;
	int K;
	int choose_stay[K];
	int<lower=1,upper=NS> which_S[K];
	vector[K] total_reward;
	vector[K] total_harvest_periods;
	vector[K] number_trees;
	vector[K] expected_reward;
	vector[K] is_1back;
	vector[K] is_3back;
	vector[K] is_small;
	vector[K] is_large;
}

parameters {
	vector[5] beta_ms; // group betas
	matrix[NS,5] beta_s; //per subject betas
	vector<lower=0>[5] tau;	// betas covariance scale
	cholesky_factor_corr[5] omega;   // betas covariance correlation
	// beta_ms[1] = inv_temp
	// beta_ms[2] = cost_1back
	// beta_ms[3] = cost_3back
	// beta_ms[4] = cost_small
	// beta_ms[5] = cost_large
}

model {
	omega ~ lkj_corr_cholesky(1); // prior on correlation
	tau ~ normal(30,30); // prior on covariance scale
	beta_ms ~ normal(50, 50); // prior on group betas
	matrix[5,5] L_omega = diag_pre_multiply(tau, omega) ;
	for (s in 1:NS) {
		beta_s[s] ~ multi_normal_cholesky(
			beta_ms
			, L_omega
		);
	}
	choose_stay ~ bernoulli_logit(
		(beta_s[which_S,1]./100).*(
			expected_reward
			- is_1back.*(
				(total_reward - beta_s[which_S,2].*number_trees)./total_harvest_periods
			)
			- is_3back.*(
				(total_reward - (beta_s[which_S,2]+(beta_s[which_S,3]./10)).*number_trees)./total_harvest_periods
			)
			- is_small.*(
				(total_reward - beta_s[which_S,4].*number_trees)./total_harvest_periods
			)
			- is_large.*(
				(total_reward - (beta_s[which_S,4]+(beta_s[which_S,5]./10)).*number_trees)./total_harvest_periods
			)
		)
	) ;
}
generated quantities{
	// r: correlations
	matrix[5,5] r = multiply_lower_tri_self_transpose(omega) ;
}

Note that the data input has changed so that you need to do less data munging beforehand; just leave the data in the same long format as your csv. K is now the total number of rows, and which_S is a vector indicating the subject number of each row (so it should be [1,1,1,1,.....,2,2,2,2,2,.....500,500,500,500] if your rows are ordered by subject first). I’ve managed to vectorize the likelihood, so it should be faster, and you could additionally use reduce_sum to parallelize that part even further if you have access to lots of CPU cores.
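
For the data-prep side, something like this should work with cmdstanpy (untested sketch; the csv column names and the .stan file name below are guesses, so swap in whatever your files actually use):

import pandas as pd
from cmdstanpy import CmdStanModel

# keep rows ordered by subject so which_S is contiguous per subject
df = pd.read_csv("fit_stan_50subs_data.csv").sort_values("subject")

stan_data = {
    "NS": df["subject"].nunique(),
    "K": len(df),
    "choose_stay": df["choose_stay"].astype(int).tolist(),
    # recode subject IDs to consecutive integers 1..NS
    "which_S": (pd.factorize(df["subject"])[0] + 1).tolist(),
    "total_reward": df["total_reward"].tolist(),
    "total_harvest_periods": df["total_harvest_periods"].tolist(),
    "number_trees": df["number_trees"].tolist(),
    "expected_reward": df["expected_reward"].tolist(),
    "is_1back": df["is_1back"].tolist(),
    "is_3back": df["is_3back"].tolist(),
    "is_small": df["is_small"].tolist(),
    "is_large": df["is_large"].tolist(),
}

fit = CmdStanModel(stan_file="fit_stan_model_long.stan").sample(data=stan_data, chains=4)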

But there’s still a lot of wasted computation going on in that model thanks to the is_1back*... - is_3back*... terms (i.e. when those indicators are zero, everything inside the expression they multiply is still computed, only to be zeroed out). You might consider breaking the data/likelihood into 4 parts instead, but I feel like there’s a more elegant solution on the tip of my tongue…
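
To make the “break it into 4 parts” idea concrete, you’d pass per-block-type row indices from Python and write one sampling statement per block type (untested sketch; the N_*/idx_* names are made up, and only two of the four statements are written out). In that version the is_* indicator vectors wouldn’t be needed at all:

data {
	// ... everything from the model above, plus:
	int N_1back;
	int<lower=1,upper=K> idx_1back[N_1back];  // rows belonging to 1-back blocks
	int N_3back;
	int<lower=1,upper=K> idx_3back[N_3back];  // rows belonging to 3-back blocks
	// ... and likewise for the small and large blocks ...
}
model {
	// ... same priors and subject-level model as above ...
	choose_stay[idx_1back] ~ bernoulli_logit(
		(beta_s[which_S[idx_1back],1]./100).*(
			expected_reward[idx_1back]
			- (total_reward[idx_1back] - beta_s[which_S[idx_1back],2].*number_trees[idx_1back])./total_harvest_periods[idx_1back]
		)
	);
	choose_stay[idx_3back] ~ bernoulli_logit(
		(beta_s[which_S[idx_3back],1]./100).*(
			expected_reward[idx_3back]
			- (total_reward[idx_3back] - (beta_s[which_S[idx_3back],2]+(beta_s[which_S[idx_3back],3]./10)).*number_trees[idx_3back])./total_harvest_periods[idx_3back]
		)
	);
	// ... analogous statements for the small and large blocks (using betas 4 and 5) ...
}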


Oops, I just realized that the multiplies in the likelihood need to be .*, else they’ll be treated as dot products. Fixing now… OK, fixed (I also had to use ./).

Oh, and if these truly are your informed priors:

	tau ~ normal(30,30); // prior on covariance scale
	beta_ms ~ normal(50, 50); // prior on group betas

Then you’ll probably want to tell the sampler to apply accompanying offsets/multipliers when declaring the parameters:

parameters{
	vector<offset=50,multiplier=50>[5] beta_ms; // group betas
	vector<lower=0,offset=30,multiplier=30>[5] tau;	// betas covariance scale
	...
}

Though you then divide all the subject beta[1]’s by 100, and beta[3] and beta[5] by 10. Hm. I guess composed transforms are still a WIP, so your best solution is:

//aria:compile=0
data {
	int NS;
	int K;
	int choose_stay[K];
	int<lower=1,upper=NS> which_S[K];
	vector[K] total_reward;
	vector[K] total_harvest_periods;
	vector[K] number_trees;
	vector[K] expected_reward;
	vector[K] is_1back;
	vector[K] is_3back;
	vector[K] is_small;
	vector[K] is_large;
}
transformed data{
	// prior offsets/multipliers ("OM"): the original normal(30,30) and normal(50,50)
	// priors with the /100 and /10 rescaling folded in, so the likelihood below
	// no longer needs those divisions
	vector[5] tau_prior_OM = [.3,30,3,30,3]';
	vector[5] beta_prior_OM = [.5,50,5,50,5]';
}
parameters {
	vector<offset=beta_prior_OM,multiplier=beta_prior_OM>[5] beta_ms; // group betas
	vector<lower=0>[5] tau_raw ;	// betas covariance scale
	matrix[NS,5] beta_s; //per subject betas
	cholesky_factor_corr[5] omega;   // betas covariance correlation
	// beta_ms[1] = inv_temp
	// beta_ms[2] = cost_1back
	// beta_ms[3] = cost_3back
	// beta_ms[4] = cost_small
	// beta_ms[5] = cost_large
}
transformed parameters{
	vector[5] tau = tau_raw .* tau_prior_OM ;
}
model {
	omega ~ lkj_corr_cholesky(1); // prior on correlation
	tau_raw ~ normal(1,1); // prior on covariance scale
	beta_ms ~ normal(beta_prior_OM, beta_prior_OM); // prior on group betas
	matrix[5,5] L_omega = diag_pre_multiply(tau, omega) ;
	for (s in 1:NS) {
		beta_s[s] ~ multi_normal_cholesky(
			beta_ms
			, L_omega
		);
	}

	choose_stay ~ bernoulli_logit(
		(beta_s[which_S,1]).*(
			expected_reward
			- is_1back.*(
				(total_reward - beta_s[which_S,2].*number_trees)./total_harvest_periods
			)
			- is_3back.*(
				(total_reward - (beta_s[which_S,2]+(beta_s[which_S,3])).*number_trees)./total_harvest_periods
			)
			- is_small.*(
				(total_reward - beta_s[which_S,4].*number_trees)./total_harvest_periods
			)
			- is_large.*(
				(total_reward - (beta_s[which_S,4]+(beta_s[which_S,5])).*number_trees)./total_harvest_periods
			)
		)
	) ;
}
generated quantities{
	// r: correlations
	matrix[5,5] r = multiply_lower_tri_self_transpose(omega) ;
}

Hi Mike, thank you sooo much for rewriting the script. I apologize for the delay; I wanted to make sure I really understood the new model in case I needed to ask more questions. I’ve read up on the changes you made and validated that the new model gives similar results to the original (there are small changes in the values, but I think that’s because the MCSE is lower in your new model, so the values might even be more accurate in the new one). I found that it gave a 50% reduction in computing time, which is a big deal! I am also finding that the computation time is similar for 50 participants and for 100 participants; next up I’m going to try the whole group, wish me luck! Best wishes, Laura