Hello, first-time poster here. I am analyzing data from a psychology study. I have a model that fits well on a 100-participant study that I conducted in person. I then hosted an online version and collected 500 participants (great for power!), but I've had a lot of difficulty fitting the model to the larger dataset. For troubleshooting I can only share partial data (50 participants) because the study is unpublished, so my main problem (scaling) is unfortunately not fully reproducible; my apologies.
I am using Stan on a computing cluster via CmdStan, so I have to set a wall-clock timeout when I launch a sampling job. When I run the model on the smaller dataset it converges and takes about 15 hours to fit. When I ran it on the larger dataset it got 20% of the way through sampling in 40 hours, so I project roughly 200 hours for a full fit. As you can imagine, that's a challenging turnaround time, so I am looking for advice on making the model more efficient. At this point I'm fairly confident that my priors are in the right range and that the model would fit if given enough time.
data {
  int<lower=1> NS; // number of subjects
  int<lower=1> MT; // maximum number of trials per subject
  array[NS] int<lower=0, upper=MT> NT; // trials completed by each subject
  array[NS, MT] int<lower=0, upper=1> choose_stay; // 1 = chose to stay on this trial
  matrix[NS, MT] total_reward;
  matrix[NS, MT] total_harvest_periods;
  matrix[NS, MT] number_trees;
  matrix[NS, MT] expected_reward;
  matrix[NS, MT] is_1back; // condition indicators (0/1)
  matrix[NS, MT] is_3back;
  matrix[NS, MT] is_small;
  matrix[NS, MT] is_large;
}
parameters {
  vector[5] beta_ms; // group betas
  array[NS] vector[5] beta_s; // per-subject betas
  vector<lower=0>[5] tau; // betas covariance scale
  corr_matrix[5] omega; // betas covariance correlation
  // beta_ms[1] = inv_temp
  // beta_ms[2] = cost_1back
  // beta_ms[3] = cost_3back
  // beta_ms[4] = cost_small
  // beta_ms[5] = cost_large
}
transformed parameters {
  matrix[5, 5] sigma = quad_form_diag(omega, tau); // covariance matrix
}
model {
  omega ~ lkj_corr(1); // prior on correlation
  tau ~ normal(30, 30); // prior on covariance scale
  beta_ms ~ normal(50, 50); // prior on group betas
  for (s in 1:NS) {
    beta_s[s] ~ multi_normal(beta_ms, sigma);
    for (t in 1:NT[s]) {
      choose_stay[s, t] ~ bernoulli_logit((beta_s[s][1] / 100) * (expected_reward[s, t]
        - is_1back[s, t] * ((total_reward[s, t] - beta_s[s][2] * number_trees[s, t]) / total_harvest_periods[s, t])
        - is_3back[s, t] * ((total_reward[s, t] - (beta_s[s][2] + beta_s[s][3] / 10) * number_trees[s, t]) / total_harvest_periods[s, t])
        - is_small[s, t] * ((total_reward[s, t] - beta_s[s][4] * number_trees[s, t]) / total_harvest_periods[s, t])
        - is_large[s, t] * ((total_reward[s, t] - (beta_s[s][4] + beta_s[s][5] / 10) * number_trees[s, t]) / total_harvest_periods[s, t])));
    }
  }
}
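One idea I have been considering but have not yet tried is the non-centered, Cholesky-factored parameterization of the subject-level hierarchy, which I understand is the standard recommendation for models like this. A rough sketch of how the parameters and model blocks would change (parameter names `z` and `L_omega` are mine; the likelihood is unchanged and is elided below):

parameters {
  vector[5] beta_ms; // group betas
  matrix[5, NS] z; // standardized per-subject offsets
  vector<lower=0>[5] tau; // covariance scale
  cholesky_factor_corr[5] L_omega; // Cholesky factor of the correlation
}
transformed parameters {
  // beta_s = beta_ms + diag(tau) * L_omega * z, one column per subject
  matrix[5, NS] beta_s = rep_matrix(beta_ms, NS) + diag_pre_multiply(tau, L_omega) * z;
}
model {
  to_vector(z) ~ std_normal(); // implies beta_s ~ multi_normal(beta_ms, sigma)
  L_omega ~ lkj_corr_cholesky(1);
  tau ~ normal(30, 30);
  beta_ms ~ normal(50, 50);
  // ... likelihood as before, with col(beta_s, s) in place of beta_s[s]
}

My understanding is that this avoids repeatedly factoring the 5x5 covariance matrix inside the subject loop and often improves the posterior geometry for hierarchical models, but I have not verified it on this model, so corrections are welcome.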
I am attaching the subset of the data that I can share as a .csv, the Python script I use to prepare the data and call the Stan model, and the .stan model file itself. Thanks for any ideas you have!
fit_stan_model.stan (1.8 KB)
fit_stan_50subs_data.csv (2.8 MB)
fit_stan_cmdstan_troubleshooting.py (3.3 KB)