Three-level HLM taking a long time


I have a nested three-level model that is taking a prohibitively long time to run. I’m wondering if the long run-time has to do with the large amount of data or number of parameters being estimated, or if I’ve specified things poorly.

Some details:
‘A’ is the highest level, with 40 distinct groups (N_A = 40)
‘B’ is the next highest level, with 441 distinct groups (N_B = 441)
‘C’ is the lowest level, with 656 distinct groups (N_C = 656)
I have about 80,000 records (N ~ 80000). Every record falls into exactly one C, B, and A group.
They way it’s currently written, there end up being over 10,000 parameters that Stan estimates. My only goal is prediction, so if it is possible to cut down on the number of parameters, I’m sure that would help.
Running the model with the default adapt_delta = .8 has many divergences.
Increasing the adapt_delta to .99 helped some, though out of 175 post warmup samples, 21 were divergent and the remaining 154 reached the maximum tree_depth. The maximum tree_depth was already set at 15, and I fear that setting it any higher will only increase the run time.
As it is, it takes ~ 31 hours to do 100 iterations.

I’ve tried a non-centered parametrization as well, but that only took longer and had more divergences, and I’ve read that non-centered tends to be better when there isn’t much data which clearly isn’t the case here.
I’ve also tried vectorizing where possible, though that is difficult given the subscripts needed to identify groups.
Also, despite the for loops in the generated quantities section, I know that is not the cause of the long run-time.

If anyone can offer any insights on how I can speed up convergence, I would be much obliged.

  int<lower=0>  N;                    //number of observations
  int<lower=0>  N_new;                //number of hold out observations
  int<lower=1>  N_C;                  //number of C
  int<lower=1>  N_B;                  //number of B
  int<lower=1>  N_A;                  //number of A
  int<lower=0>  K;                    //number of covariates
  matrix[N,K]   x;                    //predictor matrix
  matrix[N_new,K] x_new;              //hold out predictors
  int<lower=1,upper=N_C> C[N];        //vector of C identifiers
  int<lower=0,upper=N_C> C_new[N_new]; //C identifiers for hold out
  int<lower=1,upper=N_B> C_B_id[N_C];  //vector of B identifiers for each C
  int<lower=1,upper=N_A> B_A_id[N_B];  //vector of A identifiers for each B
  vector[K]     zero;                 //zero vector 
  vector[N]     y;                    //response variable

  real<lower=0> sigma;                      //standard deviation
  real<lower=0> nu;                         //degrees of freedom for student_t
  real          kappa_alpha_A[N_A];     //mean for each A
  real          mu_alpha_B[N_B];        //particular B effect
  real<lower=0> sigma_mu_alpha_B[N_A];  //sd for B effect
  real          mu_alpha_C[N_C];        //particular C effect
  real<lower=0> sigma_mu_alpha_C[N_B];  //sd for C effect 
  vector[K]     kappa_beta_A[N_A];      //mean for each A
  vector[K]     mu_beta_B[N_B];         //particular B effect
  corr_matrix[K] Omega_B[N_A];          //corr for B cov matrix prior
  vector<lower=0>[K] tau_B[N_A];        //scale for B cov matrix prior
  vector[K]     mu_beta_C[N_C];         //particular C effect
  corr_matrix[K] Omega_C[N_B];          //corr for C cov matrix
  vector<lower=0>[K] tau_C[N_B];        //scale for C cov matrix

transformed parameters{
  real         alpha[N_C];
  vector[K]    beta[N_C];
  matrix[K, K] Sigma_B[N_A];                       //sigma B matrix
  matrix[K, K] Sigma_C[N_B];                       //sigma C matrix
  matrix[K,K]  Identity_10;                         
  for(i in 1:N_C){
    alpha[i] = kappa_alpha_A[B_A_id[C_B_id[i]]] + mu_alpha_B[C_B_id[i]] + mu_alpha_C[i]; 
    //alpha = A + B + C
  for(i in 1:N_C){
    beta[i] = kappa_beta_A[B_A_id[C_B_id[i]]] + mu_beta_B[C_B_id[i]] + mu_beta_C[i]; 
    //beta = A + B + C
  for(i in 1:N_A){
    Sigma_B[i] = quad_form_diag(Omega_B[i], tau_B[i]);
  for(i in 1:N_B){
    Sigma_C[i] = quad_form_diag(Omega_C[i], tau_C[i]);

  Identity_10 = diag_matrix(rep_vector(10, K) );

  //mean vector for vectorization, outside of student_t
  vector[N] mu;
  for(n in 1:N){
    mu[n] = alpha[C[n]] + x[n]*beta[C[n]];
  //params for student_t
  sigma ~ cauchy(0,2.5);                                //prior
  nu ~ gamma(2,0.1);                                    //prior
  sigma_mu_alpha_B ~ cauchy(0, 2.5);                  //hyperprior
  sigma_mu_alpha_C ~ cauchy(0, 2.5);                  //hyperprior
  for(i in 1:N_A){
    tau_B[i] ~ cauchy(0, 2.5);                        //hyperprior
  for(i in 1:N_A){
    Omega_B[i] ~ lkj_corr(2);                         //hyperprior
  for(i in 1:N_B){
    tau_C[i] ~ cauchy(0,2.5);                         //hyperprior
  for(i in 1:N_B){
    Omega_C[i] ~ lkj_corr(2);                         //hyperprior
  //mus and kappas
  for(i in 1:N_B){
    mu_alpha_B[i] ~ normal(0, sigma_mu_alpha_B[B_A_id[i]]);
  for(i in 1:N_C){
    mu_alpha_C[i] ~ normal(0, sigma_mu_alpha_C[C_B_id[i]]);
  for(i in 1:N_B){
    mu_beta_B[i] ~ multi_normal(zero, Sigma_B[B_A_id[i]]);
  for(i in 1:N_C){
    mu_beta_C[i] ~ multi_normal(zero, Sigma_C[C_B_id[i]]);
  kappa_alpha_A ~ normal(0,10);                      //hyperprior 
  kappa_beta_A ~ multi_normal(zero, Identity_10);    //hyperprior
  y ~ student_t(nu, mu, sigma);

generated quantities{
  vector[N_new]     y_new;     //replicated data
  vector[N]         y_trainrep;                 //replicated training data

  // can't vectorize here: student_t_rng won't take vector
  for(n in 1:N_new){
    y_new[n] = student_t_rng(nu, alpha[C_new[n]]
                             + x_new[n]*beta[C_new[n]], sigma);
  for(n in 1:N){
    y_trainrep[n] = student_t_rng(nu, alpha[C[n]]
                             + x[n]*beta[C[n]], sigma);

Hi, without deeply understanding your model, just some possible hints:

  • In my experience, 80 000 datapoints took hours in a more simple model, so the speed isn’t unbelievably slow
  • note that you can index by a vector, so instead of
  for(i in 1:N_B){
    mu_alpha_B[i] ~ normal(0, sigma_mu_alpha_B[B_A_id[i]]);

you can do
mu_alpha_B ~ normal(0, sigma_mu_alpha_B[B_A_id]);

  • sigma ~ cauchy(0,2.5) is very wide - see if you can justify 3-DOF student-t or even normal(0,x) (Andrew, Dan and Ben have some interesting discussion on this somewhere).
  • do you need all those transformed parameters? Moving the transformed params to model block will mean they will not be stored in memory and it is possible your performance is memory bound (if Stan runs out of RAM and starts swapping, it will be painfully slow). If the memory is still full after this, consider reducing the number of chains running in parallel.
  • You might want to check if your model can be expressed with INLA (, I have had some pleasant experiences with it recently and it scales way better than Stan, only does a limited class of models though. It is an approximation, but the approximation is very good in my experience.

Some of the approximations in INLA will be making their way into Stan at some point soonish. Not sure what the status is, but I know @anon75146577 and others have been working on it.

1 Like

It’s going sloooowwwwllllllllllyyyyyyyyyyyyyyy

Some of that discussion is summarized in this blog post from last week.

Thank you, Martin (and everyone else) for the replies. Your points about vectorization and narrowing the prior for sigma are well-taken; I will implement those changes now and check the the performance.

As for the transformed parameters, the reason most of them are there is because I want to use them for vectorization later. For instance, I create the transformed parameters alpha and beta here

… so I can use them in the model block here:

Is there perhaps a better way to create alpha and beta that I am missing?
Also, could I potentially just move the creation of alpha and beta as is to the modeling block, and not in the transformed parameters block? If so, that is something I did not know was allowed.

Yes, this was exactly my point - basically if you cut and paste the transformed parameters block to model (making sure all variable declarations are at the beginning) you get the same posterior, only the samples for transformed params will not be stored.

But I realized you probably should get the same effect without modifying the model by setting the pars option in the call to stan(). Not sure how that’s implemented internally, but it looks like it could mean that only samples for the parameters you specify will be stored, saving memory.

If only there were a reliable and fast approximation for implementing the reliable and fast approximations!

1 Like

Ok, awesome. Thank you for bringing this to my attention.
I’ll report back once I’ve tested things out.

Sorry for the lag in following-up.

I implemented the suggestions you put forward, but unfortunately they did not seem to help. I’ve posted the elapsed times below for the original version and then the new version with suggested changes.

original model (in seconds):
warmup sample
chain:1 74253.7 32090.4
chain:2 57899.1 34150.9
chain:3 83779.3 29293.9
chain:4 69947.8 35086.1
chain:5 84931.6 26174.0
chain:6 63339.0 37548.1
chain:7 79398.1 32131.2

new model with suggested changes (seconds):
warmup sample
chain:1 65407.6 6.4396
chain:2 84867.0 31042.6000
chain:3 72088.2 32880.4000
chain:4 72646.3 30519.5000
chain:5 73441.9 32122.6000
chain:6 63690.4 34782.9000
chain:7 70021.7 31895.9000

Clearly, chain 1 went divergent in the new model which explains the short sampling time.

In taking a closer look at the data, however, it’s come to my attention that there is a sparsity that I haven’t really noticed before. About 50% of the data has an A-B-C hierarchy that appears with 30 observations or less (and there are 85K observations in total). Could this sparsity play a role in the difficulties my model is having?

Beyond the suggestions I had above, I am way out of my depth :-) I can only offer some process advice to (hopefully) learn what’s going wrong:

  • Try fitting simulated datasets of various sizes. If the model recovers params (see the validation paper) and takes reasonable times when the size of the data is comparable to the real data, it may mean that the model is not a good fit for your data.
  • Simplify the model until it works quickly for simulated data (from the simpilified model) and then add features one by one to see where the problem is

IMHO it is possible that your model + dataset size is really not feasible with current version of Stan, so exploring alternatives might be necessary.

Hey @Carson_L, I’m short on time at the moment, but I’d recommend starting a fresh topic to get more eyes on this. Realistically, if you make it a relatively short post (linking back to this one as needed to avoid repeating) that should also help.

In my experience yes. I find that this kind of structure comes up all the time and is very difficult for Stan. It ends up that part of your model works well under the centered parameterization and part works well in the non-centered. I haven’t found a perfect solution, but the best suggestion I can offer is try to identify relevant pairs of parameters for a diagnostic - usually parameters and the ones directly above & below them in the hierarchy at different levels of sparsity. Then try both the centered and non centered parameterizations to figure out which ones have the least pathalogical geometries, and be comfortable with the model taking a long time. If you can simulate a smaller dataset and still reproduce the problem, you can shorten your iteration time which will make developing a lot faster.

Thank you to everyone for the feedback.

Unfortunately, the 80,000 data points is a smaller dataset than the real one I will need my model to run on (though it is real, not simulated data). I’ve done a bit of building up from smaller, simpler models and this seems to be the stage where things get stuck. I’m starting to believe @martinmodrak that Stan may not currently be up to the task of what I need, especially in light of @aaronjg 's experience as well.

All of these suggestions have been very helpful though, so thank you very much.

How small of a fraction are you working with? It may be slow, but if you are able to test and validate the model on a portion of the data set why not just let it run for a week or two?

The actual dataset has several million data points. The primary goal of this model is for prediction. I will have new data almost daily and will need to update my predictions accordingly, thus waiting several weeks for the model to run is not feasible.

Got it. One approach that can be used here is to run the full model to get an empirical distribution for the hyperparameters and then use importance sampling to simulate draws from the posterior distribution. However, the feasibility of this depends on the dimensions of the data you are updating and the stationarity of the hyperparameters.

That’s not a bad idea, though I will have to look into how feasible it is. I appreciate the continued suggestions. Thanks!