Three-level HLM taking a long time



I have a nested three-level model that is taking a prohibitively long time to run. I’m wondering if the long run-time has to do with the large amount of data or number of parameters being estimated, or if I’ve specified things poorly.

Some details:
‘A’ is the highest level, with 40 distinct groups (N_A = 40)
‘B’ is the next highest level, with 441 distinct groups (N_B = 441)
‘C’ is the lowest level, with 656 distinct groups (N_C = 656)
I have about 80,000 records (N ~ 80000). Every record falls into exactly one C, B, and A group.
They way it’s currently written, there end up being over 10,000 parameters that Stan estimates. My only goal is prediction, so if it is possible to cut down on the number of parameters, I’m sure that would help.
Running the model with the default adapt_delta = .8 has many divergences.
Increasing the adapt_delta to .99 helped some, though out of 175 post warmup samples, 21 were divergent and the remaining 154 reached the maximum tree_depth. The maximum tree_depth was already set at 15, and I fear that setting it any higher will only increase the run time.
As it is, it takes ~ 31 hours to do 100 iterations.

I’ve tried a non-centered parametrization as well, but that only took longer and had more divergences, and I’ve read that non-centered tends to be better when there isn’t much data which clearly isn’t the case here.
I’ve also tried vectorizing where possible, though that is difficult given the subscripts needed to identify groups.
Also, despite the for loops in the generated quantities section, I know that is not the cause of the long run-time.

If anyone can offer any insights on how I can speed up convergence, I would be much obliged.

  int<lower=0>  N;                    //number of observations
  int<lower=0>  N_new;                //number of hold out observations
  int<lower=1>  N_C;                  //number of C
  int<lower=1>  N_B;                  //number of B
  int<lower=1>  N_A;                  //number of A
  int<lower=0>  K;                    //number of covariates
  matrix[N,K]   x;                    //predictor matrix
  matrix[N_new,K] x_new;              //hold out predictors
  int<lower=1,upper=N_C> C[N];        //vector of C identifiers
  int<lower=0,upper=N_C> C_new[N_new]; //C identifiers for hold out
  int<lower=1,upper=N_B> C_B_id[N_C];  //vector of B identifiers for each C
  int<lower=1,upper=N_A> B_A_id[N_B];  //vector of A identifiers for each B
  vector[K]     zero;                 //zero vector 
  vector[N]     y;                    //response variable

  real<lower=0> sigma;                      //standard deviation
  real<lower=0> nu;                         //degrees of freedom for student_t
  real          kappa_alpha_A[N_A];     //mean for each A
  real          mu_alpha_B[N_B];        //particular B effect
  real<lower=0> sigma_mu_alpha_B[N_A];  //sd for B effect
  real          mu_alpha_C[N_C];        //particular C effect
  real<lower=0> sigma_mu_alpha_C[N_B];  //sd for C effect 
  vector[K]     kappa_beta_A[N_A];      //mean for each A
  vector[K]     mu_beta_B[N_B];         //particular B effect
  corr_matrix[K] Omega_B[N_A];          //corr for B cov matrix prior
  vector<lower=0>[K] tau_B[N_A];        //scale for B cov matrix prior
  vector[K]     mu_beta_C[N_C];         //particular C effect
  corr_matrix[K] Omega_C[N_B];          //corr for C cov matrix
  vector<lower=0>[K] tau_C[N_B];        //scale for C cov matrix

transformed parameters{
  real         alpha[N_C];
  vector[K]    beta[N_C];
  matrix[K, K] Sigma_B[N_A];                       //sigma B matrix
  matrix[K, K] Sigma_C[N_B];                       //sigma C matrix
  matrix[K,K]  Identity_10;                         
  for(i in 1:N_C){
    alpha[i] = kappa_alpha_A[B_A_id[C_B_id[i]]] + mu_alpha_B[C_B_id[i]] + mu_alpha_C[i]; 
    //alpha = A + B + C
  for(i in 1:N_C){
    beta[i] = kappa_beta_A[B_A_id[C_B_id[i]]] + mu_beta_B[C_B_id[i]] + mu_beta_C[i]; 
    //beta = A + B + C
  for(i in 1:N_A){
    Sigma_B[i] = quad_form_diag(Omega_B[i], tau_B[i]);
  for(i in 1:N_B){
    Sigma_C[i] = quad_form_diag(Omega_C[i], tau_C[i]);

  Identity_10 = diag_matrix(rep_vector(10, K) );

  //mean vector for vectorization, outside of student_t
  vector[N] mu;
  for(n in 1:N){
    mu[n] = alpha[C[n]] + x[n]*beta[C[n]];
  //params for student_t
  sigma ~ cauchy(0,2.5);                                //prior
  nu ~ gamma(2,0.1);                                    //prior
  sigma_mu_alpha_B ~ cauchy(0, 2.5);                  //hyperprior
  sigma_mu_alpha_C ~ cauchy(0, 2.5);                  //hyperprior
  for(i in 1:N_A){
    tau_B[i] ~ cauchy(0, 2.5);                        //hyperprior
  for(i in 1:N_A){
    Omega_B[i] ~ lkj_corr(2);                         //hyperprior
  for(i in 1:N_B){
    tau_C[i] ~ cauchy(0,2.5);                         //hyperprior
  for(i in 1:N_B){
    Omega_C[i] ~ lkj_corr(2);                         //hyperprior
  //mus and kappas
  for(i in 1:N_B){
    mu_alpha_B[i] ~ normal(0, sigma_mu_alpha_B[B_A_id[i]]);
  for(i in 1:N_C){
    mu_alpha_C[i] ~ normal(0, sigma_mu_alpha_C[C_B_id[i]]);
  for(i in 1:N_B){
    mu_beta_B[i] ~ multi_normal(zero, Sigma_B[B_A_id[i]]);
  for(i in 1:N_C){
    mu_beta_C[i] ~ multi_normal(zero, Sigma_C[C_B_id[i]]);
  kappa_alpha_A ~ normal(0,10);                      //hyperprior 
  kappa_beta_A ~ multi_normal(zero, Identity_10);    //hyperprior
  y ~ student_t(nu, mu, sigma);

generated quantities{
  vector[N_new]     y_new;     //replicated data
  vector[N]         y_trainrep;                 //replicated training data

  // can't vectorize here: student_t_rng won't take vector
  for(n in 1:N_new){
    y_new[n] = student_t_rng(nu, alpha[C_new[n]]
                             + x_new[n]*beta[C_new[n]], sigma);
  for(n in 1:N){
    y_trainrep[n] = student_t_rng(nu, alpha[C[n]]
                             + x[n]*beta[C[n]], sigma);


Hi, without deeply understanding your model, just some possible hints:

  • In my experience, 80 000 datapoints took hours in a more simple model, so the speed isn’t unbelievably slow
  • note that you can index by a vector, so instead of
  for(i in 1:N_B){
    mu_alpha_B[i] ~ normal(0, sigma_mu_alpha_B[B_A_id[i]]);

you can do
mu_alpha_B ~ normal(0, sigma_mu_alpha_B[B_A_id]);

  • sigma ~ cauchy(0,2.5) is very wide - see if you can justify 3-DOF student-t or even normal(0,x) (Andrew, Dan and Ben have some interesting discussion on this somewhere).
  • do you need all those transformed parameters? Moving the transformed params to model block will mean they will not be stored in memory and it is possible your performance is memory bound (if Stan runs out of RAM and starts swapping, it will be painfully slow). If the memory is still full after this, consider reducing the number of chains running in parallel.
  • You might want to check if your model can be expressed with INLA (, I have had some pleasant experiences with it recently and it scales way better than Stan, only does a limited class of models though. It is an approximation, but the approximation is very good in my experience.


Some of the approximations in INLA will be making their way into Stan at some point soonish. Not sure what the status is, but I know @Daniel_Simpson and others have been working on it.


It’s going sloooowwwwllllllllllyyyyyyyyyyyyyyy

Some of that discussion is summarized in this blog post from last week.


Thank you, Martin (and everyone else) for the replies. Your points about vectorization and narrowing the prior for sigma are well-taken; I will implement those changes now and check the the performance.

As for the transformed parameters, the reason most of them are there is because I want to use them for vectorization later. For instance, I create the transformed parameters alpha and beta here

… so I can use them in the model block here:

Is there perhaps a better way to create alpha and beta that I am missing?
Also, could I potentially just move the creation of alpha and beta as is to the modeling block, and not in the transformed parameters block? If so, that is something I did not know was allowed.


Yes, this was exactly my point - basically if you cut and paste the transformed parameters block to model (making sure all variable declarations are at the beginning) you get the same posterior, only the samples for transformed params will not be stored.

But I realized you probably should get the same effect without modifying the model by setting the pars option in the call to stan(). Not sure how that’s implemented internally, but it looks like it could mean that only samples for the parameters you specify will be stored, saving memory.


If only there were a reliable and fast approximation for implementing the reliable and fast approximations!


Ok, awesome. Thank you for bringing this to my attention.
I’ll report back once I’ve tested things out.