Convergence issues for 3-level multivariate linear model

I am relatively new to Stan, which I am using to estimate a mixed-effects multivariate model with 3 levels of nested effects. I have some questions about the estimation and the model diagnostics, and would appreciate any insights from this community.

First, my model: I have 6 Ys; my observations are nested within patients, which are nested within teams:

data {
  int<lower=1> K;   // number of Ys (6)
  int<lower=1> J;   // number of Xs (2 for testing)
  int<lower=0> N;   // number of obs
  int<lower=1> M;   // number of patients
  int<lower=1> T;   // number of teams

  vector[J] x[N];   // Xs
  vector[K] y[N];   // Ys
  int<lower=1, upper=M> pp[N];       // patient ID for each observation

  int<lower=1, upper=T> teamlu[M];   // team lookup: team ID for each patient
}
parameters {
  matrix[K,J]  betaX;             // K ys, J xs, K x J = # betas
  vector[K]  beta0;               // beta0 - intercept 
  
  // level one errors
  cholesky_factor_corr[K] E_Omega;            
  vector<lower=0>[K] e_sigma;    
  
  // level 2 random effect
  vector[K] u_tp[M];
  vector<lower=0>[K] sigma_Utp;
  corr_matrix[K] OmegaM;           
  
  // level 3 random effect
  vector[K] u_t[T];                
  vector<lower=0>[K] sigma_Ut;     
  corr_matrix[K] OmegaT;           
}
transformed parameters {
  // varying intercepts
  // level 2
  vector[K] beta0_tp[M];
  // level 3
  vector[K] beta0_t[T];
  
  vector[K] mu[N];
  // level 3
  for (t in 1:T) {
    beta0_t[t] = beta0 + u_t[t];
  }
  
  // level 2
  for (m in 1:M) {
    beta0_tp[m] = beta0_t[teamlu[m]]+u_tp[m];
  }
 
  // mu for the model
  for (n in 1:N) {
    mu[n] = beta0_tp[pp[n]] + betaX*x[n];
  }
}

model {
 // patients
 OmegaM ~ lkj_corr(K);
 sigma_Utp ~ exponential(1);

 // Teams
 OmegaT ~ lkj_corr(K);
 sigma_Ut ~ exponential(1);

 // epsilons
 matrix[K,K] E_Sigma;
 E_Sigma = diag_pre_multiply(e_sigma,E_Omega);
 E_Omega ~ lkj_corr_cholesky(K);            
 e_sigma ~ cauchy(0,5);
 
 vector[K] nullK = rep_vector(0, K);

 to_vector(betaX) ~ normal(0, 5);  // weakly informative prior
 beta0 ~ normal(0, 5);             // weakly informative prior

 u_t ~ multi_normal(nullK, quad_form_diag(OmegaT, sigma_Ut));
 u_tp ~ multi_normal(nullK, quad_form_diag(OmegaM, sigma_Utp));

 y ~ multi_normal_cholesky(mu, E_Sigma);
}

generated quantities {
  matrix[K, K] Omega;
  matrix[K, K] Sigma;
  Omega = multiply_lower_tri_self_transpose(E_Omega);
  Sigma = quad_form_diag(Omega, e_sigma); 
}

I am estimating this model using -rstan-; my test sample has ~13,000 obs, nested in ~3,000 patients, nested in 15 teams. The machine has 32 cores, so I run 32 chains of 600 iterations each, 400 of those being warmup, leaving me with 200 x 32 = 6,400 post-warmup iterations for inference.
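
For reference, here is a minimal sketch of how I am launching the run (the file name and the data-list name are placeholders, not my actual code):

library(rstan)
options(mc.cores = parallel::detectCores())  # run chains in parallel
rstan_options(auto_write = TRUE)             # cache the compiled model

fit <- stan(
  file   = "threelevel.stan",   # the model above, saved to a file
  data   = standat,             # list with K, J, N, M, T, x, y, pp, teamlu
  chains = 32,
  iter   = 600,
  warmup = 400
)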

Three questions:

  1. Least important question: Is there an obvious way to speed things up? This takes 8 hours to run on a 3 GHz i9 with 32 cores and 128 GB of RAM, which would be fine except that I'm working with about 1/4 of the full dataset and I have some additional features to add to the model. (As an aside: the processor has 24 physical cores, but R's parallel::detectCores() finds the 32 logical processors; is there any advantage to using 24 instead of 32?)

  2. Moderately important question: I get a warning that there is 1 divergent transition. I understand that divergent transitions can indicate model misspecification, but is a single one enough to be concerned about? Web searches turn up different answers to this question.

  3. Most important question: I get these three warnings:

  • The largest R-hat is NA, indicating chains have not mixed.
  • Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
  • Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
    I understand these are largely the same issue, and they all occur for E_Omega, the intermediate error matrix; here are the first few elements of the 6x6 matrix:
             mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
E_Omega[1,1] 1.00     NaN 0.00 1.00 1.00 1.00 1.00  1.00   NaN  NaN
E_Omega[1,2] 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN
E_Omega[1,3] 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN
E_Omega[1,4] 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN
E_Omega[1,5] 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN
E_Omega[1,6] 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN
E_Omega[2,1] 0.33    0.00 0.02 0.31 0.32 0.33 0.33  0.42    20 2.28
E_Omega[2,2] 0.94    0.00 0.01 0.91 0.94 0.94 0.95  0.95    19 2.60
E_Omega[2,3] 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN
E_Omega[2,4] 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN
E_Omega[2,5] 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN
E_Omega[2,6] 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN

Since my interest is in the betas, do I need to worry about this? If I do need to worry about it, how can I best figure out what the problem is? I have used -pairs()- and some -bayesplot- utilities to look at what is going on, but this is a 6x6 matrix, and it is difficult to sort out what is driving what with so many pairwise comparisons. I did simplify the model to have only 3 Y variables, thinking it would be easier to troubleshoot, but then I got no warnings of any sort (yay?). Do I just need to increase the iterations?
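
For what it's worth, the kinds of calls I have been using look roughly like this, pulling out only a few parameters at a time (the particular elements shown are just examples):

pairs(fit, pars = c("E_Omega[2,1]", "E_Omega[3,1]", "E_Omega[3,2]", "lp__"))

library(bayesplot)
draws <- as.array(fit)                            # iterations x chains x parameters
mcmc_trace(draws, regex_pars = "^E_Omega\\[2,")   # trace plots for one row of E_Omega
mcmc_pairs(draws, pars = c("E_Omega[2,1]", "e_sigma[1]", "e_sigma[2]"))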

Thanks in advance for any insights.

Jeph

How about canceling the covariance matrix in the multivariate normal distribution?
Also, brms is capable of fitting three-level hierarchical models and multivariate models, and surely using brms can help you avoid some potential coding errors.

Thanks for the suggestions. I started with -brms-, thinking it would at least provide a benchmark. I haven’t used it before, but I think this is the same model as mine:

library(brms)
model <- bf(mvbind(y1, y2, y3, y4, y5, y6) ~ x1 + x2 + (1|p|pat) + (1|q|pat:team))
fit <- brm(model, data = df, chains = 24)

I used the -brms- defaults of 2000 iterations, with 1000 warmup. This produced 1095 divergent transitions, and 3 additional warnings about Rhats being too large. Is the above specification incorrect?

Seems fine to me. Considering you have 24,000 post-warmup samples in total, 1,095 divergent transitions is not a very shocking number. Maybe increasing the number of iterations and the length of the warmup can resolve this warning.
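
For example, something along these lines (the numbers are only illustrative):

fit <- brm(model, data = df, chains = 24,
           iter = 4000,     # up from the default 2000
           warmup = 2000)   # up from the default 1000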