Are divergences acceptable in this case?

I have a large hierarchical regression Stan model that essentially integrates four data sets to give a partially pooled regression across 7 time points. I’m integrating two large public opinion panel surveys with dropouts, different sampling structures, dates of administration, and different spacing between waves, but the same questionnaire. This is real-life data and I can’t get more because the survey was in 2016. Eventually I will poststratify, since the survey data isn’t completely representative and the population parameters are of interest. I keep getting divergences in the model. I run 5 chains with 9000 iterations each, 2500 warmup, adapt_delta = .99, and max_treedepth = 15, so I have 37500 post-warmup draws. I split the data into training and testing sets.
I have a model that builds 32 covariance matrices per data set (because the data is only normal within demographic groups) on the repeated measures; that model has 1521 divergent transitions, and a simpler model with one covariance matrix has 988 divergent transitions. The divergences occur throughout the chains, so a longer warmup would not fix them. If you count each element of the covariance matrices as one parameter, there are 2978 parameters, so this is a very large model. I know divergences are a red flag that the model has some issues, but I’m not sure those issues are biasing the regression coefficients. The model is complicated because most of the data is repeated measurements. The repeated measures are obviously correlated, but the covariance matrices are hard to estimate. I haven’t tried a non-centered parameterization, but I don’t know how else I could fix it. The problem is likely in estimating the covariance matrices, some of which have very few data points, and I’m not sure a non-centered parameterization would fix divergences that relate to the covariance matrices. Should I try a different prior on the covariance matrices so that I get better behavior when there are few or no data points to estimate a given covariance? I’ve tried LKJ(1) and LKJ(2), and LKJ(2) performed better.

I think this might be a situation where there isn’t a perfect model for the data that would not diverge. I’ve tried a Cauchy prior on the variances, but that diverged more. I’ve also played with all the hyperparameters of the priors. So far the effects on the regression coefficients appear minimal: in the pairs() plots, the draws for the coefficients from divergent transitions don’t look any different from the non-divergent ones. Plots of the residuals on the testing data set look normal with homogeneous variances, which is exactly what they should be. There are minimal differences between the residuals of the testing and training data sets.

I’ve considered simulating data from this model but I’m a little hesitant to do so because this model is complicated and takes days to run.

I will actually fit this model 16 times, so I want to be sure what I have is good before I start running all 16 versions. Basically, I want a second opinion on whether the divergences make the model invalid if the fits of the parameters of interest seem to be unaffected and unbiased. Also, this model performs better (lower residuals) than fitting single regression models at each time point, which I think is another sign the model is working.

Below is the code:

// Four survey groups observed at different subsets of the 7 time points.
data {
  int<lower=0> N; // total sample size
  int<lower=1> D; // dimension of non-demographic covariates
  int<lower=1> K; // number of demographic covariates
  int<lower=1> T; // number of time points
  //vector[N] S; // survey group, S = 1 for T1 only, S = 2 for T1+T7, S = 3 for T2-T7, S = 4 for T7 only
  //vector[N] C; // covariance matrix group
  //matrix[T, N] y; // response matrix
  //matrix[D, T] x[N]; // data matrix
  // group 1
  int<lower=1> N_g1; // sample size group 1
  vector[D] x_g1[N_g1]; // covariates group 1
  vector[K] cvec_g1[N_g1]; // demographic covariates group 1
  real y_g1[N_g1]; // response group 1
  // group 2
  int<lower=1> N_g2; // sample size group 2
  matrix[D, 2] x_g2[N_g2]; // covariates group 2
  vector[K] cvec_g2[N_g2]; // demographic covariates group 2
  vector[2] y_g2[N_g2]; // response group 2
  // group 3
  int<lower=1> N_g3; // sample size group 3
  matrix[D, 6] x_g3[N_g3]; // covariates group 3
  vector[K] cvec_g3[N_g3]; // demographic covariates group 3
  vector[6] y_g3[N_g3]; // response group 3
  // group 4
  int<lower=1> N_g4; // sample size group 4
  vector[D] x_g4[N_g4]; // covariates group 4
  vector[K] cvec_g4[N_g4]; // demographic covariates group 4
  real y_g4[N_g4]; // response group 4
  // demographic (covariance-matrix) group index per respondent
  int<lower=1> C_g1[N_g1];
  int<lower=1> C_g2[N_g2];
  int<lower=1> C_g3[N_g3];
  int<lower=1> C_g4[N_g4];
}
parameters {
  vector[K] alpha; // intercepts
  matrix[D, T] beta; // regression coefficients
  vector[D] mu_d; // mean of beta
  vector<lower=0>[K] sigma_alpha; // scale of demographic coefficients
  vector<lower=0>[D] sigma_beta; // scale of non-demographic coefficients
  cholesky_factor_corr[2] group2cor[32]; // correlations for people observed twice
  cholesky_factor_corr[6] group3cor[32]; // correlations for people observed six times
  vector<lower=0>[2] sigmag2[32];
  vector<lower=0>[6] sigmag3[32];
  real<lower=0> group1var[32];
  real<lower=0> group4var[32];
}
transformed parameters {
  cholesky_factor_cov[2] covg2[32];
  cholesky_factor_cov[6] covg3[32];
  for (k in 1:32) covg2[k] = diag_pre_multiply(sigmag2[k], group2cor[k]);
  for (k in 1:32) covg3[k] = diag_pre_multiply(sigmag3[k], group3cor[k]);
}
model {
  // priors
  sigma_alpha ~ normal(0, 1);
  sigma_beta ~ normal(0, 1);
  alpha ~ normal(0, sigma_alpha);
  mu_d ~ normal(0, 1);
  for (i in 1:D)
    beta[i, :] ~ normal(mu_d[i], sigma_beta[i]);
  for (i in 1:32) {
    group2cor[i] ~ lkj_corr_cholesky(2);
    group3cor[i] ~ lkj_corr_cholesky(2);
    sigmag2[i] ~ normal(0, 1);
    sigmag3[i] ~ normal(0, 1);
    group1var[i] ~ normal(0, 1);
    group4var[i] ~ normal(0, 1);
  }
  // group 1: observed at T1 only
  for (i in 1:N_g1) {
    int c = C_g1[i];
    y_g1[i] ~ normal(dot_product(cvec_g1[i], alpha) + dot_product(x_g1[i], beta[:, 1]), group1var[c]);
  }
  // group 2: observed at T1 and T7
  for (i in 1:N_g2) {
    int c = C_g2[i];
    vector[2] mu2;
    mu2[1] = dot_product(cvec_g2[i], alpha) + x_g2[i][:, 1]' * beta[:, 1];
    mu2[2] = dot_product(cvec_g2[i], alpha) + x_g2[i][:, 2]' * beta[:, 7];
    y_g2[i] ~ multi_normal_cholesky(mu2, covg2[c]);
  }
  // group 3: observed at T2 through T7
  for (i in 1:N_g3) {
    int c = C_g3[i];
    vector[6] mu3;
    for (t in 1:6)
      mu3[t] = dot_product(cvec_g3[i], alpha) + x_g3[i][:, t]' * beta[:, t + 1];
    y_g3[i] ~ multi_normal_cholesky(mu3, covg3[c]);
  }
  // group 4: observed at T7 only
  for (i in 1:N_g4) {
    int c = C_g4[i];
    y_g4[i] ~ normal(dot_product(cvec_g4[i], alpha) + dot_product(x_g4[i], beta[:, 7]), group4var[c]);
  }
}

I think the short answer here is that you do have to worry about the divergences.

In general with divergences (and this case is no exception), the issue is that in the presence of divergences you have absolutely no assurance that you have explored the posterior sufficiently. It’s hypothetically possible that inference from your model closely approximates inference from the true posterior. It’s also hypothetically possible that inference from your model is wildly wrong.

In fact you can never be completely assured (except in a smallish subset of well-studied models, or possibly via extensive simulation-based calibration) that your posterior inference is correct, but divergences are always a huge red flag that there are probably regions of the posterior that you have failed to explore adequately. Because you haven’t explored them adequately, you don’t know how big or important those regions are, or indeed whether there are still more parts of the posterior where the geometry is fine but which are invisible to you because they sit “behind” the parts of the posterior where the model diverges.

Noncentering seems like a reasonable place to start to see if you can squash the divergences. Working with a simplified model for a subset of the data might also help.
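For the regression coefficients in the posted model, non-centering means replacing `beta` with standard-normal innovations that are scaled and shifted deterministically, so the sampler never sees the funnel between `mu_d`, `sigma_beta`, and `beta`. A minimal sketch (the name `beta_raw` is a new helper parameter, not part of the original model):

```
parameters {
  matrix[D, T] beta_raw; // standard-normal innovations
  vector[D] mu_d;
  vector<lower=0>[D] sigma_beta;
}
transformed parameters {
  matrix[D, T] beta;
  for (i in 1:D)
    beta[i, :] = mu_d[i] + sigma_beta[i] * beta_raw[i, :];
}
model {
  to_vector(beta_raw) ~ std_normal();
  mu_d ~ normal(0, 1);
  sigma_beta ~ normal(0, 1);
}
```

The likelihood then uses the transformed `beta` exactly as before; the same pattern applies to `alpha` and `sigma_alpha`.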


Just chiming in to support @jsocolar’s opinion that yes, you need to worry about the divergences. I do tons of repeated-measures work and non-centered parameterizations typically do perfectly fine, so give that a try.


When the data is sparse you need strong priors.
Let’s look at this part of the model:

parameters {
  vector<lower=0>[2] sigmag2[32];
}
model {
  for (i in 1:32)
    sigmag2[i] ~ normal(0, 1);
}

With a total of 64 sigmas drawn independently, the prior predicts that the largest sigmag2 is often over 2.5 and the smallest is often less than 0.02. Is this spread reasonable?
You should add hierarchical structure so that the groups can borrow information from each other. Something like
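That claim is easy to check with a quick prior-predictive simulation; here is a sketch in Python (pure standard library), drawing 64 independent half-normal(0, 1) values many times:

```python
import random

random.seed(1)
n_sims = 10_000
max_over = 0   # simulations where the largest sigma exceeds 2.5
min_under = 0  # simulations where the smallest sigma is below 0.02
for _ in range(n_sims):
    # 64 independent half-normal(0, 1) draws, one per sigma
    draws = [abs(random.gauss(0, 1)) for _ in range(64)]
    max_over += max(draws) > 2.5
    min_under += min(draws) < 0.02
print(max_over / n_sims)   # P(largest sigma > 2.5), roughly 0.55
print(min_under / n_sims)  # P(smallest sigma < 0.02), roughly 0.64
```

So under this prior, more often than not some group gets a residual scale over 2.5 while another gets one under 0.02, which is the spread worth questioning.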

parameters {
  vector<lower=0>[2] sigmag2_population;
  vector<lower=0>[2] sigmag2[32];
}
model {
  sigmag2_population ~ normal(0, 1);
  for (i in 1:32)
    sigmag2[i] ~ lognormal(log(sigmag2_population), 0.1);
}

I’ve added the hierarchical structure to the priors on the variances as @nhuurre suggests, and I am working on applying a non-centered parameterization. Should I use the non-centered parameterization on the likelihood, on just the normal priors, or on both? Is there a particular sign (r_hat, n_eff, something on the pairs plot) that a parameter is problematic? I’m guessing the (co)variances are the problem, because I know the data is sparse for the covariance matrices, but I’m not really sure.

For guidance on how to find the source of your divergences and resolve the underlying problem see Hierarchical Modeling. Keep in mind that centering/non-centering has to be done on a context-by-context basis (keep any contexts with strongly informative likelihood functions centered) and that degeneracies between the population parameters (population location, scale, and correlations if working with multivariate hierarchies) can be resolved only with more information (more data or stronger priors on those population parameters).
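As a concrete example, the lognormal hierarchical prior on the scales suggested earlier in the thread can itself be non-centered by moving the multiplicative noise into a deterministic transform. A sketch (`sigmag2_raw` is a new helper parameter introduced here):

```
parameters {
  vector<lower=0>[2] sigmag2_population;
  matrix[2, 32] sigmag2_raw; // standard-normal innovations
}
transformed parameters {
  vector<lower=0>[2] sigmag2[32];
  for (i in 1:32)
    sigmag2[i] = sigmag2_population .* exp(0.1 * sigmag2_raw[:, i]);
}
model {
  sigmag2_population ~ normal(0, 1);
  to_vector(sigmag2_raw) ~ std_normal();
}
```

This implies exactly sigmag2[i] ~ lognormal(log(sigmag2_population), 0.1), but the sampler works with unconstrained standard normals rather than with scales whose prior width depends on another parameter.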