# Divergent Transitions with R_hat < 1.1

Hi,

I am trying to fit a model in stan, which is producing around 0.6% - 1% divergent transitions, but my coefficients are all significant with Bayes R_hat < 1.1. I even verified the coefficient plot using shinystan which shows all values are within range. So I am bit inquisitive about these divergences and how are these detected/calculated. I tried to fit the same model by writing my own code using gibbs sampling, but I am not sure how to calculate these divergences.

Rhat is a very weak diagnostic, so this behavior is not surprising. See https://betanalpha.github.io/assets/case_studies/divergences_and_bias.html for much more.

Thank you for your suggestion betanalpha. I tried increasing my adapt delta to 0.9999 and higher and also tried different step sizes but that is not solving the divergence issues. Also for my problem its difficult to reparametrize since its a multivariate problem.

‘’’
data {

int<lower=1> N;
int<lower=1> K;

matrix[N, K] x;
vector[K] zero_vec;
matrix[K, K] I_mat;

vector[N] Y_fit;

}

parameters {

//beta
vector alpha;
vector<lower=0, upper=1> [K] beta;

real mu_alpha;
vector<lower=0> [K] mu_beta;

real<lower=0> sigma_alpha;

cov_matrix [K] Var_Cov;
cov_matrix [K] Var_Cov_mu;

}

model {

mu_alpha ~ student_t(1, 2, 2);
sigma_alpha ~ cauchy(0, 2.5);

// Prior

beta ~ multi_normal(zero_vec, Var_Cov);
mu_beta ~ multi_normal(zero_vec, Var_Cov_mu/0.01);

Var_Cov ~ inv_wishart(K, I);
Var_Cov_mu ~ inv_wishart(K, I);

alpha ~ normal(mu_alpha, sigma_alpha);
sigma ~ normal(0,100);

for(i in 1:N){
Y_fit[i] ~ normal(alpha[i] + beta_std[i, 1] * x[i, 1] + beta_std[i, 2] * x[i, 2]
+ beta_std[i, 3] * x[i, 3], sigma);

}

}

‘’’

If you get that many divergences there is reason to investigate reasons for divergences, despite any Rhat values.

• In the case study @betanalpha linked, all the runs are only with 1 chain which makes it even more challenging for Rhat (but divergences are real also with just 1 chain).
• There is a new Rhat described in https://arxiv.org/abs/1903.08008. Which Rhat are you using?
• Old Rhat can fail and give values close to 1 if the distribution doesn’t have finite mean and variance. New Rhat works in those cases, too. See an example in the paper.
• 1.1 is ad-hoc threshold and even if we recommend stricter ad-hoc threshold in the new Rhat paper, it is better to look effective sample sizes and Monte Carlo standard errors as they are more interpretable. Rhat thresholds are useful for quick check (given no divergences)
• New Rhat, effective sample size and MCSE with multiple chains often detect also problems when there are divergences as demonstrated in the paper, but we also demonstrate in the online appendix https://avehtari.github.io/rhat_ess/rhat_ess.html that given a long enough diverging chains, the chains can sample from the same biased (truncated) target and thus new Rhat, effective sample size and MCSE are not able to detect problems while divergences (and sticking) are still strongly indication problems.

Hi Avehtari,

Thanks for your response. I am using the R_hat provided by stan in the summary statistics. For my case I am running say 2000 iterations with 500 warmup in 2 chains which is producing 20-40 divergence. So how critical is this for my model and coefficients?

We recommend at least 4 chains…

… although increasing the number of chains doesn’t help for divergences. That many divergences are critical in that sense that it is very difficult to know how wrong your estimates given those posterior draws are.

The model you posted doesn’t make sense. It has beta_std = mu_beta* const;, but const is not defined.

Hi Avehtari,

Sorry I had some typing mistake in the previous code, to be precise below is the code I am using

data {
int<lower=1> J;
int<lower=1> N;
int<lower=1> K;
matrix[N, K] x;
vector[K] zero_vec;
matrix[K, K] I_mat;
vector[N] Y_fit;
vector[N] b1;
vector[N] b2;
vector[K] up_b;

}

parameters {
vector alpha;
vector<lower=0, upper=1> [K] beta [J];
vector [J] beta_b1;
vector [J] beta_b2;
real mu_alpha;
vector [K] mu_beta;

real mu_beta_b1;
real mu_beta_b2;
real<lower=0> sigma_beta_b1;
real<lower=0> sigma_beta_b2;

real<lower=0> sigma_alpha;
cov_matrix [K] Var_Cov;
cov_matrix [K] Var_Cov_mu;
real<lower=0> sigma_y ;

}

transformed parameters {

vector [K] beta_std[J];
vector [J] beta_b1_std;
vector [J] beta_b2_std;

for (j in 1:J) beta_std[j] = beta[j] .* up_b ;

beta_b1_std = mu_beta_b1 + sigma_beta_b1 * beta_b1;
beta_b2_std = mu_beta_b2 + sigma_beta_b2 * beta_b2;
}

model {

mu_alpha ~ student_t(1, 2, 2);
sigma_alpha ~ inv_gamma(3,0.5);

// Prior
beta ~ multi_normal(mu_beta, Var_Cov);
mu_beta ~ multi_normal(zero_vec, Var_Cov_mu/0.01);

Var_Cov ~ inv_wishart(K,  I_mat);
Var_Cov_mu ~ inv_wishart(K,  I_mat);

alpha ~ normal(mu_alpha, sigma_alpha);
mu_beta_b1 ~ normal(0,10);
mu_beta_b2 ~ normal(0,10);

sigma_beta_b1 ~ inv_gamma(3,0.5);
sigma_beta_b2 ~ inv_gamma(3,0.5);

beta_b1 ~ std_normal();
beta_b2 ~ std_normal();

sigma_y ~ inv_gamma(3,0.5);

for(i in 1:N){
reg_log[i] ~ Y_fit[i] ~ normal(alpha[adj[i]] + beta_std[adj[i], 1] * x[i, 1] + beta_std[adj[i], 2] * x[i, 2]
+ beta_std[i, 3] * x[i, 3] + beta_b1_std[adj[i]] * b1[i]
+ beta_b2_std[adj[i]] * b2[i] , sigma_y);
}

}


As you can see I am using conjugate priors and the divergences that I am getting is due to the bounds/constraints that is used. I tried running this on 4 chains, 2000 iterations and 500 warmup which resulted in 33 divergence transitions.

Any help on how to solve this issue is appreciated.

Thank you.

You should definitely check out this part of the users guide. I bet using the non-centered parameterization will make sampling go faster and get rid of divergences.

See also Vats and Knudson, 2018 for a more rigorous threshold for Rhat.

1 Like

As @Max_Mantei notes the model is not properly non-centered. See the link to the Users Guide that he posted and continue down to the section on multivariate non-centering.

Vats and Knudson propose to choose Rhat going back from ESS, where desired minimum ESS is chosen by selecting "\epsilon —the relative volume of the confidence region—is akin to the
half-width of a confidence interval in sample size calculations for one sample t-tests." The selection of of specific value for \epsilon is also ad-hoc. We could do similar backward computation from any desired MCSE to (a certain version of) Rhat, but for me it seems easier to just focus then directly on MCSE and for that the user might have principled information what is the desired accuracy. Furthermore Vats and Knudson use only one chain, which makes Rhat weak. Also in my experiments the lugsail estimator they use for autocorrelation time is much worse than the currently used Geyer’s initial positive sequence rule, especially for shorter chains and for antithetic chains.

I concur with the view that working on the MCSE scale (i.e. the actual scale) is better. While that’s certainly true for statisticians, practioners often want assurances their chains have converged. We know that’s not quite how that works, but this doesn’t remove the need for better (safer) cutoffs. In any case, the general recommendation to adopt Rhat < 1.01 seems to me to be a good one. Notice that there’s nothing stopping people from using your new Rhats with Vats and Knudson’s recommendation, even if they weren’t derived with your estimator in mind. The intention here was to provide a further reference for the OP to check for themselves.

Divergences are a diagnostic unique to HMC, so you couldn’t calculate them for Gibbs. The Gibbs sampler would very likely still struggle with the problematic regions of the posterior that lead to the divergences, you just wouldn’t have a diagnostic to help detect it.

Hi,

I tried fitting using non centered parameterization, but due to the constraints on variables I was getting error. But changing the prior distribution for some parameters worked for me.

Thank you

I’m thinking that maybe you should avoid the inverse-gamma and inverse-Wishart priors as they can cause problems. Also that multiplication by 100 looks iffy and the sigma ~ normal(0,100) looks like it could be too weak. More informative priors can sometimes help with convergence.