Divergent transitions in hierarchical logit

Dear Stan community,

I am trying to fit a hierarchical logit model while I learn to use Stan, and I get the “divergent transitions” warning. I have read the Stan manual, the guide to Stan’s warnings, other posts, and Michael Betancourt’s case study on divergences and bias.

I am attaching below (and uploading) my Stan code, hlg.stan (497 Bytes), and a CSV with the dataset, data_stan.csv (12.2 KB).

So basically, all (or almost all) of the transitions are diverging. Following Michael’s advice I tried:

• Longer chains.
• Increasing the adapt_delta parameter as close to 1 as possible.

stan("hlg.stan", data = data_stan, init_r = 0.5, control = list(adapt_delta = 0.99999), iter = 11000, warmup = 10000)

I am not sure how to reparameterize the model in this case, even though the reparameterization section of the manual suggests that bernoulli_logit is a good candidate for reparameterization. However, I don’t see clearly how to do this in this setting.

Best wishes,
Nico

data {
int<lower=1> N;
int<lower=1> J;
int<lower=0,upper=1> y[N];
int<lower=1,upper=J> ll[N];
vector<lower=1>[N] t;
vector<lower=0>[N] pa;
}
parameters {
real alpha;
vector[J] beta;
vector[J] gamma;
}

model {
vector[N] x_beta_ll;
alpha ~ uniform(-5,5);
beta  ~ uniform(-0.5,0.5);
gamma ~ uniform(-0.5,0.5);
for (n in 1:N)
x_beta_ll[n] = alpha + beta[ll[n]]*t[n] + gamma[ll[n]]*pa[n];
y ~ bernoulli_logit(x_beta_ll);
}


Stan struggles with probabilities that evaluate to 0 or 1, and with uniform distributions, especially over very constrained ranges (e.g. your beta and gamma). The reason is that flat priors are more difficult to regularise and therefore induce geometries that are harder to explore. The issue in your model is the choice of priors. Here is a general piece about prior choice in Stan.

You can get very similar constraints to yours using the normal distribution. Here’s a rewritten version that should give you what you want, following the advice above.

data {
int<lower=1> N;
int<lower=1> J;
int<lower=0,upper=1> y[N];
int<lower=1,upper=J> ll[N];
vector<lower=1>[N] t;
vector<lower=0>[N] pa;
}
parameters {
real alpha;
vector[J] beta;
vector[J] gamma;
}
model {
alpha ~ normal(0,5);
beta  ~ normal(0,0.5);
gamma ~ normal(0,0.5);
for (n in 1:N)
y[n] ~ bernoulli_logit( alpha + beta[ll[n]]*t[n] + gamma[ll[n]]*pa[n] );
}


The other thing that MCMC with limited iterations is generally sensitive to is the starting point. Stan also has the optimizing and vb (variational Bayes) functions. You can use these to get decent starting values for your model, which can greatly reduce time to convergence. If you’re using optimizing (L-BFGS under the hood), make sure to run it at least a few dozen times on a dataset of your size and pick the MAP with the best log posterior to start from.
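As a rough sketch (not the exact code I used), “run optimizing several times and keep the best MAP” can look like this in R, assuming M is an already compiled stan_model object and d is the data list:

# Sketch: run rstan::optimizing() repeatedly from random starts and
# keep the run with the highest log posterior; optimizing() returns
# the log posterior in $value and the point estimate in $par.
fits <- lapply(1:30, function(i) optimizing(M, data = d, seed = i))
best <- fits[[which.max(sapply(fits, function(f) f$value))]]
inits <- function()
  list(alpha = best$par[1],
       beta  = best$par[2:31],
       gamma = best$par[32:length(best$par)])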

Here are the results with 4 chains and just 500 iterations (it looks like you need more).

                mean se_mean   sd       2.5%        25%        50%        75%      97.5% n_eff Rhat
alpha           0.79    0.00 0.01       0.77       0.78       0.79       0.80       0.81   266 1.01
beta[1]         0.00    0.00 0.01      -0.01       0.00       0.00       0.01       0.02   808 1.00
beta[2]         0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01   839 1.01
beta[3]         0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01   842 1.00
beta[4]         0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02   828 1.00
beta[5]         0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01   805 1.01
beta[6]         0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01  1110 1.00
beta[7]         0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01  1008 1.00
beta[8]         0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01  1104 1.00
beta[9]         0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01   871 1.00
beta[10]        0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01   908 1.00
beta[11]        0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01  1532 1.00
beta[12]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01  1125 1.00
beta[13]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01   839 1.01
beta[14]        0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01   892 1.00
beta[15]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01   960 1.00
beta[16]        0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01   841 1.00
beta[17]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01  1075 1.00
beta[18]        0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01   613 1.01
beta[19]        0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01   957 1.00
beta[20]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01  1102 1.00
beta[21]        0.00    0.00 0.01      -0.01       0.00       0.00       0.01       0.02   755 1.01
beta[22]        0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02   803 1.00
beta[23]        0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01  1045 1.00
beta[24]        0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01  1013 1.01
beta[25]        0.00    0.00 0.00      -0.01       0.00       0.00       0.00       0.01  1043 1.00
beta[26]        0.00    0.00 0.01      -0.01       0.00       0.00       0.01       0.02   930 1.00
beta[27]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01   829 1.00
beta[28]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01   917 1.00
beta[29]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01   678 1.00
beta[30]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01   866 1.00
gamma[1]        0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02   967 1.00
gamma[2]        0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1489 1.00
gamma[3]        0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1010 1.00
gamma[4]        0.00    0.00 0.02      -0.03      -0.01       0.00       0.01       0.03   927 1.00
gamma[5]        0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01  1456 1.00
gamma[6]        0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1379 1.00
gamma[7]        0.00    0.00 0.01      -0.01      -0.01       0.00       0.00       0.01  1796 1.00
gamma[8]        0.00    0.00 0.01      -0.01       0.00       0.00       0.01       0.01  1539 1.00
gamma[9]        0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1066 1.00
gamma[10]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1167 1.00
gamma[11]       0.00    0.00 0.01      -0.01       0.00       0.00       0.01       0.01  2377 1.00
gamma[12]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.00       0.02  1240 1.00
gamma[13]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1032 1.00
gamma[14]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.01   981 1.00
gamma[15]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1118 1.00
gamma[16]       0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01  1062 1.00
gamma[17]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1218 1.00
gamma[18]       0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01   872 1.00
gamma[19]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1543 1.00
gamma[20]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1390 1.00
gamma[21]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02   892 1.00
gamma[22]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02   944 1.00
gamma[23]       0.00    0.00 0.01      -0.01       0.00       0.00       0.00       0.01  1417 1.00
gamma[24]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1208 1.00
gamma[25]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1516 1.00
gamma[26]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02   981 1.00
gamma[27]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02   974 1.00
gamma[28]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02  1055 1.00
gamma[29]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02   823 1.00
gamma[30]       0.00    0.00 0.01      -0.02      -0.01       0.00       0.01       0.02   980 1.00
lp__      -349169.63    0.35 5.72 -349181.79 -349173.22 -349169.50 -349165.69 -349158.96   265 1.02

Samples were drawn using NUTS(diag_e) at Mon Sep 30 11:55:12 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Warning messages:
1: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
2: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess


End-to-end code in R if you want to replicate the results:

require(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

model_str="
data {
int<lower=1> N;
int<lower=1> J;
int<lower=0,upper=1> y[N];
int<lower=1,upper=J> ll[N];
vector<lower=1>[N] t;
vector<lower=0>[N] pa;
}
parameters {
real alpha;
vector[J] beta;
vector[J] gamma;
}
model {
alpha ~ normal(0,5);
beta  ~ normal(0,0.5);
gamma ~ normal(0,0.5);
for (n in 1:N)
y[n] ~ bernoulli_logit( alpha + beta[ll[n]]*t[n] + gamma[ll[n]]*pa[n] );
}
"

M=stan_model(model_code=model_str)
x=read.csv("data_stan.csv")  # the attached dataset
d=list(N=NROW(x), J=30, y=x$y, ll=x$ll, t=x$t, pa=x$pa)

o=optimizing(M,data=d,verbose=T)$par
inits=function()
list(alpha=o[1], beta=o[2:31], gamma=o[32:length(o)])

sampling(M,data=d,iter=500,init=inits)


Unlikely to help. The MAP is not in the typical set, and hence you still need to warm up in order to converge to the typical set. The hope of this approach is that the MAP is much closer to the typical set than an arbitrary starting point, but that’s unlikely to be true in moderate to high dimensions. Regardless, HMC finds the typical set really fast – typically within O(10) iterations. The bulk of warmup is spent performing initial explorations to inform the adaptation algorithms, and a better starting point doesn’t help that process.


Thank you so much @emiruz. I clearly did not pay attention to that (and I think I did not recognise it as advice for the divergent-transitions warning). My code works now. (Thanks also to @betanalpha for the follow-up discussion.)

This model might also benefit from doing the QR trick on the individual regressions, but I’m not sure if that’s commonly done in multi-level regressions. @bgoodri is it common to use the QR trick in multi-level regressions or is that typically only done on single level regressions?

It can be done for both flat and hierarchical models, although in rstanarm we only do it for the \mathbf{X} matrix in the hierarchical case.
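For reference, the QR version for a single design matrix X looks roughly like this (a sketch following the QR reparameterization section of the Stan User’s Guide; qr_thin_Q / qr_thin_R require a recent Stan version, and the variable names here are illustrative):

data {
int<lower=1> N;
int<lower=1> K;
matrix[N, K] X;
int<lower=0,upper=1> y[N];
}
transformed data {
// thin QR decomposition, scaled so the coefficients stay on a sensible scale
matrix[N, K] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
matrix[K, K] R_ast_inverse = inverse(qr_thin_R(X) / sqrt(N - 1));
}
parameters {
real alpha;
vector[K] theta;  // coefficients on the Q_ast scale
}
model {
alpha ~ normal(0, 5);
theta ~ normal(0, 1);
y ~ bernoulli_logit(alpha + Q_ast * theta);
}
generated quantities {
vector[K] beta = R_ast_inverse * theta;  // back on the original X scale
}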
