Divergences and Quantifying HPD intervals

I’ve been running a Stan program for a rather odd model: a hierarchical model of different experiments which can individually pass, returning a strongly peaked likelihood function, or fail, returning a flat likelihood function. For each experiment, a failure probability is precomputed (treated as data), and the posterior ends up being a mixture of the posteriors generated by the peaked (pass) likelihood and the flat (fail) likelihood.

Because of the odd model setup, if most of the data have a high failure probability, then the posterior distribution contains plateaus in its tails, which I think cause divergences (and, in extreme cases, low E-BFMI) in the Stan model. However, intuitively, the results this model produces with the data I have seem to be “correct” and have low

If I use data which produce divergences, the overall shape and median of the posterior are generally unaffected, but the 95% HPD may be significantly affected by the uneven distribution of samples in these tails. Often this isn’t an issue (if the HPD is just very large, the inference is that more data need to be collected, so the values don’t need to be particularly accurate), but it would be nice to quantify the uncertainty in the 95% HPD itself, as there are some intermediate cases. Perhaps there is something I could do in the generated quantities block to calculate percentiles? Or perhaps the most naive approach would just be to calculate the difference in percentiles between chains?

Sorry it took a little while to get a response!

Hmm, that sounds more like it would result in hitting max treedepth. Divergences are associated with regions where the curvature changes drastically, more than with plateaus (so maybe in this case the issue is the transition from the plateau to the region of high curvature?). @betanalpha is the top expert on this, so if I’m wrong he may be able to clarify further.

Do you mean compute the percentiles of the posterior distribution or something else? To compute the percentiles of the posterior you can do that as a post-processing step using the quantile function in R or the equivalent in whatever platform you’re using. You need to do it after fitting the model because it requires all the samples, while the generated quantities block is giving you access to a single iteration at a time. But I may be totally misunderstanding what you’re trying to do, so please let me know!
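For example, a minimal post-processing sketch (Python/NumPy here, but R’s `quantile` works the same way; the draws below are simulated as a stand-in, not from a real fit):

```python
import numpy as np

# Hypothetical posterior draws: 4 chains x 25000 iterations for one parameter.
# A Student-t(3) stands in for a heavy-tailed posterior.
rng = np.random.default_rng(0)
draws = rng.standard_t(df=3, size=(4, 25000))

# Pooled percentiles across all chains (done after fitting, not in Stan itself).
q_lo, q_hi = np.quantile(draws, [0.025, 0.975])

# Per-chain percentiles: their spread is a crude gauge of how stable
# the interval endpoints are.
per_chain = np.quantile(draws, [0.025, 0.975], axis=1)  # shape (2, n_chains)
endpoint_spread = per_chain.max(axis=1) - per_chain.min(axis=1)
print(q_lo, q_hi, endpoint_spread)
```

The per-chain spread is exactly the “difference in percentiles between chains” idea from the question, just computed outside the Stan program.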

I do mean computing the percentiles of the posterior. The issue is that with a long-tailed distribution with plateaus in the tails, I find that Stan samples inefficiently from the plateaus, so the 95% HPD can vary quite significantly for heavy-tailed distributions. I think your point about the curvature is probably correct: the number of divergences seems to be correlated with how heavy-tailed the distribution is. For any of my models, there is a small but not infinitesimal probability everywhere within my prior, which can produce 1 or 2 divergences in 100,000 iterations even in light-tailed models, but with little difference in the result. This type of problem wouldn’t be an issue for calculating the percentile of a particular value close to the mean of the distribution (e.g. the probability of being less than some expected value), but it becomes problematic when looking at a 95% HPD.

I think that a posterior of this form might be best handled using numerical integration rather than MCMC. I’m looking at a hierarchical model with n+2 dimensions, but I think the challenging part is integrating over the hyperprior (which only has to be done once). Once I’ve marginalized out a scale hyperparameter, I think I can reduce the problem to n two-dimensional matrices of P(\theta_i,\mu|y_i), i.e. a mean hyperparameter and the individual parameters for the data. This may even be marginally faster than using MCMC, but I’m still using Stan to precompute failure probabilities for this model.
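As a sanity check of the grid idea, here’s a toy two-dimensional version for a single experiment, with made-up Gaussian pieces standing in for my actual likelihood and priors:

```python
import numpy as np

# Toy grid marginalization: posterior over (theta, mu) for one experiment.
theta = np.linspace(-5.0, 5.0, 400)
mu = np.linspace(-5.0, 5.0, 400)
T, M = np.meshgrid(theta, mu, indexing="ij")

y = 1.2
log_post = (-0.5 * (y - T) ** 2        # likelihood P(y | theta), unit scale
            - 0.5 * (T - M) ** 2       # hierarchy P(theta | mu), unit scale
            - 0.5 * (M / 3.0) ** 2)    # prior on mu

post = np.exp(log_post - log_post.max())
post /= post.sum()

# Summing out theta leaves this experiment's contribution to mu.
p_mu = post.sum(axis=0)
print(mu[np.argmax(p_mu)])  # near the analytic posterior mean of ~0.98
```

With Gaussian pieces the marginal can be done analytically, so the grid answer can be checked; the real model would just swap in the actual mixture likelihood on the same grid.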

Plateaus can cause stability issues through floating point instabilities. For example, a plateau that stretches out towards infinity will result in divergences around exp(700), where the range of double-precision floating point ends. Plateaus that stretch towards a finite boundary can cause problems because of the implicit Jacobian – this happens, for example, with a beta model that concentrates too strongly at 0 or 1.

That said I’m not sure why there would be a plateau in the posterior density function here provided that there are decent priors on the hierarchical population location and scale. There are decent priors, right…?

My guess is that this actually is a curvature problem due to the varying informativeness of the likelihood functions. In particular, the hierarchy for the parameters with peaked likelihood functions probably needs to be centered, while the hierarchy for the parameters with non-peaked likelihood functions probably needs to be non-centered.

This isn’t an issue of sampling efficiency. You’d see the same issues with Monte Carlo estimators that use exact samples – the problem is that the Monte Carlo and Markov chain Monte Carlo estimator variance scales with the true variance of the expectand (function whose expectation value is being taken). Heavier tails imply larger variances and noisier estimators.
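You can see this with exact samples. For example (Python sketch with simulated draws; the distributions are illustrative, not your model):

```python
import numpy as np

rng = np.random.default_rng(1)
n_reps, n_draws = 200, 2000

# Repeatedly estimate the 97.5% quantile from *exact* (non-MCMC) samples.
q_normal = [np.quantile(rng.standard_normal(n_draws), 0.975)
            for _ in range(n_reps)]
q_heavy = [np.quantile(rng.standard_t(df=3, size=n_draws), 0.975)
           for _ in range(n_reps)]

# The heavy-tailed quantile estimator is visibly noisier even with exact
# draws, so the interval instability is not an MCMC artifact.
print(np.std(q_normal), np.std(q_heavy))
```

The tail quantile estimator for the Student-t(3) draws has several times the standard error of the normal case at the same sample size.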


My “likelihood” looks something like P(y_i|\theta_i,pass_i)P(pass_i)+P(y_i|\theta_i,fail_i)(1-P(pass_i)). (P(pass_i) is a piece of prior information, but it’s a precomputed value, so it’s effectively treated as data in the Stan model.) Effectively, in a failed experiment, my y_i could have been produced by any poorly understood process and are likely to bias my estimate of my population location parameter \mu, so P(y_i|\theta_i,fail_i) is set to a constant value. The location and scale parameters for my hierarchy both have uniform priors with bounds, so in the worst-case scenario where P(pass_i)=0 for all i (failure rates for this experiment can be high, so this is not an impossible scenario), the posterior on the location parameter ought to be uniform too. From my results, the worst issues seem to arise when P(pass_i) is close to, but not exactly, 0 for all parameters, which leads to a likelihood function for \theta_i with one peak in the middle of a plateau. In these cases my chains end up with low E-BFMI and sometimes won’t converge (large Rhat).

You can roughly quantify the amount of information a set of experiments contains under this model using \sum_{i=1}^n P(pass_i), which seems to be a pretty good predictor of how well the sampler will do. The inference that more data need to be collected when this number is low remains the same, so it’s not essential that the sampler be highly accurate in these cases. However, I would like to know why this behavior occurs in case it causes problems for my better datasets. I think the curvature issue is probably what’s causing this; unfortunately, I don’t know whether it’s solvable for this model if that’s the case.

There are lots of modeling warning signs here – passing in P(\mathrm{pass}) in a two-stage inference instead of inferring it jointly with everything else, the abundance of uniform priors, etc – but I’m going to ignore those as they’re not the topic of conversation.

I think the main problem here is that when building a mixture model you can’t just define heuristic “likelihood functions” that don’t correspond to well-defined observational models, because the relative normalizations matter! Recall that a Stan program specifies the log “likelihood”,

\begin{align*} \log \pi(y_i \mid \theta_i^\mathrm{pass}, \theta_i^{\mathrm{fail}}) &= \log ( P_{i} \pi_{\mathrm{pass}}(y_{i} \mid \theta_{i}^\mathrm{pass}) + (1 - P_{i}) \pi_{\mathrm{fail}}(y_{i} \mid \theta_{i}^\mathrm{fail}) ) \\ &= \log ( P_{i} \pi_{\mathrm{pass}}(y_{i} \mid \theta_{i}^\mathrm{pass}) + \mathrm{const} ) \end{align*}

The value of the log likelihood will depend on the arbitrary value of \mathrm{const}. When \mathrm{const} is large, the contribution from \pi_{\mathrm{pass}} is suppressed, but when it’s exactly zero the log likelihood reduces to \log \pi_{\mathrm{pass}}, with all of the other constants dropping out.
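A quick numerical illustration (Python, with a hypothetical unit-normal pass likelihood; only the constant changes between runs):

```python
import numpy as np

def log_mixture(y, theta, p_pass, log_const):
    # log( p_pass * Normal(y | theta, 1) + (1 - p_pass) * exp(log_const) )
    lp_pass = -0.5 * (y - theta) ** 2 - 0.5 * np.log(2.0 * np.pi)
    return np.logaddexp(np.log(p_pass) + lp_pass,
                        np.log1p(-p_pass) + log_const)

# Same data and same p_pass, different arbitrary constants: the
# peak-to-plateau height of the "likelihood" in theta depends
# entirely on the constant.
thetas = np.linspace(-5.0, 5.0, 201)
ranges = {}
for c in (-10.0, 0.0, 2.0):
    lp = log_mixture(1.0, thetas, 0.5, c)
    ranges[c] = lp.max() - lp.min()
print(ranges)
```

The larger the constant, the flatter the function of theta becomes, so the “same” model gives qualitatively different geometry depending on an arbitrary choice.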

That said, the main pathology might actually be hiding elsewhere. If there is a hierarchy over the \theta_{i}, then the wacky shape of this heuristic likelihood function will probably require very delicate handling of the parameterization of the \theta_{i}: centering the few that have large P_{i} and non-centering those that have small P_{i}.

As always the most productive way forward will be to simplify your model. Try to fit the mixture without any hierarchical structure and then add the hierarchy back in only once you can fit the non-hierarchical mixture without problems.

This constant value is somewhat non-arbitrary. The calculation of P_i is a sort of model comparison, so there’s already some integration over the parameter space being done at the non-hierarchical stage. What I’m doing currently in the hierarchical model is ensuring that the whole “likelihood” function sums to 1 over the unpooled prior space (although I think it may not over the hierarchical prior space, because of the high-dimensional P(\theta_i|\mu,\sigma) term).

I think this may be the case. For a non-hierarchical model, I can fit things nicely in Stan, but introducing the hierarchy causes problems. Really, the whole reason for introducing the P_{i} term is that there is some uncertainty in the decision of whether to center \theta_i or not: doing so in cases of low P_{i} is likely to lead to bias, but leaving those experiments out entirely leads to loss of potentially useful information.

Perhaps I’m incorporating this in the wrong place, and I should be incorporating the mixture directly in the hierarchy, so that we end up with a P(\theta_i|\mu,\sigma,P_{i}) which could look like some kind of mixture between, e.g., a Gaussian and a uniform. This might lead to similar issues, but placing the mixture in the prior might lead to a better-behaved posterior. It would also avoid the normalization issue we discussed above, as I believe all priors in Stan use probability densities which integrate to one. I’m not sure exactly how I’d incorporate this type of prior in Stan; I imagine I would have to increment the target by the log of the weighted sum of the two prior densities, as I was doing for the “likelihood” previously.
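To make that concrete, here’s a sketch of what I have in mind for the mixture prior’s log density (Python, with a hypothetical Normal “pass” component and Uniform “fail” component; in Stan I believe the analogous increment would be something like `target += log_mix(P[i], normal_lpdf(theta[i] | mu, sigma), uniform_lpdf(theta[i] | lo, hi))`):

```python
import numpy as np

def log_mix_prior(theta, p_pass, mu, sigma, lo, hi):
    # log( p_pass * Normal(theta | mu, sigma)
    #      + (1 - p_pass) * Uniform(theta | lo, hi) )
    lp_normal = (-0.5 * ((theta - mu) / sigma) ** 2
                 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    lp_unif = -np.log(hi - lo)  # constant density on [lo, hi]
    return np.logaddexp(np.log(p_pass) + lp_normal,
                        np.log1p(-p_pass) + lp_unif)

# Crude check that the mixture density integrates to one over [lo, hi],
# since both components are properly normalised (unlike the earlier
# heuristic "likelihood"):
grid = np.linspace(-20.0, 20.0, 20001)
dens = np.exp(log_mix_prior(grid, 0.3, 0.0, 1.0, -20.0, 20.0))
total = dens.sum() * (grid[1] - grid[0])
print(total)  # close to 1
```

Because each component integrates to one, the arbitrary-constant problem from before doesn’t arise: the mixture weight P_i is the only thing controlling the balance between the two components.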

Ok, so I have significantly improved this model in the time since my last post, but I am still having problems with divergences. Instead of using a system where individual experiments can either pass or fail, I attempt to estimate the amount of bias that failure causes using another fitted parameter. I assume that one of the parameters I fit to the data (call it \theta) is being biased away from the hierarchical mean by another parameter I fit to the data (call it \nu). I’m pretty happy with my likelihood given these parameters, and we then have a top-level set of hyperparameters such that:
\theta\sim \mathrm{Normal}(\beta_0+\nu\beta_1,\sigma)
where \beta_0 is the parameter of interest.

I want to put uninformative priors on both \beta_1 and \sigma, as I have little information about what these parameters should be. I know that the improper prior P(\beta,\sigma)\propto 1/\sigma ought to lead to a proper posterior if n>2, according to Gelman’s book. To implement this in Stan, I simply don’t explicitly define a prior on \beta_1, and I parameterize \sigma as \log \sigma without explicitly defining a prior on it either. I have a proper prior on \beta_0, as I have a much better grasp of the range of values that parameter can actually take.
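As a quick change-of-variables check that leaving \log \sigma flat really is the 1/\sigma prior (the finite window below is just for the simulation; the improper prior has no bounds):

```python
import numpy as np

rng = np.random.default_rng(2)

# A flat density on log(sigma) over a finite window stands in for the
# improper prior; the change of variables gives p(sigma) proportional
# to 1/sigma.
log_sigma = rng.uniform(np.log(0.1), np.log(10.0), size=1_000_000)
sigma = np.exp(log_sigma)

# Under p(sigma) ∝ 1/sigma, equal multiplicative ranges get equal mass:
# P(0.1 < sigma < 1) should equal P(1 < sigma < 10).
counts, _ = np.histogram(sigma, bins=[0.1, 1.0, 10.0])
print(counts / sigma.size)  # both close to 0.5
```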

Unfortunately, if n is only just above 2 (e.g. I have some datasets where n=3), this performs really poorly, with divergences and in some cases very low n_eff and large Rhat. This doesn’t occur in a model where my data are unpooled, so it’s clearly to do with these priors (and probably the high curvature of the posterior when n=3). Interestingly, if I use a proper prior on \sigma (e.g. a half-normal with large standard deviation), the model does significantly better (though there are still some divergences). However, I’m not certain that this leads to a proper posterior given the improper prior on \beta_1, so it might cause problems.

Any suggestions for parameterizations or priors for good sampling with models of this type? The desired result when n=3 is that uncertainties should be large in any case, as the parameters should be only weakly identified by the data, but I would prefer to avoid divergences if possible.

Edit: I tried the non-centered parameterization for hierarchical models detailed here. This seemed to reduce the overall number of divergences (although not remove them completely) and increase n_eff, but it also led to some iterations hitting max treedepth, which made the model run significantly slower overall.
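For anyone finding this later, the transform in question amounts to the following (a minimal sketch with made-up hyperparameter values, not my actual model):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up hyperparameter values standing in for a single posterior draw.
beta0, beta1, sigma = 0.5, -0.2, 1.3
nu = rng.normal(size=1000)

# Non-centered parameterization: sample a standard-normal "raw" parameter
# and shift/scale it, rather than sampling
#     theta ~ Normal(beta0 + nu * beta1, sigma)
# directly. The sampler then sees a unit-scale geometry regardless of sigma.
theta_raw = rng.standard_normal(1000)
theta = beta0 + nu * beta1 + sigma * theta_raw

resid = theta - (beta0 + nu * beta1)
print(resid.mean(), resid.std())  # roughly 0 and sigma
```

The implied distribution of theta is identical to the centered version; only the geometry the sampler explores changes, which is why it trades divergences for treedepth in my case.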