EBFMI (very) low, looking for reparameterization advice

While I can’t say for sure that very small heterogeneities are impossible in principle here, I would definitely feel confident saying that small values of \sigma_r (say, lower than 0.1) are not very likely, and that a value of 0.1 is a lot more likely than a value of 0.01. I have seen (and think I mostly understand) the argument that the expectation of a prior is more important than its mode, and that this is why a half-normal is a reasonable choice for a prior on a standard deviation.

On the other hand, with half-normal priors on the sigmas I had a lot of divergences (>10% of post-warmup iterations), and pairs plots showed that most of these divergences happened at values of \sigma_r very close to zero. With Weibull priors I had no divergences.
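For concreteness, the two prior choices I’m comparing look like this in Stan (the scale of 1 here is just a placeholder, not the value I actually used):

```stan
parameters {
  real<lower=0> sigma_r;
}
model {
  // half-normal prior: density is flat and nonzero at sigma_r = 0
  sigma_r ~ normal(0, 1);
  // the alternative I tried, Weibull with shape 2: density -> 0 as sigma_r -> 0
  // sigma_r ~ weibull(2, 1);
}
```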

I get the sense that there are different approaches to priors, and that sometimes we pick priors to avoid sampling difficulties rather than to express previous knowledge or to follow some principled approach to the lack of previous knowledge. I guess it comes down to whether I’m willing to compromise on principle for the practical benefit of getting an answer. I don’t really have a good sense, though, of how much doubt it casts on my answer if the priors were selected on the basis that they permit an answer at all.

And back to the model in the first post, its low EBFMI, and the correlation between sigma_r and Energy__:

I don’t think I looked at logdrt vs sigma_r. Since I don’t really care about logdrt itself, I’m fine integrating it out, but I’d be willing to go back and look at the other model. If I did observe a funnel between sigma_r and one or more logdrt, what would be the next step?

The divergences at low values of the standard deviation suggest that the sampler was unable to explore this region of parameter space properly. When you introduced the zero-suppressing prior, you effectively told the sampler “don’t bother exploring this region.” Either way, the sampler doesn’t explore that region of parameter space. If you’re confident that this region is unimportant, then that’s fine. However, the very fact that the sampler was interested in visiting it under the half-normal prior suggests that the data don’t contain information strong enough to confirm its unimportance.

But from where are you getting that presumed scale?

A philosophical argument that the variation/heterogeneity can’t be exactly zero doesn’t place any restriction on variation/heterogeneity that is so small that it is practically indistinguishable from zero. Those small but non-zero model configurations will introduce plenty of degenerate geometry even if zero variation/heterogeneity has been excluded. For example, if the model has been parameterized such that 1 is the natural scale for the variation/heterogeneity, then a prior that suppresses values below 10^{-6} technically excludes complete homogeneity but won’t suppress enough of the funnel geometry to avoid computational problems.
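To make that concrete with a generic centered hierarchical component (not this particular model), \theta \sim \text{normal}(\mu, \sigma), the joint log density includes the terms

$$
-\frac{(\theta - \mu)^2}{2 \sigma^2} - \log \sigma,
$$

whose curvature in the \theta direction scales like 1 / \sigma^{2}. That curvature blows up as \sigma shrinks, whether \sigma is 10^{-2} or 10^{-6}, so excluding exactly zero does nothing to tame it.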

Explicit lower scales for variation/heterogeneity of a measurement process, population, etc are tricky to elicit, especially outside of the context of previous measurements such as calibration of detectors. Upper scales are much easier to elicit because they’re bounded by the total overall magnitude of the behavior of interest.

Yes, if one can do the work to elicit a principled lower scale for the variation/heterogeneity, and if that scale is large enough to suppress the worst of the funnel geometry, then one can avoid the problematic computation. Those are big ifs, however, and presenting a zero-suppressing prior as a cure-all will only encourage people to take shortcuts.

In my opinion it’s better to think of a prior by its containment. A half-normal prior approximately contains values between 0 and 2-3 times the scale parameter, with a slight tail that allows for leakage in case the scale parameter was elicited poorly.
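As a rough quantification of that containment for a half-normal with scale x:

$$
\mathbb{P}(\sigma \le x) \approx 0.68, \qquad
\mathbb{P}(\sigma \le 2x) \approx 0.95, \qquad
\mathbb{P}(\sigma \le 3x) \approx 0.997.
$$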

The question at hand here is whether or not the prior model should also contain away from values near zero and, if so, by how much.

Sure, but what precise quantification are you comfortable with? A prior that suppresses values below a lower scale of 0.01 might not actually be enough to avoid the computational problems.

Because the Weibull prior cuts off the bottom of the funnel. See, for example, Section 3 of Hierarchical Modeling. Section 3.2 discusses the particular degeneracy that is manifesting here, but the figures at the end of Section 3.1 also show what would happen with a zero-suppressing prior model for the population scale.

Validating a dangerous idea like “picking a prior to avoid sampling difficulties” is exactly why we have to be careful with the language we use when talking about prior modeling!

The fact that the sampler encountered those problematic model configurations in the first place means that they’re consistent with the observed data. If one can elicit domain expertise that happens to suppress problematic behavior then absolutely it can be incorporated into a model to great effect; see for example the discussion in Identity Crisis. But making up a prior model to avoid problematic behavior, without considering, or in outright conflict with, any meaningful domain expertise, typically leads to posterior inferences, predictions, and the like that also conflict with that domain expertise.

But what is the value of an “answer” if you don’t know what it means? Yes in practice we often have to compromise due to limited resources – in how complex of a model we can build, in how precise a prior we can elicit, in how faithfully we can computationally recover posterior inferences – but if we can’t actually quantify the tradeoffs of a given compromise then we don’t know what dangers we’re introducing.

Ignoring the constraint, log_d ~ normal(mu_r, sigma_r) is a normal hierarchical model, and normal hierarchical models are fundamentally prone to funnel-like geometries that cause divergences and E-FMI problems. For much more see again Hierarchical Modeling, especially Section 3.

Without the constraint the strategies mentioned in that case study would be applicable, but with the constraint there aren’t as many options for getting rid of the funnel geometry that causes E-FMI and divergence problems as \sigma_r approaches small values. A zero-suppressing prior may help, but the suppression has to reach large enough values. Analytic integration won’t be possible due to the constraint, and while numerical integration will be expensive it might work.
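For reference, the main strategy in that case study is a non-centered parameterization; a generic sketch, with placeholder priors and a placeholder observation model, and ignoring the constraint that rules it out here, looks like:

```stan
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu_r;
  real<lower=0> sigma_r;
  vector[N] log_d_raw;    // standardized latent values
}
transformed parameters {
  // implies log_d ~ normal(mu_r, sigma_r) without the direct
  // funnel-inducing coupling between log_d and sigma_r
  vector[N] log_d = mu_r + sigma_r * log_d_raw;
}
model {
  mu_r ~ normal(0, 1);      // placeholder prior
  sigma_r ~ normal(0, 1);   // half-normal via the <lower=0> constraint
  log_d_raw ~ std_normal();
  y ~ normal(log_d, 1);     // placeholder observation model
}
```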

The same source that sets the scale of the traditional ~normal(0, x). But I think you’re expressing that my favoured ~weibull(2, x) actually reflects the imposition of more prior information despite the seemingly equal role of the scale parameter in both prior types, because (thinking out loud here) the Weibull distribution encodes a broad family of shapes, and my choice of 2 defines more properties of the expectations than does the choice of zero in the case of the normal.

I was momentarily swayed by @jsocolar’s point that the sampler’s traversal into pathologically small variances, when these are not suppressed, might be a sign that they are supported by the data, but then: can’t low-information data combined with a zero-peaked prior yield the same behaviour (i.e. traversal into low-variance realms causing divergences precisely because the prior dominates)? If that’s true then the observed behaviour isn’t diagnostic that the data actually support those low values. Jesus, am I about to suggest a Bayes factor to adjudicate here? 😵

The important distinction is that a half-normal only suppresses values that approach positive infinity. Depending on the choice of x in weibull(2, x), not to mention which parameterization is being used, the Weibull can either suppress positive infinity alone or both positive infinity and zero. In that latter case it requires substantially more qualitative domain expertise to motivate.
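Concretely, in Stan’s shape-scale parameterization the two densities behave very differently near zero:

$$
\pi_{\text{half-normal}}(\sigma) \propto \exp\!\left(-\frac{\sigma^2}{2 x^2}\right) \xrightarrow{\;\sigma \to 0\;} \text{const} > 0,
\qquad
\pi_{\text{weibull}(2, x)}(\sigma) \propto \frac{\sigma}{x^2} \exp\!\left(-\left(\frac{\sigma}{x}\right)^{2}\right) \xrightarrow{\;\sigma \to 0\;} 0,
$$

so with shape 2 the Weibull suppresses a neighborhood of zero whose width is set by x, and that extra suppression is exactly what has to be justified by domain expertise.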

If the sampler explores a model configuration then that configuration is, by definition, reasonably supported by the data, as encoded by the likelihood function, and by the included domain expertise, as encoded by the prior model. Weakly-informative likelihood functions just mean that more model configurations are consistent with the data, and if the likelihood function is relatively flat towards zero heterogeneity then all of those nearly homogeneous configurations are consistent with the observed data.

The only way to avoid those configurations is to introduce domain expertise that conflicts with them, but again that requires a careful elicitation of that domain expertise and not just throwing down some number because it makes the Stan diagnostic warnings go away.


Well put and thanks, I appreciate your expertise in these matters 🙂

I greatly appreciate the detailed and thoughtful discussions here. I don’t think I’ve seen this advice elsewhere, but it’s very interesting:

Are there case studies or papers you can recommend to better understand how to operationalize this idea of containment?

Getting back to the model itself, taking the approach of integrating out the E and drt variables seems to be working with half-normal priors on the sigmas.

Edit: The following paragraphs turned out to be wrong. I was having divergences because I was trying to fit a very small simulated data set. I’d set the size low because I was debugging the integration and forgot to set it back to a reasonable level for model testing! Once it was back to a reasonable size, there were no longer divergences with the half-normal priors, even for small values of sigma.

After more experimentation it seems like I’m running into sampling problems only when the true (simulated) value of sigma_r is very low. As long as it has a reasonable value, the data keep the sampler from getting into dangerously low regions of sigma_r.

I’d feel better if I understood what was going on with very small sigma_r values; I suspect it’s some interaction with both mu_a and sigma_a simultaneously. Can a parameter still cause a funnel when it’s been integrated out?

Conveniently I do have a piece coming out at the end of the month (or available now to my Patreon supporters). Check back on my writing page, Writing - betanalpha.github.io, by November 1st.

The fewer the data, the less informative the realized likelihood function will be, and the more diffuse the posterior for \sigma_r will be. This then exposes more of the log_d-sigma_r funnel, especially the neck at small values of sigma_r, which then causes the computational problems.

At the same time, the smaller the true value of sigma_r, the more the realized likelihood function, and hence the posterior distribution, will concentrate at small values of sigma_r, where the log_d-sigma_r funnel behaves the worst. Even with large data sets you can run into unfortunate geometries that frustrate computation here.

It could be a mu_r-sigma_r degeneracy or a log_d-sigma_r degeneracy. Again see Hierarchical Modeling for a discussion of the two.

The problem when fitting these models with Stan is that Stan explores the entire parameter space, and has to confront interactions between all of the parameters. There’s no way to “integrate out” variables to avoid these problematic interactions unless you can do it before running Markov chain Monte Carlo. Sometimes, although rarely, this can be done analytically. Tools like INLA do this approximately for a limited class of models.
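The textbook example of the rare analytic case, which does not apply to this model because of the constraint, is the normal-normal component: if

$$
y_n \sim \text{normal}(\theta_n, s), \qquad \theta_n \sim \text{normal}(\mu, \tau),
$$

then the \theta_n can be integrated out exactly to give

$$
y_n \sim \text{normal}\!\left(\mu, \sqrt{s^2 + \tau^2}\right),
$$

so they never have to appear in the parameter space that the sampler explores.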
