Issues with E-BFMI (missing doc, confusing name, etc.)

disclaimer: I think the quantity E-BFMI is useful, the formula makes sense, and the technical explanation in https://arxiv.org/pdf/1604.00695.pdf is good. These are just comments on how I think we’re not communicating this stuff well enough to our users:

  1. As far as I can tell, E-BFMI is not mentioned anywhere in the User’s Guide or Reference Manual. (In the CmdStan guide there’s at least a link to one of @betanalpha’s papers, which is great and more than I found anywhere else, but that doesn’t count as software documentation.)

  2. The name “fraction of missing information” is confusing. I know that “fraction of missing information” has a history (we didn’t invent the name) but it’s not intuitive at all that a low fraction of missing information is bad! Intuitively it should be the opposite. I think it’s unfair to our users to not document this and then to also use a name that contradicts basic intuition.

  3. Why does it have “Bayesian” in the name? What’s Bayesian about this other than the fact that we’re fitting Bayesian models? We don’t refer to effective sample size as “Bayesian effective sample size” just because we’re applying it in a Bayesian context. Am I missing something about this?

Anyway, just some thoughts motivated by chatting with @jsocolar about E-BFMI warnings in our various interfaces (he also pointed out that RStan and CmdStan use different thresholds for E-BFMI warnings, which also seems suboptimal). Curious what other people think about this. Thanks!


100% agree. Indeed, I’ve started using rebfmi, where the r is for reciprocal, and using a “something’s wrong” criterion of >3.


I don’t “know” know what ebfmi is, only that this has always confused me:

I just always hope CmdStan’s diagnose knows what it’s doing…


Spitballing better names….

Since the literal computation involves (if reciprocated to make big bad) the ratio of the Marginal Energy Variance to energy transition (\Delta_e) variance, how about “MEVDEV Ratio”? Or, since inclusion of “marginal” is somewhat over-precise in my opinion, “EVDEV”?
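For concreteness, here is a minimal Python sketch of that computation, assuming `energies` holds one chain's post-warmup `energy__` values. The function names are made up for illustration, and the sum-of-squares form is just one common way to write the estimator; it is not copied from any interface, which may normalize slightly differently.

```python
def ebfmi(energies):
    """Estimate E-BFMI for a single chain from its per-iteration energies."""
    n = len(energies)
    if n < 3:
        raise ValueError("need at least 3 energy draws")
    mean_energy = sum(energies) / n
    # Numerator: summed squared energy transitions (lag-1 differences).
    transition_ss = sum((b - a) ** 2 for a, b in zip(energies, energies[1:]))
    # Denominator: summed squared deviations of the energies from their mean.
    marginal_ss = sum((e - mean_energy) ** 2 for e in energies)
    if marginal_ss == 0:
        raise ValueError("energies are constant; the ratio is undefined")
    return transition_ss / marginal_ss


def reciprocal_ebfmi(energies):
    """The 'big is bad' version: marginal variation over transition variation."""
    return 1.0 / ebfmi(energies)


if __name__ == "__main__":
    toy = [0.0, 1.0, 0.0, 1.0]  # toy sequence, not real sampler output
    print(ebfmi(toy), reciprocal_ebfmi(toy))
```

Written this way, the reciprocal is the marginal-to-transition ratio described above, so whichever name gets picked only has to label one of these two numbers.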

I guess we could also just make up a symbol too. e-hat or something.

I bet most people are in the same boat because it’s not described in detail anywhere in any Stan materials, only in some papers. The explanations in the papers are good but they don’t explain the confusing aspects of the name that I mentioned above (which is fine but we should do that in the doc).

It doesn’t know what the threshold should be for warning the user (it’s just fixed at a level that I guess has some support from experiments but has also been described as “too conservative”), but otherwise it seems to compute the correct number given the fixed threshold it uses.

Edit: to be fair, we don’t know what threshold we should use for other diagnostics either (e.g. R-hat, ESS). They’re all based on experiments with some theory sprinkled in occasionally, as far as I can tell.


With the disclaimer that I have no idea what I’m talking about… I wonder whether the correct parsing is \frac{Bayesian\,missing\;information} {missing \;information} (as opposed to Bayesian \frac{missing\;information}{information}). So if there’s some amount of missing information (whatever that means–is this information in the information theoretic sense?), then we want most of that missing information to belong to the Bayesian fraction rather than some other fraction.

Huh, now that I think about it, is the idea that the PDF of energies should be the same as the PDF of energy transitions (possibly after some location shift has been accounted for)? Or just the variances? If the full PDF, then a qqplot would be a good visual diagnostic, which in turn suggests alternate measures of fit like linear r-squared, ks-stat, etc.
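If someone wants to play with the full-distribution idea, a rough Python sketch of the KS-style comparison might look like the following. It assumes the comparison is between mean-centered energies and lag-1 energy differences, which is my reading of the question rather than an established diagnostic, and the helper names are hypothetical.

```python
import bisect


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    gap = 0.0
    for v in xs_sorted + ys_sorted:
        fx = bisect.bisect_right(xs_sorted, v) / len(xs_sorted)
        fy = bisect.bisect_right(ys_sorted, v) / len(ys_sorted)
        gap = max(gap, abs(fx - fy))
    return gap


def energy_ks(energies):
    """Compare centered energies against energy transitions for one chain."""
    mean_energy = sum(energies) / len(energies)
    centered = [e - mean_energy for e in energies]
    transitions = [b - a for a, b in zip(energies, energies[1:])]
    return ks_statistic(centered, transitions)
```

Whether matching the full distributions (rather than only the variances) is the right target is exactly the open question here, so treat the output as exploratory.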

Oh! And doesn’t this suggest another across-chain diagnostic? Where you make sure all pairwise combos of PDFs match?

Also: does this expectation (that the variance-or-PDF of X is equal to that of lag1diff(X)) apply just to energies, or do all quantities (lp__, parameters) share this expectation?

This is closely related to ESS for the other parameters, right?


As @jonah noted, the diagnostic was introduced in “Diagnosing Suboptimal Cotangent Disintegrations in Hamiltonian Monte Carlo” (arXiv:1604.00695), but it’s also discussed in my “A Conceptual Introduction to Hamiltonian Monte Carlo” (arXiv:1701.02434), as well as my degeneracy case study, Identity Crisis, and hierarchical modeling case study, https://betanalpha.github.io/assets/case_studies/hierarchical_modeling.html (see in particular Section 3.3 for a conceptual discussion and Section 4.1 for code examples).

Honestly the diagnostics across the interfaces are a bit of a mess right now. How can there be consistent documentation when the interfaces are not using the same diagnostics? Long ago we had discussed moving diagnostics into core services, some of which have been implemented in https://github.com/stan-dev/stan/tree/develop/src/stan/analyze/mcmc, but since then the interfaces have largely governed themselves. I strongly support any attempt to unify the diagnostics across the interfaces, either by policy or by relying on C++ services, so that we could then document them in one place.

I agree that it’s not fair to not document any automated diagnostics, but I don’t think that we can rely on intuition here.

As you note, “fraction of missing information” has a technical definition in the statistics literature. Mathematically it provides a way of quantifying how different a conditional distribution is relative to a corresponding marginal distribution, but like so many terms in statistics it’s defined not by the mathematical definition but rather an initial application (in this case in imputation). That said “fraction of missing information” does have some intuition in the sense of quantifying what information a marginal distribution is “missing” relative to the collection of distributions in a conditional distribution, although the numerical values are weird. Unfortunately this is one of those notational sins that are hard to avoid in statistics.

More importantly the application of this comparison to energies in Hamiltonian Monte Carlo requires a nontrivial understanding of the workings of Hamiltonian Monte Carlo, and no intuition will be valid without that understanding.

From that perspective I don’t think this diagnostic is much different from the other diagnostics. Most users won’t understand what they mean, let alone what values are good or bad, but with proper documentation they can at least identify when they shouldn’t trust the accuracy of their fit and where they can learn about possible reasons why the diagnostic is failing and potential resolution strategies.

Perhaps controversially I don’t think we should try to force intuition that isn’t there. For example there’s already a wealth of misunderstanding about effective sample size because it’s presented in simplified forms that belie its more complicated nature.

Yeah, “Bayesian” ended up being a red herring. I originally picked up the mathematical form from a paper about Bayesian hierarchical modeling where it was referred to as the “Bayesian Fraction of Missing Information”. In hindsight I realized that “Bayesian” referred to applying the “Fraction of Missing Information” to compare conditionals and marginals in hierarchical posterior distributions, and have since dropped the “Bayesian” (see for example the case studies). Then again as noted above “Fraction of Missing Information” isn’t necessarily applicable, either.

I have brought this up in the past.


Thanks @betanalpha, that all makes sense.

I still think that’s a good idea. We’ve implemented R versions of many of the diagnostics because (1) they’re not all available in core services and (2) it’s useful to also (not exclusively) have implementations that don’t depend on Stan C++. But I do think it would be better to have the diagnostics implemented in Stan services and then officially documented.


Tell me about it, I just reimplemented all of cmdstan’s diagnose.cpp just because I wanted to have a nice Python function instead. (Also, diagnose takes forever (>1s) for large files.)

But, to be honest, it’s not even that much work…


Haha, when I have 5 chains with 250K parameters I’m happy if rhat finishes in several minutes :P


@betanalpha thanks for another lucid explanation of this stuff

So just to recap the “Fraction of Missing Information” name, and make sure I have this right:

  • FMI was originally applied in the context of multiple imputation, where it refers to the ratio of the sampling variation of some summary statistic \theta between imputations (the between-imputation variance) to the sampling variation across imputations (the total variance).
  • Implicit in the above is a within-imputation variance component. Thus, in the multiple imputation context, FMI quantifies the fraction of “information” (sensu lato, I think?) about the sampling distribution for \theta that is missing from the sampling distribution that we observe for a single imputation. So that’s where the name comes from.
  • In the energy-FMI context, we are looking at the ratio of the variance in energy transitions \Sigma_{resampling} compared to the total variance in energy \Sigma_{total}.
  • But for some reason the appropriate formulation for the e-FMI variance ratio involves a formulation for \Sigma_{resampling} that is conceptually and/or mathematically equivalent to the between-imputation variance component above, and not the within-imputation component.
  • Thus, we are stuck calling the case where \Sigma_{resampling} is high a “high e-FMI”.

Is that (at least approximately) correct?
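To make the multiple-imputation side of the recap concrete, here is a small Python sketch of the between-to-total variance ratio under Rubin's combining rules. It assumes `estimates` are the per-imputation point estimates of \theta and `within_variances` are their squared standard errors; the function name is hypothetical, and the sketch includes the usual finite-m factor (1 + 1/m) while ignoring the degrees-of-freedom adjustment that some definitions of FMI add.

```python
from statistics import mean, variance


def between_to_total_ratio(estimates, within_variances):
    """Share of the total variance attributable to between-imputation spread."""
    m = len(estimates)
    if m < 2 or len(within_variances) != m:
        raise ValueError("need m >= 2 estimates and one within-variance per estimate")
    between = variance(estimates)        # between-imputation variance B
    within = mean(within_variances)      # average within-imputation variance W
    total = within + (1 + 1 / m) * between  # total variance T
    return (1 + 1 / m) * between / total


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    print(between_to_total_ratio([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```

When the imputations agree with each other the ratio goes to zero, and it grows as the imputations disagree, which is the "high means more is missing" direction of the original name.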

@jsocolar @Funko_Unko FYI I’m working on during-sampling redirection of outputs to a storage format that should make the computation of diagnostics far faster


Anything not csv/text based is probably an improvement, isn’t it?

Edit: Though, much of the time should be spent computing anyways.


Hi,
without pretending I understand the technical part very well, I think the user confusion (why “low missing information” is bad) could be easily handled by just using “E-BFMI” or “BFMI” without actually expanding the acronym in all messages and short-form documentation. Instead of explaining the acronym we would just explain what the diagnostic (roughly) tells you. I.e. use “E-BFMI” the same way we use “R-hat” - we also don’t explain why it is called “R-hat”. Interested readers then may be redirected to more detailed treatment that explains the name and - more importantly - the math (as we do with R-hat).

So taking the description at Runtime warnings and convergence problems as a starting point, one would have:

You may see a warning that says some number of chains had low E-BFMI. This implies that the adaptation phase of the Markov Chains did not turn out well and those chains likely did not explore the posterior distribution efficiently. For more details on this diagnostic, see https://arxiv.org/abs/1604.00695

I don’t think anything has been lost by not expanding the acronym here.
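As a rough illustration of how an interface could emit that kind of acronym-only message, here is a Python sketch. Everything in it is hypothetical: the function names, the message wording, and especially the threshold, which the caller has to supply because (as noted earlier in the thread) the interfaces do not currently agree on one.

```python
def ebfmi(energies):
    """Sum of squared energy transitions over sum of squared deviations."""
    mean_energy = sum(energies) / len(energies)
    transition_ss = sum((b - a) ** 2 for a, b in zip(energies, energies[1:]))
    marginal_ss = sum((e - mean_energy) ** 2 for e in energies)
    return transition_ss / marginal_ss


def low_ebfmi_chains(energies_by_chain, threshold):
    """Return the (0-based) indices of chains whose E-BFMI is below threshold."""
    return [i for i, e in enumerate(energies_by_chain) if ebfmi(e) < threshold]


def ebfmi_warning(energies_by_chain, threshold):
    """Build a short warning that never expands the acronym, or None if all is well."""
    flagged = low_ebfmi_chains(energies_by_chain, threshold)
    if not flagged:
        return None
    return (
        f"{len(flagged)} of {len(energies_by_chain)} chains had low E-BFMI "
        f"(chains: {flagged}). Those chains likely did not explore the "
        "posterior distribution efficiently."
    )


if __name__ == "__main__":
    chains = [[0.0, 1.0, 0.0, 1.0], [float(i) for i in range(10)]]  # toy data
    print(ebfmi_warning(chains, threshold=0.3))  # 0.3 is an arbitrary example value
```

The point of the sketch is only that the message can stay short and point elsewhere for the math; the choice of cutoff is a separate question.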


Basically, although a more useful abstraction is comparing conditional distributions to marginal distributions. Given some joint distribution \pi(x, y) the “fraction of missing information” compares the average variance of the conditional distribution \pi(x \mid y) to the variance of the marginal distribution \pi(x). In other words it provides a quantification of how much information about the joint distribution of x is “missing” from the marginal distribution \pi(x).
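Written out, the comparison described above is the first term of the law of total variance divided by the marginal variance. The symbol FMI is just a label chosen here for the ratio; the decomposition itself is standard.

```latex
\mathrm{Var}_{\pi(x)}[x]
  = \mathbb{E}_{\pi(y)}\!\left[ \mathrm{Var}_{\pi(x \mid y)}[x] \right]
  + \mathrm{Var}_{\pi(y)}\!\left[ \mathbb{E}_{\pi(x \mid y)}[x] \right]

\mathrm{FMI}
  = \frac{\mathbb{E}_{\pi(y)}\!\left[ \mathrm{Var}_{\pi(x \mid y)}[x] \right]}
         {\mathrm{Var}_{\pi(x)}[x]}
  \in [0, 1]
```

Since both terms of the decomposition are non-negative, the ratio is small exactly when the conditional distributions are much narrower than the marginal. This is the population quantity only; how closely a finite-sample estimator built from energy draws tracks it is a separate matter.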

Historically, I believe I first encountered the diagnostic in a paper about Bayesian hierarchical modeling, which took the ratio and the name from earlier imputation papers. I tried to find that intermediate reference but it seems that I stopped keeping track of it after digging down to the original imputation references.

Yes, and in the Hamiltonian Monte Carlo context this is best interpreted from the conditional/marginal perspective.
