Help understanding BFMI (interpreting BFMI > 1)

Some colleagues and I have been trying to get a slighter deeper understanding of the BFMI diagnostic. For what it’s worth, we’ve read @betanalpha’s arxiv:1604.00695 and arxiv:1701.02434 which go into some detail. (We have not yet gone back to, e.g., the work from Rubin which is referenced.)

In those papers we want the conditional energy distribution p(E|q) to explore the full posterior energy distribution, p(E). So we compare the variances: \mathrm{BFMI}={\rm var}(E|q)/{\rm var}(E).

Because the momentum changes at constant coordinate with every step, this is estimated as \sum (E_{n} - E_{n-1})^2/\sum (E_n - {\bar E})^2.

But the numerator is the difference of two random variables; in the limit of independent samples, it seems like we should expect \mathrm{BFMI}=2. And indeed if I use stan on a particularly simple problem (estimating the mean of iid gaussian samples), I see \mathrm{BFMI}>1, with values around 1.1 with adaptation and 1.3 without, and no other sampling pathologies from any of the diagnostics (and it gets the right answer). And of course the histograms are markedly different, with the marginal energy distribution quite skew, but the two distribution matching fairly well in the high-energy tails (perhaps this is enough?)

In more standard “difficult” problems, I do indeed usually see something like \mathrm{BFMI}\simeq 0.85, but I would like to have a better understanding of how to interpret things in both cases.

Ultimately the “fraction of missing information” is just a statistic that contrasts marginal behavior to conditional behavior in a given joint distribution. In other words it is just one way to quantify the non-uniformity of the “missing” marginal pi(q) in

\pi(E) = \int \mathrm{d} q \, \pi(E \mid q) \, \pi(q),

and how much information about \pi(E) that any conditional \pi(E \mid q) is missing.

Most importantly when the fraction of missing information is small then \pi(E \mid q) is much narrower than \pi(E). In the context of Hamiltonian Monte Carlo this means that the momenta refreshments are doing a poor job of exploring between Hamiltonian level sets and that the resulting Markov chain exploration will be slow no matter how effectively each Hamiltonian trajectory explores the individual level sets.

This typically happens when the posterior density function exhibits a sufficiently high-dimensional funnel: see for example the discussion in Section 3.3 and 4.1 of Hierarchical Modeling.

[Terminological tangent: I first encountered this statistics in a paper on Bayesian hierarchical modeling, hence the “Bayesian” fraction of missing information. The Hamiltonian Monte Carlo application, however, has nothing to do with Bayesian inference and I recommend dropping the “B” in the diagnostic.]

The less pathological behavior depends on the details of the problem at hand. For low-dimensional systems the marginal energy distribution \pi(E) – the “density of states” in the physics parlance – is typically bounded by some smallest value where as the conditional energy changes from momenta resampling are not. This difference in support can result in awkward artifacts in the variances which then skews the fraction of missing information ratio.

At higher-dimensions reasonably well-behaved systems tend to result in marginal energy distributions that concentrate above the lower bound and become better approximated with normal density functions. In this case the energy fraction of missing information ratio might converge to something reasonable well understood.

That said I don’t think that there’s much insight to be gained from examining this behavior in detail. So long as E-FMI > 0.2 or so the momenta resamplings will be able to move across all of the relevant energy level sets after just a few iterations and the overall performance of the Markov chain will depend on how effectively each Hamiltonian trajectory explores the energy level sets.

You mean any conditional \pi(E \mid q), right?

1 Like

Yes, thanks! I’ve edited the text accordingly.