Fixing "BFMI was low" problems

There is a short writeup at http://mc-stan.org/misc/warnings.html#bfmi-low about what it means when BFMI is low, but I am at a loss as to what to do about it. One of the suggestions given is “reparameterize your model,” but what sort of reparameterization helps here?

Also… I tried to read the referenced paper on BFMI at https://arxiv.org/pdf/1604.00695.pdf, but I couldn’t make much sense of it. I understand MCMC well, I understand the basics of HMC, and I speak measure theory, but I don’t know what the notation T*Q means in the context of this paper, and terms like “momentum fiber” and “cotangent bundle,” and “cotangent disintegration” don’t mean anything to me. Can anyone suggest what I should read first to fill in the assumed background for this paper?

1 Like

You may prefer the conceptual introduction, https://arxiv.org/abs/1701.02434, which discusses the energy diagnostic outside of the geometric foundations.

The necessary mathematics for the formal discussion are discussed in https://arxiv.org/abs/1410.5110 and the references therein.

We set the E-BFMI warning threshold at 0.3 based on some empirical heuristics but that may have been a bit conservative given how robustly dynamic HMC handles even extremely heavy-tailed distributions. Still, the performance can be greatly improved by the same reparameterizations that would increase E-BFMI above that threshold so it’s probably a good idea anyways.

A good example of a helpful reparameterization is using a gamma mixture of normals instead of Cauchy. Similarly, see the parameterizations used in the Finnish horseshoe of @avehtari and colleagues.

2 Likes

Thanks for the references. Ironically, the problem I’m wrestling with already reparameterizes a student-t distribution as a gamma mixture of normals. It’s a state-space model with local level, local trend, weekly seasonality (daily data), and a student-t distribution for observation errors implemented by having a separate observation error sigma[t] for each time step p, with independent gamma priors on the variables sigma[t].

I’m still unclear as to what kind of reparameterization helps with low BFMI. It sounds like maybe I should be trying a parameterization that avoids long tails in the posterior, but when I do density plots for parameters and energy__ I don’t see any long tails, despite getting BFMI warnings for all four chains.

Here are some more specifics. The time series is of length 1000. Using 2000 warmup and 2000 sampling iterations, with four chains running in parallel, it takes nearly three hours to run the estimation on my year-old laptop computer. (This is a big problem, as I need to scale up to running lots of analyses.) Rhat looks good for everything, but n_eff is relatively low at 586 for the degrees-of-freedom parameter nu of the student-t. The parameter nu has a posterior correlation of -0.86 with with energy__, and a posterior correlation of +0.70 with the student-t scale parameter. The student-t scale has a posterior correlation of -0.65 with energy__.

Firstly, before complaining about Stan being slow might I recommend considering the alternatives? Being able to fit MCMC accurately on state space models with thousands of dimensions within a few hours on commodity hardware is pretty novel compared to what was possible even just a few years ago.

Heavy tails are one feature that manifest in low E-BFMI but they are not the only feature, and the other possible causes have to this point been sufficiently obscure that we haven’t had a great chance to analyze them. If you can reduce your problem into one that can be shared then we can try fitting it ourselves and see if we recognize anything, but otherwise these problems are too hard to diagnose in general because there are so many possible pathologies.

Oh, and keep in mind that the default output of RStan is on the constrained scale where as the algorithms are fitting on the unconstrained scale. If you have any constrained parameters it is often easier to diagnose using the unconstrained values – look up “diagnose_file” for information on how to access those values.

It’s very hard to help with efficiency issues by guesswork. How did you code the state-space model? Using something like the forward dynamic programming algorithm for HMMs? Are redundant computatiosn removed?

An n_eff of 500 is more than you need for most inferential purposes. After 100 draws, your MCMC standard error is 1/10 of the posterior standard deviation and then you need 400 draws to get to 1/20, 900 draws to get to 1/30 and so on.

We’re also always happy to take contributions aimed at making Stan faster!

I’m not complaining, Michael, and I really do appreciate all the great work you and the rest of the Stan team are doing. (I’ve described you to my coworkers as “a defector from the physics community who has been leaking their secrets to the Bayesian stats community” :-) ) I’m just trying to understand how I can improve performance, because I’m going to have to run a large number of these models on a regular basis.

Hi, Bob. I’m building my model on top of Jeffrey B. Arnold’s “State Space Models in Stan” (https://github.com/jrnold/ssmodels-in-stan and https://jrnold.github.io/ssmodels-in-stan/), in particular using his ssm_lpdf() function, which uses the standard dynamic programming algorithm to compute the likelihood.

Yes, I understand that n_eff=400 is already sufficient; I mentioned the n_eff of 586 (for 4 chains and 2000 sampling iterations) only because it’s lower than what I’m used to seeing. I usually can get n_eff > 400 with only 1000 sampling iterations.

I am learning some things that might be of use to anyone else trying to implement time series models in Stan. For example, I originally implemented a single-source-of-error model (as described in Hyndman et al.'s book “Forecasting with Exponential Smoothing”) with the initial system state as an explicit variable (not integrated out), and found that I got a funnel interaction between the initial trend and the evolution variance for the trend.

Thanks for the feedback on these models.

Given that Stan doesn’t have discrete parameters, I’m not quite sure what you mean by treating the “initial system state as an explicit variable”. Or how you might go about integrating out something that’s continuous.

By “system state” I mean the latent state vector alpha[t] that evolves from one time step to the next. In a single-source-of-error state-space model the observation and evolution errors are linked:

y[t] = w' * alpha[t] + epsilon[t]
epsilon[t] ~ normal(0, sigma)
alpha[t+1] = F * alpha[t] + epsilon[t] * g
alpha[1] ~ multinormal(mu0, Sigma0)

where F is a matrix and g and each alpha[t] are vectors. In a SSOE model it turns out that once you know alpha[1] and y[1],…,y[T], the values epsilon[1],…,epsilon[T] and alpha[2], …, alpha[T] are uniquely determined. The vector alpha[1] is the “initial system state” I mentioned. A direct (and unoptimized) implementation in Stan looks something like this:

parameters {
    vector[N] alpha1;
    ...
}
transformed parameters {
  vector[T] epsilon;
  // code to compute each epsilon[t] from alpha1 and y[1],...,y[t]
  ...
}
model {
  epsilon ~ normal(0, sigma);
  ...
}

Alternatively, you can compute the marginal likelihood, integrating out alpha1. The integral has a recurrent closed-form solution because everything is normal distributions. You get rid of N latent variables and a funnel-shaped linkage between alpha1[2] and g[2] at the cost of a much more complex likelihood computation.

Nice! Thanks for the clarification.