Funnel, transformed to unconstrained space

I hadn’t quite understood the density of the funnel example (as described in the user’s guide chapter on reparameterization), or the full implications of using a lognormal prior, as Radford Neal recommends, as opposed to the half-normal that Andrew Gelman recommends.

To keep things simple, I’ll use a slightly modified version without worrying about constant choices—the geometry’s the same. There are two models in play. I’ll start with Radford’s version then move on to Andrew’s.

Neal’s funnel

Radford’s original funnel (with simplified constants) is

\begin{array}{rcl} \tau & \sim & \textrm{normal}(0, 1) \\ \beta & \sim & \textrm{normal}(0, \exp(\tau)) \end{array}

The reason we care about this is that it’s a hierarchical prior for a regression model, where there would be a group of coefficients \beta_1, \ldots, \beta_K; we’ve just simplified here to a single coefficient, taking K = 1, as we only care about the marginal prior of \tau and one of the \beta_k.

An alternative way of writing this without an explicit transform is

\begin{array}{rcl} \sigma & \sim & \textrm{lognormal}(0, 1) \\ \beta & \sim & \textrm{normal}(0, \sigma) \end{array}
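
These are the same model: if \tau \sim \textrm{normal}(0, 1) and \sigma = \exp(\tau), then the density of \sigma picks up the Jacobian factor \left| \textrm{d}\tau / \textrm{d}\sigma \right| = 1 / \sigma, giving

\textrm{normal}(\log \sigma \mid 0, 1) \cdot \frac{1}{\sigma} \ = \ \textrm{lognormal}(\sigma \mid 0, 1),

which is the lognormal density by definition.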

The lognormal avoids zero in that

\lim_{\sigma \rightarrow 0} \textrm{lognormal}(\sigma \mid 0, 1) = 0.

The convergence is fast, as we’ll see in the plots below.
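
To see how fast, substitute \tau = \log \sigma in the density:

\textrm{lognormal}(\sigma \mid 0, 1) \ \propto \ \exp\!\left( -\frac{\tau^2}{2} - \tau \right),

and the -\tau^2 / 2 term drives this to zero faster than any power of \sigma as \tau \rightarrow -\infty.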

Gelman’s funnel

If we follow Andrew’s latest recomendations for hierarchical modeling, we stay away from zero-avoiding priors and code something like this,

\begin{array}{rcl} \sigma & \sim & \textrm{normal}(0, 1) \, \textrm{T}[0, ] \\ \beta & \sim & \textrm{normal}(0, \sigma) \end{array}

where I’ve used Stan-like notation to indicate the half-normal prior on \sigma that is consistent with zero, in that

\lim_{\sigma \rightarrow 0} \textrm{normal}(\sigma \mid 0, 1) \, \textrm{T}[0, ] \ = \ 2 \cdot \textrm{normal}(0 \mid 0, 1) \ > \ 0.

Unconstrained funnels

Neal’s funnel is naturally on the unconstrained scale p(\tau, \beta), with support \tau, \beta \in \mathbb{R}. That’s because \tau = \log \sigma.

Gelman’s funnel, on the other hand, needs to be transformed, with \tau = \log \sigma and inversely \sigma = \exp(\tau), so that the Jacobian adjustment is

\left| \frac{\textrm{d}}{\textrm{d} \tau} \exp(\tau) \right| \ = \ \exp(\tau),

resulting in the unconstrained density (dropping the constant factor of 2 from the truncation)

p(\tau, \beta) = \textrm{normal}(\exp(\tau) \mid 0, 1) \cdot \exp(\tau) \cdot \textrm{normal}(\beta \mid 0, \exp(\tau)).

Modes?

I had naively thought that neither version of the funnel had a mode. It turns out that Neal’s funnel, because of the zero avoiding behavior, really does have a mode.

If we have a mode (\tau^*, \beta^*), it must have \beta^* = 0, because \beta \sim \textrm{normal}(0, \exp(\tau)). So let’s look at the functions p(\tau, 0) for Gelman’s and Neal’s funnels.
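
Here’s a minimal Python sketch of the two unconstrained log densities along the \beta = 0 slice (assuming NumPy and SciPy are available; this is just an illustrative sketch, not the code behind the plots below):

import numpy as np
from scipy.stats import norm

def neal_logp(tau, beta):
    # tau ~ normal(0, 1), beta ~ normal(0, exp(tau)): already unconstrained
    return norm.logpdf(tau, 0, 1) + norm.logpdf(beta, 0, np.exp(tau))

def gelman_logp(tau, beta):
    # sigma ~ normal(0, 1) T[0,], beta ~ normal(0, sigma), with sigma = exp(tau);
    # the + tau is the log Jacobian (constant factor 2 from truncation dropped)
    sigma = np.exp(tau)
    return norm.logpdf(sigma, 0, 1) + tau + norm.logpdf(beta, 0, sigma)

tau = np.linspace(-8, 3, 1000)
print(tau[np.argmax(neal_logp(tau, 0.0))])               # mode near tau = -1
print(gelman_logp(np.array([-5.0, -10.0, -20.0]), 0.0))  # flattens out as tau -> -infinity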

Neal’s funnel mode

Neal’s funnel actually has a mode. Here’s a plot with \beta = 0 fixed,


Plot of Neal’s funnel p(\log \sigma, \beta = 0). There is a clear mode around -1.

Hmm. I thought it grew without bound, but that lognormal prior on \sigma is really strong!
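
The mode is easy to work out analytically. Fixing \beta = 0,

\log p(\tau, 0) \ = \ \log \textrm{normal}(\tau \mid 0, 1) + \log \textrm{normal}(0 \mid 0, \exp(\tau)) \ = \ -\frac{\tau^2}{2} - \tau + \textrm{const},

which is maximized at \tau^* = -1, i.e., \sigma^* = \exp(-1) \approx 0.37.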

Gelman’s funnel mode

Surely without the zero-avoiding prior, there’s no limit to the log density of the hierarchical prior, right? I may even be recorded saying so.

I take it all back. Here’s Gelman’s, where \log \sigma = \tau.


Plot of p(\log \sigma, \beta = 0) for Gelman’s funnel. It asymptotes a little beyond \log \sigma = -5, or around \sigma < 0.005.

The plot actually looks almost identical on the log scale, the dropoff is so sharp. The Jacobian \exp(\tau) is keeping things balanced (you can work out the limits analytically pretty easily now that I’ve written out the density with the Jacobian adjustment).
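
Specifically, fixing \beta = 0,

\log p(\tau, 0) \ = \ \log \textrm{normal}(\exp(\tau) \mid 0, 1) + \tau + \log \textrm{normal}(0 \mid 0, \exp(\tau)) \ = \ -\frac{\exp(2 \tau)}{2} + \textrm{const},

because the log Jacobian \tau and the -\tau from \log \textrm{normal}(0 \mid 0, \exp(\tau)) cancel, so the log density flattens to a constant as \tau \rightarrow -\infty rather than growing without bound.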

So the unbounded growth is only a problem for penalized maximum likelihood estimates. When we take a maximum a posteriori (MAP) estimate on the unconstrained scale, we get the Jacobian term, which cancels the unbounded growth in \textrm{normal}(\beta \mid 0, \sigma) as \beta \rightarrow 0, \sigma \rightarrow 0.

We get a proper MAP estimate for Neal’s funnel, but we only get an asymptote with Gelman’s funnel.

Methodology going forward

Should any of this make us rethink the priors we recommend? Look how much more nicely behaved that density is in the Neal formulation. Wouldn’t that be sooo much easier to sample from using NUTS?

Maybe we can recommend these zero-avoiding priors and then recommend that users look for pileups of probability mass and/or divergence issues as \sigma \rightarrow 0 (that is, \log \sigma \rightarrow -\infty).

Among others, I’d love to hear from @betanalpha, @andrewgelman, @anon75146577, @avehtari, @bgoodri, and @yuling.


Joint modes aren’t that interesting for hierarchical models. Typically you want the mode of sigma | y and maybe even the mode of beta | sigma, y.

Also I’m not sure how you’re getting the limit of a density to be negative infinity when densities are non-negative.

Joint modes aren’t that interesting for hierarchical models.

Perhaps not, but the geometry of the posterior is. And that’s what I was basically confused about.

I’m not sure how you’re getting the limit of a density to be negative infinity when densities are non-negative.

Because I’m so used to Stan I think every density is on the log scale! I fixed it in the text.

I’m guessing the lognormal prior would be much better behaved in terms of the posterior geometry (a less sharp funnel). The real question is how bad you or others think it would be to use zero-avoiding priors like the lognormal when parameters are expected to be unit scale and we’re not expecting zero hierarchical variance. If we don’t know whether to pool or not, we can just look at the posteriors and see if they’re trying to pool completely even when our lognormal won’t let them.

Stan doesn’t need to find the mode (or even explore around it), so there’s some interest here, but I’m not sure what information to bring from it. The iterative mode finding (finding the mode of \sigma \mid y and then of \beta \mid \sigma, y) gives us information about the shape of the marginals, so the lesson is that the marginals are well behaved, while the joint is a bit odd. But I’m not sure how that affects the typical set.

My experience with non-local priors (see https://statmodeling.stat.columbia.edu/2018/05/04/zero-excluding-priors-probably-bad-idea-hierarchical-variance-parameters/) was that it was very hard to exclude enough of the neighbourhood around zero to meaningfully fix the divergences, but if we did do so, it massively shifted the posterior to the right, which I’d only be comfortable doing if we had a substantive reason to believe that values near zero could not happen (see Cromwell’s rule).

Just to be super clear, this experiment definitely isn’t definitive on the topic, but it was enough to convince me that it wasn’t worth going further. A more detail-oriented person would’ve done a simulation study with known parameters and checked bias and coverage :p

The problem with the funnel is a problem of only using first-order information. @betanalpha did some super-cool calculations that showed that the reparameterization is equivalent to running Riemannian Manifold HMC with a specific, non-constant mass matrix. So this suggests to me that the more principled fix to the geometry problem (compared to changing the model) is to use second-order information in the sampler. Changing the model without modelling reasons to do so just feels like the tail wagging the dog.

Edit (so as not to spam the thread): My experiments used a gamma (which goes to zero more slowly than a lognormal near zero) and an inverse gamma (which goes to zero much faster than the lognormal near zero). They both failed, but it’s possible that a lognormal would do better, so I checked; here is the equivalent table.

In all these experiments, the lognormal prior has mean given in the first column (the lower levels are quite silly!) and standard deviation 1. The code is in the git repo if you want to try some more values of the standard deviation (there might be a magic combo that makes sense, but I’m pretty lazy).

I’m seeing the same sort of behaviour: the priors that remove the geometry problem massively push the estimates of the group variance to the right. For comparison, the values of \tau you get with a half-N(0, 5) prior on the standard deviation (and a non-centred parameterization) are (0.13, 2.69, 9.23) [(0.025, 0.5, 0.975) quantiles].

mu      prior median   0.01 quantile   divergences   low_BFMI   tau 2.5%   tau 50%   tau 97.5%
0.00    1.00           0.10            170           FALSE      0.34       1.16      5.57
0.50    1.65           0.16            497           TRUE       0.38       1.57      7.69
1.00    2.72           0.27            179           FALSE      0.81       2.55      10.58
1.50    4.48           0.44            149           FALSE      1.03       3.27      11.46
2.00    7.39           0.72            57            FALSE      1.11       4.43      14.33
2.50    12.18          1.19            69            FALSE      1.12       5.04      16.36
3.00    20.09          1.96            16            FALSE      1.92       6.67      18.85
4.00    54.60          5.33            10            FALSE      2.39       8.96      23.65
5.00    148.41         14.49           1             FALSE      4.23       11.86     30.65
6.00    403.43         39.40           1             FALSE      5.94       15.22     38.94
7.00    1096.63        107.09          0             FALSE      7.72       18.86     48.33
8.00    2980.96        291.10          0             FALSE      9.62       23.68     68.11
9.00    8103.08        791.28          0             FALSE      12.01      30.88     91.27
10.00   22026.47       2150.92         0             FALSE      14.46      40.34     145.52
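
The prior median and 0.01 quantile columns are just the corresponding quantiles of a lognormal(mu, 1); a quick check in Python (a sketch assuming NumPy and SciPy, not the code from the repo):

import numpy as np
from scipy.stats import norm

mu = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
print(np.round(np.exp(mu), 2))                   # prior median of lognormal(mu, 1)
print(np.round(np.exp(mu + norm.ppf(0.01)), 2))  # 0.01 quantile of lognormal(mu, 1) with sd 1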

Hi, yes, Radford is using a zap (zero-avoiding prior). If you use a zap, you’ll get better behavior for mode-finding and also for posterior sampling. If you are going to use a zap, I’d recommend gamma(2) rather than lognormal. gamma(2) avoids zero but in a gentle way which allows the posterior to be arbitrarily close to zero. lognormal is more of a hard constraint away from 0. See further discussion here: http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf
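
Concretely, for any rate \lambda, the gamma(2) density behaves near zero as

\textrm{gamma}(\sigma \mid 2, \lambda) \ \propto \ \sigma \exp(-\lambda \sigma) \ \rightarrow \ 0,

vanishing only linearly in \sigma, whereas the lognormal density vanishes faster than any power of \sigma, which is what makes it behave more like a hard constraint away from zero.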
Alternatively, you can use a hard-avoiding zap such as lognormal, but then you need real prior information. Real prior information might not be hard to find (e.g., for a problem on unit scale, maybe 0.01 is a reasonable default hard lower bound for most problems), but then you need to be clear that’s what you’re doing.

Half-normal is of course no zap at all, and I use it from our usual 8-schools tradition of wanting the region near zero to be a possibility.

I also like that half-normal is constraining on the upper end: In practice when the number of groups is small, we sometimes run into problems with the posterior allowing very large values of tau. This does not cause convergence problems, but it can cause unrealistic posterior predictive simulations. But that’s a separate issue: here we’re considering the lower end.

Our usual recommendation is to use weak priors and then, if there are convergence or identification issues, introduce stronger priors using some real prior information. Your (Bob’s) proposal is to start with stronger default priors. If you want to start with a zap, the gamma(2) might be a safer default choice than the lognormal. But lognormal’s ok as long as the user is (a) willing to put in some prior information, and (b) not trying to test whether tau=0 fits the data.

Andrew

P.S. Ccing Yuling, who likes zaps, and Dan, who hates them.


Just a quick comment. Joint mode and funnel can even have the opposite behavior: as I showed in my simulation, in models quite like “Gelman’s funnel”, increasing the group variance eventually eliminates the funnel as tau is kept away from 0, but it also leads to bimodality; in contrast, decreasing the group variance avoids bimodality but gives rise to the funnel/metastability/slow mixing of HMC chains.

The second part of this statement is not true for 8 schools with anything but the most extreme zap (if you measure better sampling behaviour as there being no divergences).

I don’t think there’s any disagreement about the zero-avoiding nature of Radford’s original funnel example, but given that the original funnel still produces divergences, I think if anything it demonstrates how problematic the geometry near zero is. Even if you have strong prior information on the minimal heterogeneity scale – most people have trouble working out the maximal scale, and the minimal scale is even harder to reason about – it has to be really strong to avoid the pathological geometry. In most cases we are uncertain about how much heterogeneity there is, and we should take the general PC prior perspective – set the simplest model (homogeneous behavior) at zero and let the prior regularize the deviation away from that simpler model.