Zaps to avoid the funnel

Zap = zero-avoiding prior
funnel = the funnel (as in the 8 schools), also called the whirlpool or the bad geometry of the hierarchical model with weak data

We had this blog discussion a while ago about one of my favorite new ideas:


Which Dan Simpson hates:

I was talking with Yuling and was reminded how much I love this idea, so I’m just floating it to the top so we can remember to have a discussion about it sometime.

Example of a zap for the 8 schools. The gamma(2, 2/50) prior for tau starts out linear at 0, then eventually declines far out in the distance (see this paper for background: http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf). Here’s the Stan model:

data {
  int<lower=0> J;            // number of schools
  vector[J] y;               // estimated treatment effects
  vector<lower=0>[J] sigma;  // s.e.'s of effect estimates
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta;
}
model {
  theta ~ normal(mu, tau);   // centered parameterization
  y ~ normal(theta, sigma);
  tau ~ gamma(2, 2.0 / 50);  // zap: alpha = 2, rate = 2/50, expectation = 50
}
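
As a side check on the shape of that prior, here is a minimal Python sketch (not part of the Stan fit; the helper name gamma_pdf is made up for illustration). It uses the same rate parameterization as Stan's gamma(alpha, beta) and looks at the slope at 0, the mean, and the mode.

```python
import math

ALPHA = 2.0
BETA = 2.0 / 50.0  # rate parameter, matching Stan's gamma(alpha, beta)

def gamma_pdf(tau, alpha=ALPHA, beta=BETA):
    """Gamma density with rate beta. Returning 0 at tau <= 0 assumes alpha > 1."""
    if tau <= 0.0:
        return 0.0
    log_p = (alpha * math.log(beta) - math.lgamma(alpha)
             + (alpha - 1.0) * math.log(tau) - beta * tau)
    return math.exp(log_p)

prior_mean = ALPHA / BETA          # alpha / beta
prior_mode = (ALPHA - 1.0) / BETA  # (alpha - 1) / beta, valid for alpha >= 1

# Near 0 the density is approximately beta^2 * tau when alpha = 2,
# i.e. it rises linearly from exactly 0 at the boundary.
slope_at_zero = gamma_pdf(1e-6) / 1e-6

print("mean:", prior_mean, "mode:", prior_mode)
print("slope near 0:", slope_at_zero, "beta^2:", BETA ** 2)
```

The density is exactly 0 at tau = 0 but only rises linearly from there, so small positive values of tau are downweighted rather than ruled out.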

I ran this on the 8 schools and, to be honest, it didn’t remove all the geometry problems but I think it’s a lot better than the centered parameterization without the prior. And, yes, you could just use the non-centered parameterization here, but the point here is to try to avoid having to do that, given that the ncp can have its own issues (for example, when data are rich).

This was my experience. (Well, s/all/any of/g)

All my experiments are here: https://github.com/dpsimpson/boundary_avoiding_priors

Another quick proposal:
Classical results say that the posterior is robust (insensitive) to the hyperprior, but the zap will sometimes bias the posterior if there is in fact posterior mass around 0. As Dan wrote in the blog post:

It gives you a massively different set of values

In the spirit of simulated tempering, we could use importance sampling to adjust for that bias, except that we don’t have any draws around 0.
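
A rough sketch of why the missing draws near 0 hurt, assuming (purely for illustration) that the target is the same model with a flat prior on tau > 0, so the importance ratio is 1 / zap density up to a constant. The function name log_weight is hypothetical.

```python
import math

BETA = 2.0 / 50.0  # rate of the gamma(2, 2/50) zap

def log_weight(tau):
    """Log importance ratio flat / gamma(2, BETA), up to an additive constant."""
    # log gamma(2, BETA) density = 2*log(BETA) + log(tau) - BETA*tau, since Gamma(2) = 1
    log_zap = 2.0 * math.log(BETA) + math.log(tau) - BETA * tau
    return -log_zap

# Weights relative to a draw at tau = 10: they grow roughly like 1/tau as tau -> 0.
reference = log_weight(10.0)
for tau in (10.0, 1.0, 0.1, 0.01):
    print(tau, math.exp(log_weight(tau) - reference))
```

The ratio is unbounded at the boundary, and that is exactly where the zap fit produces few or no draws, so the reweighted estimate ends up dominated by a handful of points (or by none at all).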

So how about averaging two models: tau = 0 and tau ~ gamma(2, 2/50)? Since we have already included tau = 0 in the first model, I might even suggest a more informative zap, say inv-gamma(5, 5).
If

Then inv-gamma(5000,5000) I guess.

A naive implementation of tempering, or importance sampling, or indeed BMA (they should be equivalent in this context) will fail because the posterior energy distributions of these two models do not overlap (which I guess could be an alternative definition of the funnel?). But we also have stacking in the toolbox.

To sum up: adjust the bias of the zap by tempering; replace importance sampling with stacking.
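
A minimal sketch of what stacking two models amounts to, assuming we already have pointwise held-out log predictive densities for each model (for example one value per left-out school). The function names and the toy inputs are made up for illustration; they are not results from any fit.

```python
import math

def stacking_weight(lpd_a, lpd_b, iters=60):
    """Weight w on model A maximizing sum_i log(w*p_a[i] + (1-w)*p_b[i])."""
    def objective(w):
        total = 0.0
        for la, lb in zip(lpd_a, lpd_b):
            m = max(la, lb)  # factor out the max for numerical stability
            total += m + math.log(w * math.exp(la - m) + (1.0 - w) * math.exp(lb - m))
        return total

    # The objective is concave in w, so a ternary search on [0, 1] suffices.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def bma_style_weight(lpd_a, lpd_b):
    """BMA-style weight on model A from the summed log predictive densities."""
    diff = sum(lpd_b) - sum(lpd_a)
    return 1.0 / (1.0 + math.exp(diff))

# Toy inputs: each model predicts half of the held-out points better.
toy_a = [-1.0, -3.0]
toy_b = [-3.0, -1.0]
print(stacking_weight(toy_a, toy_b), bma_style_weight(toy_a, toy_b))
```

Stacking only needs each model's predictive density at the held-out points, so it does not require the two posteriors to overlap the way a reweighting scheme does.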

Yes, I like the idea of zap plus model averaging. This is in the spirit of my blog post where I suggested that we consider the funnel as a sort of discontinuous or multimodal posterior (it’s not actually multimodal, it doesn’t have 2 different modes, but it does have 2 different zones of curvature) and that we make the discreteness overt, as it were. The only tricky thing will come when we have many group-level variance parameters, as this gives us a mixture over 2^K modes.

I’m still not seeing the point of forcing a prior-data conflict that doesn’t fix the problem. I don’t see what you’re hoping for.

I’m hoping to avoid the damn funnel.

But it doesn’t. The funnel is consistent with the data in low-data regimes, so avoiding it will cause a prior-data conflict.

Very much a baby-with-the-bathwater situation. The three other solutions (reparameterisation, Riemannian, and [approximate] marginalisation) don’t have that problem.

The bath water’s so dirty that I’m willing to throw out some of the baby. To put it another way, I don’t think much is lost in the application by bounding tau away from 0.

I’m not 100% sure of myself here, I just like the idea.

I guess one motivation is that there are some situations where we do not know how to reparametrize (e.g., non-location-scale-family).

Indeed, the zap is moderately biased in the eight schools: eight-school.zap.pdf (5.1 KB)

For the model averaging part (averaging tau = 0 and the zap), what is the predictive quantity here? The individual-level data are lost, so what we can do is essentially leave-one-group-out.

The easiest solution is to assume sigma is given and to predict y_i, although the concern is that LOO from 8 points is dangerous.
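
For the tau = 0 side this can be written down in closed form. A minimal sketch, assuming sigma is known and mu has a flat prior (as in the Stan model above, which puts no prior on mu): with school i left out, mu given the rest is normal around the precision-weighted mean, so y_i has a normal predictive distribution with variance sigma_i^2 plus the posterior variance of mu. The y and sigma values are the standard 8 schools data; the function name is made up.

```python
import math

# Standard 8 schools data: effect estimates and their standard errors.
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

def logo_lpd_pooled(y, sigma):
    """Leave-one-group-out log predictive density of each y_i under tau = 0."""
    out = []
    for i in range(len(y)):
        # Precisions and estimates of the remaining schools
        prec = [1.0 / s ** 2 for j, s in enumerate(sigma) if j != i]
        rest = [v for j, v in enumerate(y) if j != i]
        total_prec = sum(prec)
        mu_hat = sum(p * v for p, v in zip(prec, rest)) / total_prec
        # Predictive variance: sampling variance plus uncertainty in mu
        var = sigma[i] ** 2 + 1.0 / total_prec
        lpd = -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y[i] - mu_hat) ** 2 / var
        out.append(lpd)
    return out

lpd_pooled = logo_lpd_pooled(y, sigma)
for i, v in enumerate(lpd_pooled):
    print("school", i + 1, round(v, 3))
```

The zap model has no such closed form, so its leave-one-group-out densities would have to come from refitting or from importance sampling on the posterior draws, and with only 8 terms in the sum the comparison will be noisy either way.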