Prior recommendation for scale parameters in hierarchical models too strong?

In the Stan documentation on GitHub, the half-normal(0, 1) or half-t(4, 0, 1) is recommended as the default choice for the prior on scale parameters in hierarchical models. Is the choice of 1 as the sigma of the half-normal and half-t broadly applicable to most problems, or should it be larger (e.g. 10)? For example, when used for a scale parameter in hierarchical logistic regression, I find that the posterior mean of the scale parameter tends to be much larger than 1 and often exceeds 10.

Also, why is it desirable to use a prior whose mode is at zero (half-normal and half-t)? Would a distribution with a positive mode be better?

Those priors assume that the parameters are all roughly on unit scale. If you have bigger parameter values, you’ll need priors with bigger scales to be consistent. Or you can rescale the parameters, which can also simplify Stan’s adaptation.
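One common way to get parameters onto roughly unit scale (a minimal R sketch with made-up data; the variable names are just illustrative) is to standardize the predictors before fitting:

# hypothetical predictor matrix; center and scale each column so that
# coefficients and group-level scales end up roughly on unit scale
X <- matrix(rnorm(100 * 3, mean = 50, sd = 10), ncol = 3)
X_std <- scale(X)        # (x - column mean) / column sd
colMeans(X_std)          # ≈ 0
apply(X_std, 2, sd)      # 1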

If you have that information, then yes. But it probably won’t make much difference until you get into more informative priors.

The main point, as Andrew makes in the papers I linked in the manual’s discussion of priors (in the regression chapter), is that there’s a big difference between a prior that’s consistent with zero (like the half-normal) and one that’s not (like the lognormal); we generally want priors that are consistent with zero unless we know the value’s not going to be zero.

1 Like

I see, thanks for replying so quickly. How about the case where we know the value is more likely to be positive (say 1) than zero? Would a noncentral t with its center at 1 (and truncated at 0) be more appropriate than a half-t or half-normal?

A half-normal(0, 1) has an expected value of about 0.8 (exactly sqrt(2/pi) ≈ 0.7979):

# Monte Carlo estimate of E[|Z|] for Z ~ normal(0, 1); exact value is sqrt(2/pi)
mean(abs(rnorm(1000, 0, 1)))

A half-Cauchy has no finite expected value.
A half-Student-t with df = 4 has an expected value of exactly 1:

# Monte Carlo estimate of E[|T|] for T ~ Student-t with 4 df; exact value is 1
mean(abs(rt(1000, 4)))

Yes, but for these distributions the density is higher at 0 than at 1, even though we would like to include the information that 1 is more likely than 0.

I think you would want to look into boundary avoiding priors like the gamma mentioned in the prior wiki. I think the gamma is not consistent with zero.

I know what you are saying here, but this is maybe not the best way to think about continuous prior distributions. I try to think of it more as setting a prior with, for instance, a 50% chance of being greater than 1 and a 95% chance of being smaller than 5.
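As a small R sketch of that way of thinking, here is one way to back out a half-normal scale from the median statement and then check the tail statement (the 50%/95% numbers are just the example above):

# choose the half-normal(0, s) scale so that P(sigma > 1) = 0.5, i.e. the median is 1
s <- 1 / qnorm(0.75)           # ≈ 1.48, since the half-normal median is s * qnorm(0.75)
1 - (2 * pnorm(1, 0, s) - 1)   # P(sigma > 1) = 0.5
2 * pnorm(5, 0, s) - 1         # P(sigma < 5) ≈ 0.999, comfortably above 0.95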

1 Like

Thanks for providing the link @stijn. What do you mean by “gamma is not consistent with zero”?

Boundary avoiding priors seem to put too little weight on values near zero. Can a truncated noncentral t be used to solve this?

I was not sure either so I looked at the paper. I am still not sure but …

http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf

The key thing is probably Figure 3. The Gamma(2, 0.1) prior has no extreme curvature close to 0, so (the paper argues) the prior is unlikely to dominate the likelihood, compared to other priors. The wiki also says that

which will keep the mode away from 0 but still allows it to be arbitrarily close to the data if that is what the likelihood wants

I guess that means you do not have to worry too much about putting too little weight around zero.

The truncated non-central t-distribution could be another option. I would be reluctant because it feels like you could be influencing the posterior more than you want through the prior choice (more feeling than math). If you bump into computational problems because the prior is too narrow, I think it should be enough to choose a larger scale for the half-t or half-normal (or maybe a lower df for the t).

You got the point: are we steering people toward a bunch of models with too-narrow priors just because a prior fits the model so nicely or because some reference paper used it, like the inv-gamma(0.01, 0.01) historically? And are we at risk of creating a problem similar to the p-value problem in Bayesian estimation when this technology is used by people with insufficient knowledge?

OK thanks @stijn. I am now convinced that there will be sufficient weight around zero and have decided to go with a gamma prior.

1 Like

Yes, there is a risk that people might use priors just because they see them in some reference, without questioning their appropriateness. Perhaps there should be more qualifications when providing recommendations for priors, e.g. that the scale parameter should be chosen based on what is appropriate for your problem.

There are two things that can be very misleading here.

  1. Posterior probability mass is what matters, not density. Mass is density times volume (really, density integrated over the volume). In high dimensions, you usually don’t sample anywhere near the highest-density regions. For more intuition-building exercises, see my case study:

http://mc-stan.org/users/documentation/case-studies/curse-dims.html

Now in the case of the half-normal, the range [0, 0.5] has higher probability than the range [0.5, 1.0], so that’s not the issue here. The issue is that something like Pr[sigma < 1] winds up being pretty similar to Pr[sigma > 1] with a half standard normal (see the quick check after this list). That’s the kind of thing we want to concentrate on, not where the mode is.

  2. Posterior means get shifted by truncation. So even though the mode of a half-normal is at zero, the mean is pushed to the right. In general, when you truncate on the left, mass gets redistributed and the mean shifts to the right. That’s how the mean of a half-normal(0, 1) ends up around 0.8 rather than 0. This is also why truncated interval priors can be so biased compared to their untruncated sources.
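Those tail probabilities can be checked directly in R:

2 * pnorm(1) - 1        # P(sigma < 1) ≈ 0.68 for a half standard normal
2 * (1 - pnorm(1))      # P(sigma > 1) ≈ 0.32
2 * pt(1, df = 4) - 1   # P(sigma < 1) ≈ 0.63 for a half-t with 4 df, for comparison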
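And a quick R illustration of how left-truncation shifts the mean to the right, using a normal(0.5, 1) truncated at zero purely as an example:

x <- rnorm(1e5, mean = 0.5, sd = 1)
mean(x)          # ≈ 0.5, the untruncated mean
mean(x[x > 0])   # ≈ 1.0, shifted right by discarding the mass below zero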

What we recommend instead is scaling the parameters and using the default priors. If you can’t scale the parameters, then you definitely need to scale the priors.

As the value approaches zero, so does the density:

lim_{s -> 0} gamma(s | alpha, beta) = 0.

Lognormal has the same property. Andrew’s papers cover the rate at which the density approaches zero and what effect that has on the posterior.
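For concreteness, a quick R check of the Gamma(2, 0.1) prior mentioned above (assuming the shape–rate parameterization, as in Stan and R’s rate argument):

dgamma(c(0.001, 0.01, 0.1), shape = 2, rate = 0.1)  # density shrinks toward 0 as s -> 0
pgamma(0.1, shape = 2, rate = 0.1)                  # P(sigma < 0.1) ≈ 5e-5
2 * pnorm(0.1) - 1                                  # ≈ 0.08 under a half-normal(0, 1), for comparison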

3 Likes

I see, thanks for explaining that @Bob_Carpenter!

That holds for alpha > 1. In my opinion, one of the better reasons to use the gamma distribution as a prior is that you can make it take a variety of shapes.

These are bad priors. They are good penalties and Andrew was sloppy not distinguishing between the two (there’s a reason that paper only uses them for penalized maximum likelihood).

They are equivalent to saying that you have strong substantive prior knowledge that the standard deviation cannot be near 0.

This is the best advice. Put the prior on the standard deviation, and make sure the prior is scaled appropriately for the data. I don’t love half-normals; I’d prefer a half-t_7 or an exponential prior. You calibrate it so that, if U is an upper bound for the random effect [which is often 1 if the model is scaled], Pr(standard_deviation > U) = 0.05.

This type of tail bound gives some robustness against you specifying U to be too small, but produces a posterior that will only concentrate above U if there is strong evidence in the data.
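A minimal R sketch of that calibration, with U = 1 purely as an example:

U <- 1
s_t <- U / qt(0.975, df = 7)      # half-t_7 scale so that P(sd > U) = 0.05
2 * (1 - pt(U / s_t, df = 7))     # check: 0.05
lambda <- -log(0.05) / U          # exponential rate so that P(sd > U) = exp(-lambda * U) = 0.05
exp(-lambda * U)                  # check: 0.05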

3 Likes

which is appropriate for most hierarchical modeling in biology/medicine/political science/sociology, so I have trouble seeing why you would call that a bad prior (?)

I just don’t agree with that at all. Biology/medicine/poli sci/soc sci are all full of situations where signals are small or group effects are non-existent (complete or almost complete pooling is better than medium or no pooling), both of which require some mass near zero. Also, from a prediction point of view, shrinkage priors give better predictions.

(Sorry for the multiple replies.) The reason for adding a penalty to maximum likelihood is that the mode of the posterior with uniform or log-uniform priors is often at the boundary of the parameter space, which results in a sharp zero (rather than the weak zero that is more appropriate). This means that without boundary avoiding penalization, pMLE systematically under-fits. But that is not a problem you encounter when using Bayes.

[quote=“Daniel_Simpson, post:17, topic:2927, full:true”]
I just don’t agree with that at all. Biology/medicine/poli sci/soc sci are all full of situations where signals are small or group effects are non-existent (complete or almost complete pooling is better than medium or no pooling)
[/quote]

This would be a fun conversation to have in person but I suspect over discourse we’re not going to get anywhere.

ok