Prior recommendation for scale parameters in hierarchical models too strong?


In the Stan documentation on GitHub, half-normal(0,1) or half-t(4,0,1) is recommended as the default prior for scale parameters in hierarchical models. Is the choice of 1 as sigma in the half-normal and half-t broadly applicable to most problems, or should it be larger (e.g. 10)? For example, when it is used for the scale parameter in a hierarchical logistic regression, I find that the posterior mean of the scale parameter tends to be much larger than 1 and often exceeds 10.

Also, why is it desirable to use a prior whose mode is at zero (half-normal and half-t)? Would a distribution with a positive mode be better?


Those priors assume that the parameters are all roughly on the unit scale. If your parameter values are bigger, you’ll need wider priors to be consistent with them. Or you can rescale the parameters, which can also simplify Stan’s adaptation.
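To make the scale issue concrete, here is a rough sketch in base R of how much prior mass each default puts above 10, the value the posterior means in the question often exceed. The half-normal(0, 10) row is only an illustrative wider alternative, not a recommendation from the Stan documentation.

```r
# Prior probability that a scale parameter exceeds 10.
# For a half-distribution, P(sigma > q) is twice the upper tail
# of the corresponding symmetric distribution.
p_exceeds_10 <- c(
  "half-normal(0, 1)"  = 2 * pnorm(10, mean = 0, sd = 1, lower.tail = FALSE),
  "half-t(4, 0, 1)"    = 2 * pt(10, df = 4, lower.tail = FALSE),
  "half-normal(0, 10)" = 2 * pnorm(10, mean = 0, sd = 10, lower.tail = FALSE)
)
print(signif(p_exceeds_10, 3))
```

With unit scale, the half-normal treats values above 10 as essentially impossible, so a posterior that sits there is being fought by the prior.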

On using a prior with a positive mode: if you have that information, then yes. But it probably won’t make much difference until you get into more informative priors.

The main point, as Andrew makes in the papers I linked in the manual discussion of priors (in the regression chapter), is that there’s a big difference between a prior that’s consistent with zero (like half-normal) and one that’s not (like lognormal). We generally want priors that are consistent with zero unless we know the value’s not going to be zero.


I see, thanks for replying so quickly. How about the case where we know the value is more likely to be positive (say 1) than zero? Would a noncentral t with its center at 1 (and truncated at 0) be more appropriate than a half-t or half-normal?


A half-normal(0,1) has an expected value of about 0.80 (exactly sqrt(2/pi)).

mean(abs(rnorm(1000, 0, 1)))

A half-Cauchy(0,1) has no finite expected value; its median is 1.
A half-student-t with df=4 has an expected value of 1.

mean(abs(rt(1000, 4)))
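With only 1000 draws the simulated means are noisy, so it may help to compute the same quantities without simulation. A minimal sketch in base R; the variable names are only illustrative.

```r
# Exact mean of a half-normal(0, 1): sqrt(2 / pi)
half_normal_mean <- sqrt(2 / pi)

# Mean of a half-t with 4 degrees of freedom, by numerically
# integrating x * 2 * dt(x, 4) over (0, Inf)
half_t4_mean <- integrate(function(x) x * 2 * dt(x, df = 4), 0, Inf)$value

# A half-Cauchy(0, 1) has no finite mean; its median is the
# 75% quantile of the full Cauchy
half_cauchy_median <- qcauchy(0.75)

print(c(half_normal_mean = half_normal_mean,
        half_t4_mean = half_t4_mean,
        half_cauchy_median = half_cauchy_median))
```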


Yes, but for these distributions the density is higher at 0 than 1, even though we would like to include the information that 1 is more likely than 0.
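To illustrate the point, the densities at 0 and 1 can be compared directly. This is a small sketch in base R with helper names of my own choosing.

```r
# The density of a half-distribution on x >= 0 is twice the
# density of the symmetric distribution it is folded from
half_normal_density <- function(x) 2 * dnorm(x, mean = 0, sd = 1)
half_t4_density <- function(x) 2 * dt(x, df = 4)

x <- c(0, 1)
density_table <- rbind(half_normal = half_normal_density(x),
                       half_t4 = half_t4_density(x))
colnames(density_table) <- c("at_0", "at_1")
print(round(density_table, 3))
```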


I think you would want to look into boundary avoiding priors like the gamma mentioned in the prior wiki. I think the gamma is not consistent with zero.

I know what you are saying here, but this is maybe not the best way to think about continuous prior distributions. I try to think more in terms of setting a prior with, for instance, a 50% chance of being greater than 1 and a 95% chance of being smaller than 5.
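One way to work in those terms is to tabulate the two probabilities for a few candidate priors and see which comes closest to what you believe. A rough sketch in base R; the list of candidates is only an example, and gamma(2, 0.1) is assumed to be in shape/rate form as in Stan.

```r
# P(sigma <= q) for half-distributions with scale s (lower bound at 0)
p_half_normal <- function(q, s) 2 * pnorm(q / s) - 1
p_half_t <- function(q, df, s) 2 * pt(q / s, df = df) - 1
p_half_cauchy <- function(q, s) 2 * pcauchy(q / s) - 1

# Candidate priors, each represented by its CDF
candidates <- list(
  "half-normal(0, 1)" = function(q) p_half_normal(q, 1),
  "half-t(4, 0, 1)"   = function(q) p_half_t(q, 4, 1),
  "half-Cauchy(0, 1)" = function(q) p_half_cauchy(q, 1),
  "gamma(2, 0.1)"     = function(q) pgamma(q, shape = 2, rate = 0.1)
)

# One row per prior: chance of exceeding 1 and chance of staying below 5
summary_table <- t(sapply(candidates, function(cdf) {
  c(p_greater_than_1 = 1 - cdf(1), p_less_than_5 = cdf(5))
}))
print(round(summary_table, 3))
```

If none of the rows matches the target probabilities, changing the scale argument and rerunning is usually quicker than reasoning about the density at individual points.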


Thanks for providing the link @stijn. What do you mean by “gamma is not consistent with zero”?

Boundary avoiding priors seem to put too little weight on values near zero. Can a truncated noncentral t be used to solve this?


I was not sure either so I looked at the paper. I am still not sure but …

The key thing is probably Figure 3. The Gamma(2, 0.1) prior has no extreme curvature close to 0, so (the paper argues) the prior is unlikely to dominate the likelihood, compared to other priors. The wiki also says that this is a prior

"which will keep the mode away from 0 but still allows it to be arbitrarily close to the data if that is what the likelihood wants"

I guess that means you do not have to worry too much about putting too little weight around zero.
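A quick numerical look at the Gamma(2, 0.1) prior near zero, as a sketch in base R. I am assuming the shape/rate parameterization that Stan uses for the gamma.

```r
# Gamma(2, 0.1) in shape/rate form (assumed, as in Stan's gamma)
gamma_density <- function(x) dgamma(x, shape = 2, rate = 0.1)

# The density is exactly 0 at the boundary and rises roughly
# linearly for small x, with no spike or steep curvature
x <- c(0, 0.01, 0.1, 1, 10)
print(data.frame(x = x, density = gamma_density(x)))

# The mode of a gamma with shape > 1 is (shape - 1) / rate
gamma_mode <- (2 - 1) / 0.1

# Prior mass below 0.1: small, but not zero
mass_near_zero <- pgamma(0.1, shape = 2, rate = 0.1)
print(c(mode = gamma_mode, mass_near_zero = mass_near_zero))
```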

The truncated non-central t-distribution could be another option. I would be reluctant because it feels like you could be influencing the posterior more than you want by the prior choice (more feeling than math). If you bump into computational problems because the prior is too narrow, I think it should be enough to choose a higher scale for the half t or half normal (or maybe a lower df for the t).
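For comparison, here is a sketch in base R of the two options side by side. I am reading "noncentral t centered at 1" as a Student-t with location 1 truncated below at 0, which is an assumption on my part; the function names and the scale of 3 are only illustrative.

```r
# Student-t with location 1 and scale 1, truncated below at 0.
# This is an assumed reading of "noncentral t centered at 1".
trunc_t_density <- function(x, df = 4, location = 1, scale = 1) {
  # Mass of the untruncated t that lies above 0
  normalizer <- 1 - pt((0 - location) / scale, df = df)
  ifelse(x < 0, 0, dt((x - location) / scale, df = df) / (scale * normalizer))
}

# Half-t with an adjustable scale, the alternative suggested above
half_t_density <- function(x, df = 4, scale = 1) {
  ifelse(x < 0, 0, 2 * dt(x / scale, df = df) / scale)
}

x <- c(0, 1, 5)
comparison <- rbind(truncated_t = trunc_t_density(x),
                    half_t_scale_1 = half_t_density(x, scale = 1),
                    half_t_scale_3 = half_t_density(x, scale = 3))
colnames(comparison) <- paste0("x_", x)
print(round(comparison, 3))
```

The truncated version does put more density at 1 than at 0, while the wider half-t simply flattens the prior over the same range.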


You got the point. Are we steering toward a bunch of models in which people use priors that are too narrow, because such a prior makes the model fit nicely or because some reference paper used it, as happened historically with inv-gamma(0.01,0.01)? And are we at risk of creating a problem in Bayesian estimation similar to the one with p-values, when this technology is used by people with insufficient knowledge?


OK, thanks @stijn. I am now convinced that there will be sufficient weight around zero and have decided to go with a gamma prior.


Yes, there is a risk that people might use a prior just because they see it in some reference, without questioning whether it is appropriate. Perhaps there should be more qualifications when providing recommendations for priors, e.g. that the scale parameter should be chosen based on what is appropriate for your problem.