After experimenting with a number of different hierarchical models, I have determined that the problems I am working on are particularly sensitive to the selection of priors (particularly when there are few groups at levels 2 and 3 of the model). These findings are consistent with the statistical literature, but that leaves me wondering what my options are for priors.
For the models I am estimating, non-centered parameterization is necessary to approach adequate recovery of the mean structure, and so I’m wondering what options I have for priors on the random effects, and how the use of a non-centered parameterization might (or might not) limit my options.
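For concreteness, the equivalence behind the non-centered parameterization – that theta = mu + sigma * theta_raw with theta_raw ~ Normal(0, 1) has the same distribution as theta ~ Normal(mu, sigma) – can be checked numerically. A minimal NumPy sketch (the values of mu and sigma are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 2.0, 1.5, 200_000  # illustrative values

# Centered: draw the group-level effect directly.
theta_centered = rng.normal(mu, sigma, size=n)

# Non-centered: draw a standard-normal "raw" effect and rescale.
theta_raw = rng.normal(0.0, 1.0, size=n)
theta_noncentered = mu + sigma * theta_raw

# Both parameterizations imply the same marginal distribution,
# but the non-centered version decouples theta_raw from sigma,
# which is what helps the sampler when there are few groups.
print(theta_centered.mean(), theta_noncentered.mean())
print(theta_centered.std(), theta_noncentered.std())
```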
From reading the Stan manual's sections on non-centered parameterizations and looking across this forum (including insights from @bgoodri, @betanalpha, @bbbales2, and others), it appears that, where the means at each level are assumed to be normally distributed, the following are candidate priors for the standard deviations:

- half-Normal
- half-Cauchy
- exponential
My questions are as follows:
1. Is my reading of the available materials/literature correct, and are these all reasonable candidate priors?
2. Would it also be reasonable to place an Inverse-Gamma(0.5, 0.5) prior on the standard deviation?
3. Would it be reasonable to consider placing a gamma prior on the inverse of the variance (I have seen some examples of this in code transitioning older BUGS code to RStan on GitHub)?
4. Are there any other prior distributions that I may want to consider (setting aside those that might be data-augmented/informed)?
I would say they could be admissible, but only you can say if they are reasonable. My guess would be that these families could be reasonable for some values of the hyperparameters but not necessarily the ones you have listed. You should draw from the prior predictive distribution and see whether things look reasonable based on what you believe about the data-generating process.
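As a minimal sketch of that workflow (the model structure and hyperparameters here are purely illustrative, not a recommendation): draw the group-level sd from the candidate hyperprior, simulate a dataset from it, and inspect the implied scale of the data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_groups, n_per_group = 1000, 8, 20

# Hypothetical hyperprior on the group-level sd: half-Normal(0, 1).
sigma_group = np.abs(rng.normal(0.0, 1.0, size=n_sims))

# Prior predictive draws: group means, then observations within groups
# (unit observation-level sd, purely for illustration).
y_sd = np.empty(n_sims)
for s in range(n_sims):
    mu_g = rng.normal(0.0, sigma_group[s], size=n_groups)
    y = rng.normal(mu_g[:, None], 1.0, size=(n_groups, n_per_group))
    y_sd[s] = y.std()

# If most simulated datasets are far more dispersed than anything you
# believe about the data-generating process, the hyperprior is too wide.
print(np.percentile(y_sd, [5, 50, 95]))
```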
That is pushing it. With shape and scale both 0.5 the distribution has no finite moments, so all you are really doing is pinning down the prior mode at 0.5 / (0.5 + 1) = 1/3.
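A quick numerical check of the mode claim (the mode of an Inverse-Gamma(a, b) is b / (a + 1), so a = b = 0.5 gives 1/3), sketched with SciPy:

```python
from scipy.stats import invgamma
from scipy.optimize import minimize_scalar

a, b = 0.5, 0.5  # Inverse-Gamma(0.5, 0.5)

# Find the mode numerically by maximizing the density.
res = minimize_scalar(lambda x: -invgamma.pdf(x, a, scale=b),
                      bounds=(1e-6, 10.0), method="bounded")
print(res.x)  # the maximizer, which should sit at b / (a + 1)
```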
No. That is a BUGS-ism and has no general justification outside a Gibbs sampler.
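To make the BUGS convention being referred to concrete: a Gamma(a, b) prior on the precision tau = 1/sigma² is exactly an Inverse-Gamma(a, b) prior on the variance, which a quick sampling check confirms (the hyperparameters here are arbitrary):

```python
import numpy as np
from scipy.stats import gamma, invgamma

rng = np.random.default_rng(2)
a, b = 2.0, 1.0  # illustrative hyperparameters

# BUGS-style: Gamma(a, b) prior on the precision tau = 1 / sigma^2 ...
tau = gamma.rvs(a, scale=1.0 / b, size=100_000, random_state=rng)
variance = 1.0 / tau

# ... which is the same model as an Inverse-Gamma(a, b) prior on the
# variance itself: the empirical and theoretical quantiles agree.
q = [0.25, 0.5, 0.75]
print(np.quantile(variance, q))
print(invgamma.ppf(q, a, scale=b))
```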
That is consistent with what I was seeing in some initial trial runs, though I didn't attribute that to the selection of the prior. I found that the scale of the random effects at the highest levels of the models got pinned at (or close to) the mode, and the random effects at level 1 of the model were positively biased. What you say makes good sense.
Also, thanks for the insight that placing a prior on the inverse of the variance is only justified in the context of BUGS/Gibbs sampling.
Your comment that there are an “infinite number” of priors that could be considered is well-taken. Perhaps my question wasn’t well-formed. Although there are an infinite number of available priors, it seems as though there is a much smaller set of “default selections” (either referring to what selections are available, by default, in statistical programs; or referring to the limited number of choices practitioners often select when applying Bayesian methods).
Perhaps a better question is if you have any suggestions for other families of priors (outside of those listed above) which may offer better starting points for estimating hierarchical models with few groups at higher levels.
The hyperpriors in hierarchical models are extremely important and yet are often taken for granted. In particular they require careful consideration of the relevant domain expertise to set well. The population variance in particular sets the scale for the possible heterogeneity in the population.
Zero-avoiding priors prevent there being no heterogeneity in the population. This can be useful if your previous experiments or theoretical knowledge enforce a certain known amount of heterogeneity. To be honest, however, situations where this information is available are highly exceptional – even if you know that there is some heterogeneity, it's very hard to place a limit on just how much there has to be. Consequently I would not recommend zero-avoiding priors.
Zero-including priors, like the half-Normal, half-Cauchy, and exponential, are nice in that they include the no-heterogeneity case. Consequently the corresponding hierarchical model can be considered an expansion around the simpler model with no heterogeneity, which helps limit overfitting and other modeling problems.

The main differences between these priors lie in their tails. The half-Cauchy is a problem – the tail is way too heavy and permits models with extreme heterogeneity. In fact the heaviness of the tail actually drags the posterior up to those extreme values even if the data don't quite need them.

That leaves the half-Normal and exponential. The exponential has some nice theoretical properties (see @anon75146577 et al's PC priors paper for some examples, https://arxiv.org/abs/1403.4630), but honestly in practice there isn't a huge difference between the two. I like the half-Normal for its tighter tail, but both work reasonably well in practice provided that you have chosen a reasonable scale below which you want to constrain the possible heterogeneity.
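The tail-weight difference is easy to quantify by computing the prior mass above some extreme scale, here P(sigma > 10) for unit-scale versions of each family (the threshold and scales are my own illustration):

```python
from scipy.stats import halfnorm, halfcauchy, expon

threshold = 10.0  # an "extreme heterogeneity" scale, for illustration

# Survival function sf(x) = P(sigma > x) under each unit-scale prior.
for name, dist in [("half-Normal", halfnorm(scale=1.0)),
                   ("half-Cauchy", halfcauchy(scale=1.0)),
                   ("exponential", expon(scale=1.0))]:
    print(f"P(sigma > {threshold:g}) under {name}: {dist.sf(threshold):.2e}")
```

The half-Cauchy puts several percent of its prior mass above ten prior scales, while the half-Normal puts essentially none there, with the exponential in between.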
Ultimately each choice of prior introduces different information into the analysis and the prior that will be best for your application will depend on what information is available to you and what information your likelihood needs to ensure that the posterior is well-behaved (for example, fewer groups implies that you need more careful hyperpriors).
@betanalpha - Thank you very much for your comments and insights.
I have to admit that in the small sample literature I have been reading (primarily psychometric literature), I have seen few mentions of the importance of hyperpriors. That said, perhaps I am not reading closely enough or authors may not be emphasizing their specification? I also recognize that different models (and estimation frameworks) have different levels/degrees of traction within different disciplines, so if there might be another paper that is particularly good at highlighting the importance of hyperpriors - I welcome any and all suggestions! I will absolutely look into the @anon75146577 et al paper.
This is certainly consistent with what I have seen in my initial explorations. I have estimated the same model specifying half-Normal and half-Cauchy priors, with the latter specification resulting in higher numbers of divergences and also greater bias in the estimation of both the variances and the fixed effects.
Nope – unfortunately discussion of this sort is in general lacking in most applied literature! Dan’s paper contains some really important insights and some of us are trying to write more on this topic as quickly as we can!
Great. In general I’d recommend starting with simple priors (and light tails) and adding structure only if you need it (using posterior predictive checks to guide the identification of said needs).
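As a minimal illustration of what such a posterior predictive check looks like (a deliberately simple conjugate Normal model with known unit sd; every choice here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data from a simple model: y_i ~ Normal(mu, 1).
y = rng.normal(1.0, 1.0, size=50)
n = y.size

# Conjugate posterior for mu under a Normal(0, 10^2) prior (known unit sd):
# posterior precision = n + 1/100, posterior mean = post_var * sum(y).
post_var = 1.0 / (n + 1.0 / 100.0)
post_mean = post_var * y.sum()

# Posterior predictive replications and a simple check statistic (the sd).
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=2000)
y_rep = rng.normal(mu_draws[:, None], 1.0, size=(2000, n))
p_value = (y_rep.std(axis=1) >= y.std()).mean()
print(p_value)  # values near 0 or 1 would signal misfit in this statistic
```

In a hierarchical setting the same idea applies with a statistic sensitive to the between-group heterogeneity, e.g. the spread of the group means.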