The Stan wiki currently contains this quote:
Aki prefers student_t(3,0,1), something about some shape of some curve, he put it on the blackboard and I can’t remember
Does anyone know what that special “something” is, or can anyone help the author remember? I’m trying to learn about Student-t priors specifically and am very interested in the reasoning behind this one.
I don’t remember why Aki liked 3 in particular for the degrees of freedom, but the appeal of the t-distribution for priors is that the degrees of freedom parameter functions like a dial you can turn to make the distribution more Cauchy-like (df -> 1) or more Gaussian-like (df -> Inf). A small value like 3 results in a prior with pretty fat tails compared to a normal distribution, but not as extreme as the Cauchy.
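To put rough numbers on that dial, here is a quick scipy sketch (my own illustration, not anything from Aki) comparing how much mass each choice of degrees of freedom leaves beyond 5 scale units:

```python
# Tail mass P(|X| > 5) for standardized Student-t priors with different
# degrees of freedom, compared against a standard normal.
from scipy import stats

for df in [1, 3, 7, 30]:
    tail = 2 * stats.t.sf(5, df)      # two-sided tail probability
    print(f"student_t(df={df}): P(|X| > 5) = {tail:.2e}")

print(f"normal:            P(|X| > 5) = {2 * stats.norm.sf(5):.2e}")
```

df = 1 (the Cauchy) leaves more than 10% of its mass out there, df = 3 around 1.5%, and the normal essentially nothing.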
Having degrees of freedom greater than 1 ensures a finite mean and having it greater than 2 ensures a finite variance as well. So with the degrees of freedom a bit above 2 your tails are incredibly heavy but not obscenely heavy, which is what I would guess was Aki’s reasoning. People argue about values ranging from 3 to 7, and some charlatans like me just stick with normals because they feel that the benefits of heavier tails don’t outweigh their complications.
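A quick scipy check of those moment conditions (just a sketch of the textbook facts, not part of the original post):

```python
# Mean and variance of a standardized Student-t: the mean is finite only for
# df > 1 and the variance only for df > 2; scipy reports the rest as nan/inf.
from scipy import stats

for df in [1, 1.5, 2, 3, 7]:
    mean, var = stats.t.stats(df, moments="mv")
    print(f"df={df}: mean={mean}, var={var}")
```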
Could you expand a little bit on your reasoning? From your case study, I took away that the normal could be too constraining compared to heavier-tailed priors. At least that is what I conclude from the failure mode sections.
Yes. The idea is that the thick tail reflects our uncertainty in the prior scale, and if we have underestimated the prior scale, the thick tail makes it easier to detect prior-likelihood conflict. I had good experiences with t_3 or t_4 when working a lot with GPs, where a thick tail sometimes really described our prior information about some covariance function parameters well. See O’Hagan (1979), On Outlier Rejection Phenomena in Bayes Inference, JRSSB, 41(3):358-367, for more on the benefits of thick tails in the case of Student’s t.
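A toy grid computation (my own sketch, not from the paper or the thread) shows the effect: with a prior scale that is badly underestimated, a student_t(3, 0, 1) prior largely gets out of the way of the likelihood, while a normal(0, 1) prior with the same scale drags the posterior into a region neither the prior nor the data supports.

```python
# One datum at 20 with measurement sd 2, against a prior centred at 0 with
# scale 1: compare the posterior mean under a normal and a Student-t(3) prior.
import numpy as np
from scipy import stats

theta = np.linspace(-20, 40, 6001)              # parameter grid
lik = stats.norm.pdf(20.0, loc=theta, scale=2)  # likelihood of the datum

for name, prior in [("normal(0, 1)", stats.norm.pdf(theta, 0, 1)),
                    ("student_t(3, 0, 1)", stats.t.pdf(theta, 3, 0, 1))]:
    post = lik * prior
    post /= post.sum()                          # normalise on the uniform grid
    print(f"{name:>18} prior -> posterior mean {np.sum(theta * post):.1f}")
```

Under the t prior the posterior mean lands near the data (roughly 19); under the normal prior it gets pulled to around 4, far from both the prior and the likelihood.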
A possible complication that I know of (or remember) is computational issues. A heavy-tailed prior combined with weak information from the likelihood (due to weak data or weak identifiability in the parameterization) can lead to a heavy-tailed posterior. For example, the dynamic HMC used in Stan is much better than many other MCMC algorithms at sampling from thick-tailed distributions, but it still has problems at least in the case of the Cauchy. The nice property that a thick-tailed prior is robust to a misspecified scale can also lead to multimodality, which can also cause computational problems. I also recommend normals (and half-normals) because of these computational issues. This is especially recommended when you have good prior information on the scale, or if you know that the result is not going to be sensitive if you set the scale to a much larger value. Using a normal prior changes a bit how you diagnose a misspecified prior scale, but that is not a big issue.
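A toy illustration of the multimodality point (my own sketch; the particular numbers are made up just to show the effect): a Cauchy prior centred at 0 combined with a moderately conflicting normal likelihood produces a posterior with two local modes, one near the prior and one near the data.

```python
# Unnormalised log posterior on a grid: Cauchy(0, 1) prior plus a normal
# log likelihood for a datum at 15 with sd 4.5, then find interior local maxima.
import numpy as np
from scipy import stats

theta = np.linspace(-5, 25, 6001)
log_post = (stats.cauchy.logpdf(theta, 0, 1)
            + stats.norm.logpdf(15.0, loc=theta, scale=4.5))

mid = log_post[1:-1]
modes = theta[1:-1][(mid > log_post[:-2]) & (mid > log_post[2:])]
print("local posterior modes near:", np.round(modes, 1))
```

Swapping the Cauchy for a normal(0, 1) prior on the same grid gives a single mode, since the product of two normal densities is again normal in the parameter.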
The weakly-informative priors that we typically talk about enforce “containment” of the posterior. The shape of the prior density impacts just how strong this containment is.
Lighter tails, like Gaussians, offer stronger containment. This prevents the posterior from stretching to extreme regions of parameter space, but if the scale is wrong then that prior containment can conflict with the likelihood.
Heavier tails, like the Cauchy, offer very weak containment. This weaker containment offers less resistance to the likelihood, so in the case of an overaggressive scale the likelihood can still dominate the posterior. On the other hand, in the case of a diffuse likelihood the posterior will follow those heavy tails towards more extreme values. This may not sound bad, but most people have trouble grasping just how heavy those tails are! A Cauchy density with location 0 and scale 1 has appreciable mass stretching all the way out to 100, and even a bit near 1000! The model configurations out in those tails can be all kinds of problematic; for example, they might cause intermediate calculations like ODE solvers or algebraic solvers to fail.
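Those tail numbers are easy to check (a quick scipy sketch, not part of the original post):

```python
# Two-sided tail mass of a unit Cauchy at increasingly extreme thresholds,
# with a standard normal for comparison.
from scipy import stats

for x in [10, 100, 1000]:
    print(f"Cauchy(0, 1): P(|X| > {x}) = {2 * stats.cauchy.sf(x):.5f}")

print(f"Normal(0, 1): P(|X| > 10) = {2 * stats.norm.sf(10):.1e}")
```

Roughly 6% of the Cauchy’s mass lies beyond 10, 0.6% beyond 100, and 0.06% beyond 1000, while the normal puts essentially nothing beyond 10.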
Stan will sample from a Cauchy just fine, so it’s not the heavy tails themselves that worry me but rather what the extreme model configurations far in the tails can do to the overall stability of the model. I much prefer the safety of the stronger containment of the Gaussian, coupled with a careful analysis of the posterior shapes relative to the prior shapes to identify any misplaced scales (which is pretty straightforward to see). But, again, that’s just my opinion. Everyone approaches modeling differently.