One way to reframe this is to emphasize that Bayesian inference requires a prior distribution over the entire model configuration space. That distribution can be constructed from independent priors on each of the nominal parameters, but such a prior is often not consistent with the entirety of our domain expertise. We not only need to carefully investigate the consequences of the prior model but also recognize that concepts like “informative” are context dependent; a prior that seems “noninformative” in some directions can be very informative (in bad ways) in other directions.
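For a concrete (if exaggerated) sketch of that last point, consider independent uniform priors: they look “noninformative” component by component, yet they push forward to a very concentrated, and hence very informative, prior on the average of the components.

```stan
parameters {
  // Nominally "noninformative" independent uniform(0, 1) priors,
  // implied here by the interval constraints alone
  vector<lower=0, upper=1>[100] theta;
}
generated quantities {
  // The pushforward prior of this average concentrates tightly around 0.5,
  // i.e. the prior is very informative in this direction of the
  // model configuration space
  real theta_bar = mean(theta);
}
```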
This is the critical point, and it is why statements like “putting a prior on a generated quantity” don’t actually make sense and, in my opinion, why they are extremely dangerous.
In Bayesian inference we never actually “assign a prior” to individual variables. Such a concept is not parameterization invariant, and the posterior distribution never actually utilizes any such independence structure; it is constructed only from the joint prior distribution over the entire model configuration space.
The only way to formalize a concept like “assigning a prior to a variable” is to consider the marginal behavior of the joint prior model. In particular, we can say that a prior model \pi(\theta) assigns a prior \pi(\vartheta) to the variable \vartheta = f(\theta) if \pi(\theta) pushes forward (or marginalizes) to \pi(\vartheta). As @jsocolar notes, however, the heuristic often advertised as “putting a prior on a variable” does not actually achieve this in any mathematically self-consistent way.
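To make “pushes forward” a little more explicit in density form (just the standard change-of-variables statement, recorded here for reference), the pushforward prior density of \vartheta = f(\theta) is

$$
\pi(\vartheta) = \int \mathrm{d}\theta \, \pi(\theta) \, \delta\big( \vartheta - f(\theta) \big),
$$

which, when f is one-dimensional, invertible, and differentiable, reduces to the familiar Jacobian formula

$$
\pi(\vartheta) = \pi\big( f^{-1}(\vartheta) \big) \, \left| \frac{\mathrm{d} f^{-1}}{\mathrm{d} \vartheta}(\vartheta) \right|.
$$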
What’s actually happening under these heuristics is the specification of a prior model through a non-generative prior density function. Instead of specifying the prior with conditional probability density functions consistent with the generative structure of the model, the prior is specified with some arbitrary function over the entire model configuration space. That function might include component functions that depend on only one variable at a time, but those component functions cannot be interpreted as priors in any meaningful way; they are functions of generated quantities, not interpretable priors. Note that this is one place where the ~ notation in the Stan Modeling Language can be abused:
```stan
a ~ normal(0, 1);
f(a) ~ normal(0, 1);
```
is valid Stan (within a model block, given a suitable function f) but neither line has a meaningful probabilistic interpretation on its own. Avoiding the syntactic sugar,
```stan
target += normal_lpdf(a | 0, 1);
target += normal_lpdf(f(a) | 0, 1);
```
better communicates that we’re just building up a joint function over the parameter space.
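For a self-contained sketch of the same point, using sum(a) as a stand-in for f(a): the second increment below is often described as “putting a prior on the generated quantity sum(a)”, but all it does is multiply the joint prior density by another factor. In this Gaussian case one can work out that the resulting pushforward of sum(a) is a normal with variance 5/6, not the normal(0, 1) that was nominally “assigned” to it.

```stan
parameters {
  vector[5] a;
}
model {
  // Two increments to the same joint log prior density over a
  target += normal_lpdf(a | 0, 1);
  // This does not assign sum(a) a normal(0, 1) prior; it just adds
  // another term to the joint log density, and the actual pushforward
  // prior of sum(a) ends up narrower than normal(0, 1).
  target += normal_lpdf(sum(a) | 0, 1);
}
```

A quick prior-only run of this program, with no data, makes the discrepancy straightforward to check empirically.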
Technically there’s nothing wrong with heuristically building up the joint prior density function all at once. Ignoring the generative structure can make it harder to build reasonable prior models, but with enough prior checking useful heuristics can certainly be developed. In other words, while this approach may not be recommended, especially as a default, it could be useful in certain circumstances, at least when interpreted correctly and never thought of as “putting priors on generated quantities”.
Another fun example, for anyone who is interested in empirical investigations, is

```stan
parameters {
  ordered[10] x;
}
model {
  x ~ normal(0, 1);
}
```
The interaction between the constraint and this non-generative prior has some counterintuitive behaviors!
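One way to explore it empirically (a sketch, not part of the original example) is to run that program with no data and compare the marginals of the components of x both to a standard normal and to a reference simulation added in generated quantities:

```stan
parameters {
  ordered[10] x;
}
model {
  // The non-generative density over the ordered space
  x ~ normal(0, 1);
}
generated quantities {
  // Reference draw for comparison: ten iid normal(0, 1) variates
  // sorted into increasing order
  vector[10] x_ref;
  {
    vector[10] y;
    for (n in 1:10) {
      y[n] = normal_rng(0, 1);
    }
    x_ref = sort_asc(y);
  }
}
```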
Discussing this from different perspectives is always beneficial, so I would be happy for anyone to write a blog post.
In my opinion case studies are tricky because they carry an air of authority that can be misinterpreted without the right context (for example, in my opinion the “Prior Best Practices” document is a mess because it aggregates too many heuristics without the needed context).