Do I need Jacobian adjustment here?

One way to reframe this is to emphasize that Bayesian inference requires a prior distribution over the entire model configuration space. That can be constructed from independent priors on each of the nominal parameters, but such a prior is often not consistent with the entirety of our domain expertise. We not only need to carefully investigate the consequences of the prior model but also recognize that concepts like “informative” are context dependent, and a prior that seems “noninformative” in some directions can be very informative (in bad ways) in other directions.

This is the critical point, and it is why statements like “putting a prior on a generated quantity” don’t actually make sense and, in my opinion, why they are extremely dangerous.

In Bayesian inference we never actually “assign a prior” to individual variables. Such a concept is not parameterization invariant and the posterior distribution never actually utilizes any such independence structure – it’s constructed only from the joint prior distribution over the entire model configuration space.

The only way to formalize a concept like “assigning a prior to a variable” is to consider the marginal behavior of the joint prior model. In particular we can say that a prior model \pi(\theta) assigns a prior \pi(\vartheta) to the variable \vartheta = f(\theta) if \pi(\theta) pushes forward (or marginalizes) to \pi(\vartheta). As @jsocolar notes, however, the heuristic often advertised as “putting a prior on a variable” does not actually achieve this in any mathematically self-consistent way.
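To make the pushforward concept concrete, here is a small Monte Carlo sketch in Python/NumPy (the map f and the prior here are my own hypothetical choices, not from the discussion above): the prior “assigned” to \vartheta = f(\theta) is whatever the prior on \theta pushes forward to, not whatever density we might have wanted to write down for \vartheta.

```python
import numpy as np

# Prior model: theta ~ normal(0, 1); generated quantity: vartheta = f(theta) = theta^2.
# The pushforward of a standard normal through x -> x^2 is chi-square with one
# degree of freedom, which has mean 1 and variance 2 -- not anything normal.
rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, size=1_000_000)  # draws from the prior on theta
vartheta = theta**2                           # pushforward draws of f(theta)

print(vartheta.mean(), vartheta.var())        # close to 1 and 2, respectively
```

In other words, once the prior on \theta is fixed, the marginal prior on any function of \theta is fixed too; we can check what it is, but we cannot independently “assign” it.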

What’s actually happening under these heuristics is the specification of a prior model through a non-generative prior density function. Instead of specifying the prior with conditional probability density functions consistent with the generative structure of the model, the prior is specified with some arbitrary function over the entire model configuration space. That function might include component functions that depend on only one variable at a time, but those functions cannot be interpreted as priors in any meaningful way; they are functions over generated quantities, not interpretable priors. Note that this is one place where the ~ notation in the Stan Modeling Language can be abused.

a ~ normal(0, 1);
f(a) ~ normal(0, 1);

is a valid Stan program, but neither line has a meaningful probabilistic interpretation on its own. Avoiding the syntactic sugar,

target += normal_lpdf(a | 0, 1);
target += normal_lpdf(f(a) | 0, 1);

better communicates that we’re just building up a joint function over the parameter space.
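The title question can then be answered numerically. The following Python/NumPy sketch (f(a) = a^3 is my own hypothetical choice) compares the prior implied by `target += normal_lpdf(f(a) | 0, 1)` alone against the same increment plus the log-Jacobian correction log|df/da| = log(3 a^2):

```python
import numpy as np

a = np.linspace(-3.0, 3.0, 400_001)       # fine uniform grid over the parameter a
f = a**3                                  # the "generated quantity" x = f(a)

def norm_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

q_no_jac = norm_pdf(f)                    # density from normal_lpdf(f(a) | 0, 1) alone
q_jac = norm_pdf(f) * 3.0 * a**2          # same increment plus the Jacobian |df/da|

def pushforward_sd(q):
    """Standard deviation of x = f(a) under the normalized density prop. to q(a)."""
    w = q / q.sum()                       # uniform grid, so plain sums suffice
    m1 = (f * w).sum()
    m2 = (f**2 * w).sum()
    return np.sqrt(m2 - m1**2)

print(pushforward_sd(q_jac))              # ~1: with the Jacobian, f(a) is standard normal
print(pushforward_sd(q_no_jac))           # well below 1: without it, f(a) is not normal
```

Only with the Jacobian term does the density on a push forward to a standard normal on f(a); the bare `normal_lpdf(f(a) | 0, 1)` increment defines some other, unintended prior.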

Technically there’s nothing wrong with heuristically building the joint prior density function all at once. Ignoring the generative structure can make it harder to build reasonable prior models, but with enough prior checking useful heuristics can certainly be developed. In other words, while this approach may not be recommended, especially as a default, it could be useful in certain circumstances – at least when interpreted correctly and never thought of as “putting priors on generated quantities”.

Another fun example for anyone who is interested in empirical investigations is

parameters {
  ordered[10] x;
}

model {
  x ~ normal(0, 1);
}

The interaction between the constraint and this non-generative prior has some counterintuitive behaviors!
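For anyone who wants to peek at the answer, here is an empirical sketch in Python/NumPy rather than Stan: the density that program defines on `ordered[10] x` is ten iid normal(0, 1) densities restricted to the ordered set, which is exactly the distribution of ten sorted iid normal draws. The marginal prior on each component is therefore an order statistic, not normal(0, 1).

```python
import numpy as np

# Draw ten iid normal(0, 1) variates and sort them; this samples exactly from
# the iid normal density restricted (and renormalized) to the ordered set,
# i.e. from the prior that the Stan program above actually defines.
rng = np.random.default_rng(0)
x = np.sort(rng.normal(0.0, 1.0, size=(1_000_000, 10)), axis=1)

# The extreme components are centered far from zero (the expected minimum and
# maximum of ten standard normals are about -1.54 and +1.54), even though a
# normal(0, 1) density was "written down" for every component.
print(x[:, 0].mean(), x[:, 9].mean())
```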

Discussion of these issues from different perspectives is always beneficial, so I would be happy for anyone to write a blog post.

In my opinion case studies are tricky because they carry an air of authority that can be misinterpreted without the right context (for example, in my opinion the “Prior Best Practices” document is a mess because it aggregates too many heuristics without the needed context).
