Do I need Jacobian adjustment here?

Hi,

I am estimating a model where one variable of interest is obtained by dividing two other variables, of which one involves a non-linear transform.

Here is a pseudo-model trying to illustrate this:

data {
  int N;
  vector[N] X;
  vector[N] Y;
}
parameters {
  real a;
  real log_b;
  real<lower=0> sigma;
}
model {
  real b = exp(log_b);
  real a_div_b = a / b;
  sigma ~ normal(0, 1);
  a ~ normal(0, 1);
  log_b ~ normal(0, 1);
  Y ~ normal(a + b * X, sigma); // X (not Y) as the predictor
  a_div_b ~ normal(0, 1); // Jacobian adjustment?
}

It seems to me that I would need a Jacobian adjustment here, but I am not sure (I read the relevant section of the manual and remain unsure).


No. f(a, b) = \frac{a}{b} is non-invertible. You would need to extend f to be invertible, i.e. f_{e}(a, b) = (f(a, b), g(a, b)) where f_{e}: \mathbb{R}^{2} \to \mathbb{R}^{2} (or equivalent) for a Jacobian adjustment to make sense.

What you’re doing is putting a prior on a derived quantity, which is fine.


Another nice comment about it here: https://statmodeling.stat.columbia.edu/2019/08/23/yes-you-can-include-prior-information-on-quantities-of-interest-not-just-on-parameters-in-your-model/#comment-1106048


Yes, it was clear to me that I can put a prior on a derived quantity.
What I was uncertain about is whether one needs a Jacobian adjustment when the derived quantity involves a non-linear transform of a parameter.
Maybe the key is that I am using a non-linear transform not of a parameter directly, but of a variable calculated from a parameter and data? Intuitively, this is why I thought it might be OK.

The intuition is that you want to "incorporate some information about \frac{a}{b}, where a and b are parameters of the model", not that you would like to "incorporate information about some invertible function of a or b".

Edit: Apologies, I had myself confused there and posted something incorrect, which I have since deleted.


One way to look at it is that it depends on what you mean by “prior”. Technically, by putting priors on a and log(b), the distribution of a/b is already fully defined. Putting another prior on it will change the distribution of a and log(b); your “priors” are effectively contradicting each other. You have two parameters but are using three “priors” (apart from sigma). So if by “prior” you mean the distribution you would get if you left out the likelihood, then that’s not what your code does. If, however, you see the “prior” on a/b as additional information on top, which modifies the otherwise fully prior-specified model, then it’s correct.
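To make this concrete with a quick simulation (a sketch with made-up standard normal scales, not the actual model): draw (a, log_b) from the declared priors and treat the extra normal density on a/b as an importance weight; the marginal spread of a visibly shrinks.

```python
import math
import random

random.seed(1)

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Draw from the declared priors: a ~ normal(0, 1), log_b ~ normal(0, 1).
N = 100_000
a = [random.gauss(0, 1) for _ in range(N)]
log_b = [random.gauss(0, 1) for _ in range(N)]

# The extra statement "a_div_b ~ normal(0, 1)" multiplies the joint density
# by one more factor, i.e. it acts as an importance weight here.
w = [normal_pdf(ai / math.exp(li)) for ai, li in zip(a, log_b)]
W = sum(w)

sd_before = math.sqrt(sum(ai * ai for ai in a) / N)
mean_after = sum(wi * ai for wi, ai in zip(w, a)) / W
sd_after = math.sqrt(sum(wi * (ai - mean_after) ** 2 for wi, ai in zip(w, a)) / W)

print(sd_before)  # close to 1, the declared prior scale of a
print(sd_after)   # noticeably smaller: the third "prior" has changed a's marginal
```

So the line `a ~ normal(0, 1)` no longer describes the marginal distribution of a once the third statement is added.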

I wonder if there is vocabulary to differentiate between these two different kinds of “priors”.


I was curious about general use cases for Jacobian adjustments, and this was also helpful. So maybe

needed:

  1. priors that go beyond the parameters block
  2. the Jacobian of the reparameterization is non-constant (e.g. a nonlinear transformation)
  3. the Jacobian of the reparameterization is constant but a Bayes factor is needed

not needed:

  1. priors restricted to the parameters block
  2. the Jacobian of the reparameterization is constant and only inference is needed

I see the prior on a/b as additional information.
The issue I observe is that I can get bimodal posterior distributions of the a/b quantity, which is of interest to me, if I don’t put an additional prior on it.
My intuition is that the problem occurs because a/b has a ratio distribution with undefined moments and heavy tails, which makes sampling difficult. Putting an additional prior on a/b makes the posterior of a/b “better behaved”.
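For what it’s worth, the heavy tail of that pushforward is easy to see by simulation (a sketch assuming standard normal priors on a and log b, not the actual model):

```python
import math
import random

random.seed(0)

# Pushforward of a/b when a ~ normal(0, 1) and log(b) ~ normal(0, 1):
# inspect quantiles rather than moments, since moments of such ratio
# distributions need not exist.
N = 100_000
ratio = sorted(random.gauss(0, 1) / math.exp(random.gauss(0, 1)) for _ in range(N))

def quantile(xs, q):
    return xs[int(q * (len(xs) - 1))]

print(quantile(ratio, 0.25), quantile(ratio, 0.75))    # the bulk is modest...
print(quantile(ratio, 0.001), quantile(ratio, 0.999))  # ...while the tails are far out
```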

Taken together, priors on parameters, priors on nonlinear transformations of parameters, priors on derived quantities, and Jacobian adjustments collectively yield the prior predictive distribution. If you’re getting the prior predictive distribution that you intend, then no additional changes are necessary.

The situation where Jacobian adjustments are unambiguously necessary is if you wish the prior predictive distribution for a nonlinear transformation f(x) to correspond precisely to the probability density function that you use as a prior on f(x). Thus, if you write f(x) \sim Distribution and you want the prior predictive distribution for f(x) to correspond to Distribution, then you definitely need a Jacobian adjustment. In general, you can only achieve this if f is continuously differentiable and one-to-one. On the other hand, writing f(x) \sim Distribution (without any Jacobian adjustment) still yields a prior, and if this happens to be a convenient trick to write down your desired prior on x then there’s nothing wrong with that. Likewise, if specifying priors for x and some derived quantity g(x) (with or without a Jacobian adjustment) yield the desired prior predictive distribution for x, there’s nothing wrong with it. However, in this situation:

  • you cannot expect the prior predictive distribution for x to correspond to the line of code that appears to place a prior on x
  • you cannot expect the prior predictive distribution for g(x) to correspond to the line of code that appears to place a prior on g(x)
  • I think (but I’d be happy to be corrected) that you can expect the prior predictive distribution for x to correspond to the normalized distribution proportional to priorPDF_1(x)*priorPDF_2(g(x)), with no adjustment.
  • likewise, I think that if g is one-to-one, then you can expect the prior predictive distribution for g(x) to be proportional to priorPDF_1(x)*\frac{priorPDF_2(g(x))}{JacobianAdjustment}. Thus, if you include the Jacobian adjustment in the model, the prior predictive distribution for g(x) will be proportional simply to priorPDF_2(g(x))*priorPDF_1(g^{-1}(g(x))) = priorPDF_2(g(x))*priorPDF_1(x)
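As a toy numerical illustration of the third bullet (my own example with a hypothetical g(x) = \exp(x), not something from the model above), one can normalize the product density on a grid and see that the pushforward for x no longer matches the apparent prior line on x:

```python
import math

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# Writing
#   x ~ normal(0, 1);
#   exp(x) ~ normal(0, 1);  // no Jacobian term
# builds the unnormalized density p(x) proportional to phi(x) * phi(exp(x)).
# Normalize it on a grid to see the pushforward it implies for x.
step = 0.001
xs = [-8.0 + i * step for i in range(16001)]
w = [phi(x) * phi(math.exp(x)) for x in xs]
Z = sum(w) * step
mean_x = sum(x * wi for x, wi in zip(xs, w)) * step / Z

# Under "x ~ normal(0, 1)" alone the mean would be 0; the extra factor
# penalizes large exp(x) and drags the marginal of x well below zero.
print(mean_x)
```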

(Sorry for all the edits; I had some things substantially wrong in the first iteration of this reply).


@jsocolar Thanks for the nice summary!

But a/b is not a model parameter from the sampler’s point of view; it is a derived quantity. Thus its distribution has no effect on how difficult the model is for the sampler. If you have a sampling problem it must come from the variables that are actually sampled, which are a, log_b and sigma. If your prior on a/b changes how the sampler copes with the model, it’s because it changes the distributions of these other variables.

What’s the problem with it being bimodal?


Thanks @jsocolar this was very useful. I had not thought of putting this into the context of the prior predictive distribution.


The problem is that the derived quantity is of interest, so I’d like it to converge. Without the additional prior I get high Rhat values for it.

My expression “bimodal” was not optimal. It is more accurate to say that a chain sometimes does not find the typical set, by which I mean that after warm-up the chain does not explore the region of the parameter space with the best log posteriors, but remains in a region with slightly inferior log posteriors.
This region corresponds to the (fat) tail of the ratio distribution described above (dotted blue line in the attached figure). Putting an additional prior on the derived quantity that works against this fat tail (dotted red line) improves sampling in that the chains reliably find/explore the typical set.

The attached figure shows the implied prior of the generated quantity in blue, the additional prior on the generated quantity in red, and the effective prior in black.


I don’t know how closely the pseudo-model you gave above corresponds to your actual model, but: what’s the reasoning behind the lognormal prior on b? As this is the only thing that differentiates the model from ordinary Gaussian models (which are very well behaved), your problem probably has to do with it.

I guess that making the prior on log_b narrower would probably stabilize your model too.

Anyway, as you already marked your question as solved I assume that you don’t need further help? Or are you still trying to find a better solution?
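The narrower-prior suggestion is easy to check by simulating the pushforward of a/b under the two candidate priors (a sketch with assumed scales, not the actual model):

```python
import math
import random

random.seed(7)

def tail_quantile(sd_log_b, q=0.999, n=200_000):
    # Pushforward samples of a/b with a ~ normal(0, 1), log(b) ~ normal(0, sd_log_b).
    r = sorted(random.gauss(0, 1) / math.exp(random.gauss(0, sd_log_b))
               for _ in range(n))
    return r[int(q * (n - 1))]

wide = tail_quantile(1.0)     # a wide prior on log_b
narrow = tail_quantile(0.25)  # a much narrower prior on log_b
print(wide, narrow)           # the narrow prior pulls the extreme tail in substantially
```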

I shouldn’t have introduced the lognormal prior, because it’s not really central to my question. Here is a hopefully clearer way to describe my question: I have three generated quantities of interest: a, b, and a/b, whereby

a = alpha_a + beta_a * X
b = exp(alpha_b + beta_b * X)

alpha_. and beta_. are parameters with normal priors and X is data. I use a log link function for b because it has to be positive for physical reasons. In addition, I am putting a lognormal prior on the generated quantity a/b.
I wondered if I might need a Jacobian adjustment, because I am putting a prior on a generated quantity that involves a non-linear transform of the parameter alpha_b.

Because many common misunderstandings (especially in some of the external links) have been passed along in this thread, let me try to clarify a few very important points.

I know everyone hates probability theory but this is one of those cases where it can’t be ignored. A prior model is specified by a probability distribution. In practice we often specify that distribution with a probability density function, but the relationship between the prior distribution and the prior density function is complex. Specifically if we reparameterize the model the prior distribution transforms naturally but the prior density function does not. Indeed the weird transformation properties of a probability density functions are where Jacobian corrections arise. See for example https://betanalpha.github.io/assets/case_studies/probability_theory.html#42_probability_density_functions.

This tenuous relationship becomes even more subtle when we consider pushforward distributions, which is the technical name for the induced distribution over a generated quantity. Prior distributions push forward in a natural way, but prior density functions do not.

Evaluating prior pushforward distributions is a critical step in prior modeling because these distributions encode the, often unintended, consequences of a prior model built up from independent prior distributions over the nominal parameters. In other words, when considering only the nominal parameters the prior model seems reasonably compatible with our domain expertise, but if we consider its consequences further down the generative model then we start to see inconsistencies with other domain expertise that we had not yet utilized. For much more see https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#11_domain_expertise_consistency and I. J. Good’s Good Thinking.

The challenge then is implementing prior pushforward checks, and resolving any inconsistencies, in a mathematically self-consistent way. When evaluating prior pushforward checks the awkward transformation properties of prior density functions can be avoided entirely by utilizing samples, which do transform more naturally; again see the above case study for examples.

So from this context we can say that @Guido_Biele built an initial prior model for a and b only to notice that the prior pushforward model for a / b was inconsistent with the available domain expertise. This prior pushforward check could have been implemented with generative sampling or by working out an analytic probability density function for a / b, which would involve an intermediate transformation, Jacobian correction, and then marginalization (for example, if a and b were given independent normal densities then a/b would end up having a Cauchy density; when a and b themselves are derived this kind of analytic calculation becomes much, much more difficult).
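The Cauchy claim for the simple independent-normal case is easy to verify by sampling; the standard Cauchy has quartiles at exactly ±1, which gives a convenient check:

```python
import random

random.seed(42)

# Monte Carlo check: the ratio of two independent standard normals is
# standard Cauchy, whose CDF 1/2 + atan(x)/pi puts the quartiles at -1 and +1.
N = 200_000
r = sorted(random.gauss(0, 1) / random.gauss(0, 1) for _ in range(N))
q25, q75 = r[N // 4], r[3 * N // 4]
print(q25, q75)  # close to -1 and +1 respectively
```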

Now comes the hard part – how do we build a prior model that’s compatible with the available domain expertise on a, b, and a / b?

One option discussed is to try to “invert” a prior density function on a / b to define “corrections” to the prior density functions on a and b. The math, however, is very clear that this won’t work. For one, the mapping (a, b) \mapsto a / b is not invertible, as @hhau noted, so there are infinitely many prior density functions on a and b that could achieve the desired pushforward behavior for a / b. For two, most of these inversions would not be compatible with the original domain expertise about a and b!

The problem with many of the heuristic suggestions based on “priors over generated quantities are fine” is that they ignore this consistency problem. A prior over generated quantities changes the prior over the input variables, often in undesirable ways. Critically, the mathematical effects of this change are moderated by that Jacobian correction, which consequently cannot be ignored!

If one does want to attempt a heuristic prior model of this form then one has to rely heavily on prior pushforward checks, in particular pushing the heuristic prior model forward to each variable of interest (including parameter and intermediate quantities) to verify sufficient consistency. The challenge with this approach is that it’s difficult to resolve any inconsistencies that might be found – because the prior model doesn’t correspond to a well-defined generative structure it’s not clear how the different behaviors relate and hence how to modify the initial heuristic prior model.

Instead I strongly recommend taking advantage of the existing generative structure and using any inconsistencies in the prior pushforward distributions to motivate changes to the priors on each parameter. For example, if the pushforward distribution of a / b is too diffuse then one could be more careful about the prior modeling for a and b, working out narrower independent prior density functions or even a positively-correlated prior density function that suppresses configurations where a fluctuates up while b fluctuates down (which would otherwise induce large changes in the ratio).


Thanks @betanalpha, this is helpful!

I had earlier thought along these lines

I had decided to keep the priors for a and b (maybe too) wide, to make it easier to show to readers sceptical about Bayesian analysis and prior choices that the data inform the posterior much more than the prior does. @betanalpha’s reply makes clear that this creates a different problem.

When the fat tail of the pushforward distribution of a/b is problematic, one potential solution is to reduce the prior probability of very low values of b. (I am not sure this will work, because the pushforward distribution is a ratio distribution.)


It might be clarifying to provide a concrete example of a prior on a generated quantity in a mathematically simple context where the Jacobian adjustment doesn’t matter.

Suppose I have a 3-element parameter vector v whose sum is necessarily zero. Suppose also that my domain knowledge suggests standard normal priors on each element of v. Note that the first two elements of v uniquely define the third, so I might think to declare parameters v_1 and v_2, and then set v_3 equal to the negative sum of the first two. Thus, I might write down a naive Stan program that attempts to place a standard normal prior on each element of v as follows:

// Warning: this program doesn't give the desired result!
parameters{
  vector[2] v;
}
model{
  v ~ normal(0,1);
  sum(v) ~ normal(0,1); // this is equivalent to a standard normal prior on the negative sum
}
generated quantities{
  real v3 = -sum(v); // negative sum, so all three elements sum to zero
}

This of course does not yield standard normal marginal distributions for any element of v. The whole issue has to do with the fact that there’s a prior on a generated quantity (there’s no Jacobian adjustment to worry about here, or if you prefer, the Jacobian adjustment is trivial).

In this example, it’s easy enough to write down a prior that yields the desired marginal distributions for every element of v; we just need a multivariate normal for v_1 and v_2 with variance-covariance matrix \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}.
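That covariance matrix can be verified with two lines of arithmetic; with v_3 = -(v_1 + v_2),

```python
# With v3 = -(v1 + v2) and the proposed covariance for (v1, v2):
#   Var(v3)     = Var(v1) + Var(v2) + 2 Cov(v1, v2)
#   Cov(v1, v3) = -Var(v1) - Cov(v1, v2)
var_v1 = var_v2 = 1.0
cov_12 = -0.5
var_v3 = var_v1 + var_v2 + 2 * cov_12
cov_13 = -var_v1 - cov_12
print(var_v3, cov_13)  # → 1.0 -0.5
```

so v3 is standard normal and every pair of elements has covariance -0.5, as the symmetry of the sum-to-zero constraint demands.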

But if we wanted to, we could also just play with the sd parameter in our Stan code above until we hit on something like

parameters{
  vector[2] v;
}
model{
  v ~ normal(0,1.22);
  sum(v) ~ normal(0,1.22);
}
generated quantities{
  real v3 = -sum(v);
}

When we have a prior expectation about something more complicated than the sum of two parameters, this sort of heuristic tinkering might (or might not) be an easier route towards an accurate prior than attempting to specify the right kind of multivariate generative prior (the required distribution might not correspond to anything that is easily implemented). However, as @betanalpha cautions, tinkering in this way requires checking the pushforward distributions (what I had previously called the prior predictive distributions) for every quantity of interest about which domain expertise can be brought to bear. Also, the parameter space over which we need to tinker can be very large. In the above example, we just had to tinker in one dimension, since symmetry considerations imply that all elements of v (including v_3) need the same prior.
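For anyone curious where the 1.22 comes from: assuming the two sampling statements with common scale s simply add their log densities, the implied precision matrix for (v_1, v_2) is \frac{1}{s^2}\begin{bmatrix}2 & 1\\1 & 2\end{bmatrix}, whose inverse has diagonal 2s^2/3; setting that to 1 gives s = \sqrt{3/2} \approx 1.22. A small check:

```python
import math

def marginal_sd(s):
    # Precision matrix Q = (1/s^2) * [[2, 1], [1, 2]] from summing the two
    # quadratic forms; invert the 2x2 by hand and read off the diagonal.
    q11, q12 = 2 / s**2, 1 / s**2
    det = q11 * q11 - q12 * q12
    return math.sqrt(q11 / det)  # = sqrt(2 s^2 / 3)

s = math.sqrt(1.5)     # the exact value behind "1.22"
print(round(s, 2))     # → 1.22
print(marginal_sd(s))  # ≈ 1.0
```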


This is a great thread, and it’s something that would be really useful as a case study or, at the very least, a blog post on https://blog.mc-stan.org/.

It could pull together all the info about priors on derived quantities, a bit about pushforward distributions, and an expanded version of the example from @jsocolar. I can attempt it, but thought I’d give priority to @jsocolar @betanalpha @Guido_Biele.

One way to reframe this is to emphasize that Bayesian inference requires a prior distribution over the entire model configuration space. That can be constructed from independent priors on each of the nominal parameters, but such a prior is often not consistent with the entirety of our domain expertise. We not only need to carefully investigate the consequences of the prior model but also recognize that concepts like “informative” are context dependent and a prior that seems “noninformative” in some directions can be very informative (in bad ways) in other directions.

This is the critical point and why statements like “putting a prior on a generated quantity” don’t actually make sense, and hence in my opinion why they are extremely dangerous.

In Bayesian inference we never actually “assign a prior” to individual variables. Such a concept is not parameterization invariant and the posterior distribution never actually utilizes any such independence structure – it’s constructed only from the joint prior distribution over the entire model configuration space.

The only way to formalize a concept like “assigning a prior to a variable” is to consider the marginal behavior of the joint prior model. In particular we can say that a prior model \pi(\theta) assigns a prior \pi(\vartheta) to the variable \vartheta = f(\theta) if \pi(\theta) pushes forward (or marginalizes) to \pi(\vartheta). As @jsocolar notes, however, the heuristic often advertised as “putting a prior on a variable” does not actually achieve this in any mathematically self-consistent way.

What’s actually happening under these heuristics is the specification of a prior model through a non-generative prior density function. Instead of specifying the prior with conditional probability density functions consistent with the generative structure of the model, the prior is specified with some arbitrary function over the entire model configuration space. That function might include component functions that depend on only one variable at a time, but those functions cannot be interpreted as priors in any meaningful way; they are functions over generated quantities, not interpretable priors. Note that this is one place where the ~ notation in the Stan Modeling Language can be abused.

a ~ normal(0, 1);
f(a) ~ normal(0, 1);

is a valid Stan program, but neither line has a meaningful probabilistic interpretation on its own. Avoiding the syntactic sugar,

target += normal_lpdf(a | 0, 1);
target += normal_lpdf(f(a) | 0, 1);

better communicates that we’re just building up a joint function over the parameter space.

Technically there’s nothing wrong with heuristically building the joint prior density function all at once. Ignoring the generative structure can make it harder to build reasonable prior models, but with enough prior checking useful heuristics can certainly be developed. In other words, while this approach may not be recommended, especially as a default, it could be useful in certain circumstances. At least when interpreted correctly and never thought of as “putting priors on generated quantities”.

Another fun example for anyone who is interested in empirical investigations is

parameters {
  ordered[10] x;
}

model {
  x ~ normal(0, 1);
}

The interaction between the constraint and this non-generative prior has some counterintuitive behaviors!
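For anyone who wants the punchline without running Stan (my reading of the example, not spelled out above): the constrained target is the product of standard normal densities restricted to the ordered region, which is exactly the law of ten iid standard normals after sorting, so the marginals are order statistics rather than normal(0, 1):

```python
import random
import statistics

random.seed(3)

# Marginal of the smallest element: the minimum of 10 iid standard normals,
# whose mean is about -1.54, nowhere near the nominal normal(0, 1).
N = 50_000
smallest = [min(random.gauss(0, 1) for _ in range(10)) for _ in range(N)]
m = statistics.mean(smallest)
print(m)
```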

Discussion from different perspectives is always beneficial, so I would be happy for anyone to write a blog post.

In my opinion case studies are tricky because they carry an air of authority which can be misinterpreted without the right context (for example, in my opinion the “Prior Best Practices” document is a mess because it aggregates too many heuristics without the needed context).
