Jacobian in reparameterizations

Hello all,

I have a question regarding the inclusion of the Jacobian in the Stan log probability. There is a nice discussion in the Stan User’s Guide about when we should include the Jacobian and when we shouldn’t: Stan User’s Guide

In summary, one needs to add the Jacobian correction whenever the result of the transformation is given a distribution:


  real log_y;
  log_y = log(y);
  log_y ~ normal(mu, sigma);
  target += -log_y; // Jacobian adjustment: log |d(log y)/dy| = -log(y)

rather than applying the transformation to the sampled variable:

x ~ normal(0,1);
y = x*std + mu; 
target += 0; // no Jacobian correction needed here
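For concreteness, complete versions of both patterns might look like this (a sketch; mu and sigma are treated as data, and I write sigma for the std above):

// Pattern 1: the transformed value is given a distribution, so the
// Jacobian is added; y then has a lognormal(mu, sigma) distribution.
data {
  real mu;
  real<lower=0> sigma;
}
parameters {
  real<lower=0> y;
}
model {
  real log_y = log(y);
  log_y ~ normal(mu, sigma);
  target += -log_y; // Jacobian: log |d(log y)/dy| = -log(y)
}

// Pattern 2: the usual non-centered parameterization; y is a deterministic
// transform with no distribution attached, so no Jacobian is needed.
data {
  real mu;
  real<lower=0> sigma;
}
parameters {
  real x;
}
transformed parameters {
  real y = x * sigma + mu;
}
model {
  x ~ normal(0, 1);
}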

However, I don’t understand why the latter does not require us to add the Jacobian. As far as I know, in my latter code, the probability density of y is given by:

p(y \mid \mu, \sigma^2) = \mathcal{N}(x \mid 0, 1) \left| \frac{\mathrm{d} f(x)}{\mathrm{d} x} \right|^{-1}

where f(x) = x\sigma + \mu. The log probability of y is computed from the log probability of x plus the log Jacobian term resulting from transforming x into y. Why don’t we need to add the Jacobian correction in this latter case in Stan? Is it because it gets cancelled when computing the Metropolis–Hastings accept/reject ratio?

Thank you!

In the latter case, nothing is changing the target log density. You could just as easily have put it in generated quantities, which never gets touched by HMC.

There’s a more nuanced question, though, about exactly when you need the Jacobian. Take a look at this post, it may help: http://models.street-artists.org/2017/05/30/once-more-about-stan-and-jacobian-warnings/

Thanks for the answer, I will check that post. After a while I found this while googling: Demystifying Jacobian adjustment for transformed parameters in Stan, which seems to nicely explain the difference between applying the transformations in the ways I have outlined above (i.e., what Stan is doing under the hood).

1 Like

The probability density of y is given by

p(y \mid \mu, \sigma) = \text{normal}(f^{-1}(y) \mid 0, 1) \cdot \left| \frac{ \mathrm{d} f }{ \mathrm{d} x} \right|^{-1}(f^{-1}(y)).

Now

\begin{align*} \text{normal}(f^{-1}(y) \mid 0, 1) &= \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{1}{2} \left( f^{-1}(y) - 0 \right)^{2} \right) \\ &= \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{1}{2} \left( \frac{y - \mu}{\sigma} - 0 \right)^{2} \right) \\ &= \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^{2} \right) \end{align*}

This almost looks like a normal density function; the only problem is the normalization. Fortunately we still have that Jacobian determinant term,

\begin{align*} \left| \frac{ \mathrm{d} f }{ \mathrm{d} x} \right|^{-1}(f^{-1}(y)) &= \left| \frac{ \mathrm{d} (\mu + \sigma \cdot x) }{ \mathrm{d} x } \right|^{-1}(f^{-1}(y)) \\ &= \left| \sigma \right|^{-1}(f^{-1}(y)) \\ &= \frac{1}{\sigma}, \end{align*}

which is exactly the missing normalization!

In other words we’ve just proved that

\begin{align*} p(y \mid \mu, \sigma) &= \text{normal}(f^{-1}(y) \mid 0, 1) \cdot \left| \frac{ \mathrm{d} f }{ \mathrm{d} x} \right|^{-1}(f^{-1}(y)) \\ &= \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^{2} \right) \cdot \frac{1}{\sigma} \\ &= \frac{1}{\sqrt{2\pi \sigma^{2}}} \exp \left( -\frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^{2} \right) \\ &= \text{normal}(y \mid \mu, \sigma). \end{align*}

In other words

\text{normal}(x \mid 0, 1) \\ y = \mu + \sigma \cdot x

is equivalent to

\text{normal}(y \mid \mu, \sigma)

exactly because of the Jacobian!

We don’t need a Jacobian in the model here because we’re not trying to force any behavior on y = \mu + \sigma \cdot x. Instead we’re forcing behavior on x and letting Stan figure out the induced behavior on y. If we want to work out that induced behavior analytically we have to include the Jacobian as above.

Jacobians are needed when we have a density that depends on a derived quantity, \pi(y). In general this will not induce behavior on x such that the behavior of the derived quantity y = f(x) is quantified by \pi(y). Instead the induced behavior is quantified by

\pi_{\text{induced}}(x) = \pi(f(x))

which then implies a density

\begin{align*} \pi_{\text{induced}}(y) &= \pi_{\text{induced}}(f^{-1}(y)) \left| \frac{ \mathrm{d} f}{\mathrm{d} x} \right|^{-1} (f^{-1}(y)) \\ &= \pi(f \circ f^{-1}(y)) \left| \frac{ \mathrm{d} f}{\mathrm{d} x} \right|^{-1} (f^{-1}(y)) \\ &= \pi(y) \left| \frac{ \mathrm{d} f}{\mathrm{d} x} \right|^{-1} (f^{-1}(y)) \end{align*}

for the derived quantity. Note that here the Jacobian term skews the behavior of y away from the expected \pi(y)! If we had included the inverse Jacobian determinant correction, as we should have, it would have canceled with this extra term to give \pi_{\text{induced}}(y) = \pi(y) as desired.
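As a concrete Stan sketch (hypothetical, with f = log): placing the density on the derived quantity alone skews the behavior of y, and the inverse Jacobian determinant correction restores the intended lognormal distribution:

parameters {
  real<lower=0> y;
}
model {
  log(y) ~ normal(0, 1); // density on a derived quantity; alone, this skews y
  target += -log(y);     // inverse Jacobian correction; now y ~ lognormal(0, 1)
}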

5 Likes

Thanks for your explanation, @betanalpha.

The only thing I don’t understand is what you mean by “derived quantity” here: Jacobians are needed when we have a density that depends on a derived quantity, \pi(y)

From what I understand, \pi(y) is the distribution induced by transforming x with f(), or equivalently y with f^{-1}(), which would be something similar to my analytic example where f() is given by a linear transformation.

In Stan the target distribution is defined via a probability density function specified in the model block over the parameter space specified in the parameters block. Any function of the parameters (other than the identity function) is considered a derived quantity.

Any statement like

target += example_lpdf(parameters | anything);

doesn’t require a Jacobian, but any statement like

real derived = function(parameters);
target += example_lpdf(derived | anything);

will require a Jacobian correction if you actually want the marginal density for derived to be example.
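For example, with a single parameter theta and f = exp (both hypothetical), the corrected version of the second pattern would read:

real derived = exp(theta);
target += example_lpdf(derived | anything);
target += theta; // log Jacobian: log |d exp(theta) / d theta| = theta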

One of the sources of potential confusion here is that there are many density functions around. Given a nice transformation f : X \rightarrow Y, any probability density function on X, \pi(x), defines an equivalent density function on Y, \pi_{\text{induced}}(y). But we can also define an infinite number of density functions on Y that have no relation to \pi(x). Plugging x = f^{-1}(y) into \pi(x) defines a probability density function on Y, but not the equivalent density function. The equivalent density function requires a Jacobian correction.
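(As a hypothetical check with f = \exp and \pi = \text{normal}(0, 1): plugging in x = f^{-1}(y) = \log y gives \text{normal}(\log y \mid 0, 1), whereas the equivalent density function on Y is \text{normal}(\log y \mid 0, 1) \cdot \frac{1}{y}, i.e. the lognormal density; the two differ exactly by the Jacobian factor 1/y.)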

3 Likes

Am I correct that something like (silly example for clarity):

model {
  vector[n] derived_var = (data_var - parameter_1) / parameter_2;
  target += std_normal_lupdf(derived_var);
}

doesn’t need a Jacobian adjustment because the operations producing derived_var are linear transformations?

That target, written out with the sum over the elements of derived_var made explicit, is

target += -0.5 * dot_self(derived_var);

I guess this is meant to be a different way to write

target += normal_lpdf(data_var | parameter_1, parameter_2);

but that’s the same as

target += -n * log(parameter_2) - 0.5 * dot_self((data_var - parameter_1) / parameter_2);

That extra -n * log(parameter_2) is the log Jacobian correction for dividing each of the n elements by parameter_2.
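A sketch of the model block that keeps the derived-variable form but actually matches normal_lpdf would then be (with n the length of data_var):

model {
  vector[n] derived_var = (data_var - parameter_1) / parameter_2;
  target += std_normal_lupdf(derived_var);
  target += -n * log(parameter_2); // log Jacobian: each element is divided by parameter_2
}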

2 Likes

Technically one always needs Jacobian corrections, but because Stan requires target densities to be specified only up to constants, Jacobian corrections that reduce to constant values can be ignored. A Jacobian correction is constant if and only if the transformation is linear.

Division, however, is not a linear operation and instead introduces a 1 / parameter_2 Jacobian correction.
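Written out for the example above, per element of data_var:

\log \left| \frac{\partial \, \text{derived\_var}_i}{\partial \, \text{data\_var}_i} \right| = \log \frac{1}{\text{parameter\_2}} = -\log(\text{parameter\_2}),

which varies with parameter_2 and hence cannot be dropped as a constant.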

2 Likes