Jacobian adjustment in the acceptance probability

I’m interested in the theory behind the HMC algorithm and the change of variable.

Let say, we transform a set of constrained parameter with the following transformation: eta = f(theta), such that eta is unconstrained. The likelihood is invariant to this transformation but the prior must be multiplied by the determinant of the Jacobian (derivative of theta = f^-1(eta) with respect to eta). Therefore we must add the log of the Jacobian to the log posterior. The gradient of the log posterior must take into account the Jacobian adjustment? Then to have the gradient in the unconstrained domain we must multiplied it by the Jacobian due to chain rule right?
These Jacobian adjustments give the log posterior and its gradient in the unconstrained domain which is what we want for the proposal distribution.

Do we have to have to also add the Jacobian of the transformation (derivative of eta = f(theta) with respect to theta) in the acceptance probability such that we target the posterior of the constrained space (the original parameters).

Am I getting the theory right?

Let say, we transform a set of constrained parameter with the following transformation: eta = f(theta), such that eta is unconstrained.

Because Stan samples on the unconstrained space, I’m going to assume eta are the Stan parameters.

The likelihood is invariant to this transformation but the prior must be multiplied by the determinant of the Jacobian

I wouldn’t separate these two in your head. They mix together. NUTS works off the total log probability density, likelihoods and priors and whatever all get mixed up. Stan models can end up looking pretty weird.

Therefore we must add the log of the Jacobian to the log posterior.

There’s not enough information about the model so far to assume that this is necessary. You gotta keep your goal in mind. In certain inferences with certain models, you won’t need the adjustment to get the answer you seek. In other cases, you do.

The way I look at it is, with a sampling algorithm, if you need to evaluate any probability of theta (your transformed variable) , you’ll need the adjustment (this includes putting a prior on theta). Otherwise, you don’t. (If theta is on the left hand side of a sampling statement, you’ll need the transform)

From the manual (section Change of Variables vs. Transformations):

A transformation samples a parameter, then transforms it, whereas a change of variables transforms a parameter, then samples it. Only the latter requires a Jacobian adjustment.

There’s various other tidbits around the manual on this. Big thing is just go try implementing stuff yourself. Try coding up the lower/upper bound constraints in the Stan manual and that’ll help you get it straight (Chapter 34. Transformations of Constrained Variables). This conversation comes up a bunch. Easiest way to figure it out is just go walk through some examples.

if p(data,theta) dtheta is a measure on the parameter space theta, and Stan wants to sample in an unconstrained space eta then the probability that Stan needs to enforce is that

p(data,theta(eta))dtheta is still the probability to be in a micro-volume dtheta even though we’re moving around in eta space

The probability for the sampler to be in a volume deta needs to be

p(data,theta(eta)) dtheta(eta)/deta deta = p(data,theta) dtheta(deta)

where by dtheta(eta)/deta in the general multivariate case we use the determinant of the jacobian matrix dtheta_i/deta_j because this determinant measures the scaling of the n dimensional volume mapping deta -> dtheta, Note that the Jacobian determinant is a function of the eta in general. For linear transformations, it’s a constant, and so can be dropped because constants don’t matter for MCMC.

This all makes perfect sense in nonstandard analysis because now those symbols “dtheta/deta” are in fact the ratios of infinitesimal numbers. The whole calculation is just straightforward algebra.

The jacobian only comes into play when we are transforming the space in which we’re sampling. If you sample in say “foo” space and then want to say p(bar | q(foo)) = something then the fact that you’re transforming foo through q doesn’t matter because it doesn’t alter the probability for the sampler to be in a tiny volume dfoo, it alters the probability for bar, which is altered in precisely the way that you want it to be because the probability of bar depends on the value q(foo) by definition of your chosen model.

The way Stan works, is it chooses unconstraining transforms to create an “eta” space which is the whole real line in each dimension, and then it samples in that eta space. Then when it needs to spit out a theta value, it uses the constraining transform to calculate theta(eta) and spits out that value.

So, Stan behind the scenes is adding the log(dtheta/deta) values to the lp value so that eta samples in the right distribution to give theta(eta) the distribution you asked for.

If you do any transformations of variables yourself, then you’ll have to decide what they mean to you. But the principle is that if you know the measure on a transformed space, in order to get the transformed value to have that measure, the space in which you’re sampling has to have the measure p(theta(eta)) dtheta/deta deta just think of all those d values as infinitesimal numbers and the algebra becomes clear (and this is how nonstandard analysis works).

This is how probability densities (and their gradients) behave in general, regardless of whether we are using them in HMC or not. As @bbbales2 notes it’s easier to simply reason about the total posterior directly instead of prior * likelihood as the latter has some subtle structure that ultimately cancels out but can make arguments more tricky than they need to be.

If we were doing Metropolis-Hastings then no, we wouldn’t need to account for the Jacobian because the acceptance probability is a function on parameter space, not a density. Like the likelihood in your argument, the acceptance probability behaves the same way provided that you compose it with the transformation (to be technically we don’t say that the functions are invariant per your terminology, but rather covariant).

Another way to see that is the acceptance probability can be written as a ratio of probability densities, so if you were to add the Jacobian it would cancel out.

All that said, the dynamic HMC implementation we use in Stan is not actually a Metropolis-Hastings method. It does use probabilities to sample points from intermediate trajectories, but those probabilities are not the class Metropolis acceptance probability.

Thank you everyone for these replies, it is very helpful.
However I’m still confused about the acceptance probability in the Metropolis-Hastings (I’m sorry if my theoretical questions are not directly related to the STAN software, I found the forum after I read the book of Gelman and the STAN documentation. If my topic doesn’t fit the forum I totally get it).

In the acceptance probability, the posterior distribution must be w.r.t theta or eta? In the first case, it’s evaluated with theta and the proposal distribution is in the eta space. This is why it is suggested in https://github.com/gragusa/BayesianTools.jl/blob/master/README.md or https://arxiv.org/pdf/1511.01707. (section 6.3.2) that the ration of the Jacobian determinant must be added in the acceptance probability such that the algorithm targets the right posterior distribution, here theta.
In the second case, we target the posterior distribution of eta since the posterior is w.r.t to eta (Jacobian adjustment) and also the proposal. Then It doesn’t feel right to just transform the Markov chain such that the posterior is w.r.t to theta.

I could be wrong, I’m not a Stan coder so I haven’t looked at the code, but I think Stan calculates the posterior density with respect to the unconstrained eta space, the “proposal” (ie. the hamiltonian trajectory) is based on “energy” that is calculated wrt the density on the eta space, and then the point that is chosen from the trajectory is chosen based on weighted sampling with weights computed by the density in the eta space. Whenever we need a theta value, we just take our current eta value and transform it.

The density in eta space is induced by the chosen density in theta space, and the Jacobian correction caused by the transformation function Eta(theta), it’s simply a formal calculation that now gives a density in an unconstrained space such that samples transformed back to theta space via Theta(eta) come from the density we chose in theta space.

Typically the entire Markov chain is on either the constrained or unconstrained spaces, so the acceptance probability is computed with respect to the corresponding densities. Again, mapping back and forth the Jacobians cancel to yield the same acceptance probability for any given transitions.

The only time you’d need to need to include a Jacobian is if the original point was defined in the constrained space and the proposal in the unconstrained space (or vice versa) but the that would be an extraordinarily weird algorithm.

It’s the case, my original parameters are in the constrained space because they have a physical meaning. But, the proposal samples in the unconstrained space, this is why I added the ratio of Jacobian in the acceptance probabilities has suggested in the links given above.
One Jacobian determinant is evaluated at the candidate and the other at the last accepted candidate, I don’t see why the Jacobians cancel.

If the terms in the acceptance probability are in the unconstrained space, how can I transform the sample from the Markov chain such that I have access to the posterior distribution in the constrained space?

I strongly recommend in the highest terms to not try to develop a Markov chain that switches between two parameterizations. That’s just asking for trouble.

We can map densities back and forth with Jacobian corrections provided that the map is 1-1, in other words either space provides an equivalent representation of the system. What you want to do is map the target density from the “interpretable” constrained space to the unconstrained space, run your Markov chain on the unconstrained space, and then finally map all of your samples back to the constrained space using the inverse map. That way the Markov chain evolves in only one space and you don’t have to worry about trying to mix densities with respect to different base measures.

The point of the Jacobian correction is that if theta has distribution p(theta) and eta = f(theta) is invertible, then eta has distribution p(finverse(eta)) |dtheta/deta|

if you get a sample of eta according to that density, then you transform your sample through finverse(eta[i]) you will get theta[i] values that have the right distribution over theta

so, just do all your calculations, proposals and acceptance, in eta space. at the end, take all the eta[i] numbers and run them through finverse(eta[i]) and you’ll have a sample of theta[i] values.

Thank you everyone, it’s very helpful ! However, I need a last clarification.

Let say we have eta = f(theta) and theta = f^-1(eta).
My posterior is evaluated with theta, so to have the posterior in the unconstrained space I need the Jacobian adjustment such that: log [ p(eta | y) ] = log [ p(theta | y) ] + log [ J(eta) ], where J(eta) = | d_f-1(eta)/d_eta |
The gradient and the Hessian are also evaluated w.r.t theta. The problem is that I’m not sure to convert them in the unconstrained space properly. I’ve tried 2 solutions:

d_log [ p(eta | y) ] / d_eta = (d_log [ p(theta | y) ] / d_theta)*J(eta) + d_log [ J(eta) ] / d_eta
d_log [ p(eta | y) ] / d_eta = (d_log [ p(theta | y) ] / d_theta + d_log [ J(f(theta)) ] / d_theta)*J(eta)

I obtain the same gradient with both methods but not the same Hessian. I follow the same principle for the Hessian except that Hessian is pre multiplied by the transpose of the Jacobian and also post multiplied by the Jacobian.

There’s a chapter near the end of the manual that goes over what’s happening.

The math’s confusing because it’s one of those inverse things with sign changes that’s easy to confuse. But what’s going on under the hood is pretty simple:

For any parameter x declared with a constraint, we have an unconstraining transform y = f(x). Stan does all of its sampling with respect to the unconstrained paramters y. But it expresses the model log density in terms of x. To adjust for this change of variables, we add the log of the absolute Jacobian of the inverse of f evaluated at y to the log density. So the gradient of the log density must take derivatives through the Jacobian. All in

log p(y) = log p(x) + log J_f(y)

where J_f_inv(y) is the Jacobian of the inverse of f evaluated at y.

Then if you do something like Metropolis-Hastings or static HMC (as opposed to NUTS) where there’s an acceptance probability, it uses log p(y) just as if you’d defined it directly.