HMC Gradient information for original and transformed parameter space

Hi guys,

I recently started diving into Stan and there is a small detail in the manual that I just do not understand:

The HMC sampler works in an unconstrained space. To do that, all constrained parameters are transformed into an unconstrained space and the corresponding Jacobian adjustment is added to \log P(\theta, y). Now, during that process some parameters may be transformed into a lower-dimensional space (e.g. when \theta has a Dirichlet prior), hence the space where HMC operates might be smaller than the original one.

Now, within the HMC sampler, during one leapfrog step, one has to inverse-transform the current sample to the original space in order to calculate the log probability and get the gradients of \log P(\theta, y) plus adjustments. The dimension of the gradient vector should then be equal to the dimension of the constrained space, not the unconstrained space of \theta.

Question: As far as I understand, you get gradient information for all parameters in the original space, but sample (and update the momentum) in the unconstrained space. Shouldn't there be a mismatch in dimensionality between the gradients of the original parameters and the space where HMC operates?
For instance, if \theta_i is Dirichlet distributed with K elements, then I can transform \theta_i into a (K-1)-dimensional unconstrained space, but I would get gradient information for all K elements. Hence, to update the momentum variable, which is defined in the (K-1)-dimensional unconstrained space, I would only need K-1, and not K, gradient elements?

Best regards,

The inverse transform goes through the autodiff just like the rest of the log-probability computation. In other words, you compute the gradient in the original space but then multiply by the Jacobian of the constraining transform, which gives you the gradient in the unconstrained space.
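
To make that concrete, here is a minimal one-dimensional sketch in plain Python (illustrative only, not Stan internals, and working on the log scale): a positivity-constrained parameter \theta = \exp(\eta), where the gradient with respect to \eta is the constrained-space gradient times the Jacobian of the constraining map, plus the derivative of the log-Jacobian adjustment term.

```python
import numpy as np

# Hypothetical 1-D illustration (not Stan internals): a positive parameter
# theta = exp(eta), so the constraining map has Jacobian d(theta)/d(eta) = exp(eta)
# and log-Jacobian adjustment log|d(theta)/d(eta)| = eta.

def log_p_constrained(theta):
    # example target on the constrained space: theta ~ Exponential(1), up to a constant
    return -theta

def grad_log_p_constrained(theta):
    return -1.0

def log_p_unconstrained(eta):
    theta = np.exp(eta)
    return log_p_constrained(theta) + eta          # log density + log-Jacobian adjustment

def grad_log_p_unconstrained(eta):
    theta = np.exp(eta)
    # chain rule through the constraining transform, i.e. the constrained-space
    # gradient times d(theta)/d(eta), plus the gradient of the log-Jacobian adjustment
    return grad_log_p_constrained(theta) * np.exp(eta) + 1.0

# finite-difference check that the two agree
eta, h = 0.3, 1e-6
fd = (log_p_unconstrained(eta + h) - log_p_unconstrained(eta - h)) / (2 * h)
print(grad_log_p_unconstrained(eta), fd)
```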

All of the sampling happens on the unconstrained space. The only gradients considered are the gradients of the target density on the unconstrained space, after the transformation, Jacobian adjustment, and marginalization if needed. Similarly, there are only momentum components corresponding to each of the unconstrained parameters. If you want to see all of these components then run with the “diagnostic_file” option in CmdStan, PyStan, or RStan set to something local. That file will contain the unconstrained state, the momenta, and the gradient elements at each iteration of the Markov chain.

The mapping to unconstrained space happens before sampling and the mapping back to the constrained space happens after sampling, so at no point are the unconstrained and constrained spaces compared or mixed up, hence there is no conflict.

Dear nhuurre and betanalpha,

Thanks to both of you for your answers! Apologies for the late response; I tried to wrap my head around it, but I am still having trouble understanding it.

Continuing the example from before, let's say we have a categorical distribution and assign the prior \theta \sim Dirichlet(1,1,1). Then, in order to sample in the unconstrained space, we can apply a transformation (from your manual) from \theta to g(\theta) and reduce the sample space from R^3 to R^2. Here we have to multiply the probability of interest by |\det J|, the absolute determinant of the Jacobian matrix of the transformation, giving p(g^{-1}(\theta), y) \cdot |\det J(g^{-1}(\theta))|.

At the beginning of a standard HMC leapfrog step,

  1. the momentum variable p would be sampled from N(0,I_2),
  2. the current \theta_{transformed} \in R^2 would be inverse-transformed to \theta_{original} \in R^3,
  3. p(g^{-1}(\theta), y) \cdot |\det J(g^{-1}(\theta))| \in R^1 and its gradient \nabla_{\theta_{original}} \in R^3 would be evaluated,
  4. Here is the part I don't understand: how do I proceed to update p \in R^2 via \nabla_{\theta_{original}} \in R^3?
    I understand that you said I should multiply \nabla_{\theta_{original}} \in R^3 by J(g^{-1}(\theta)), but J(g^{-1}(\theta)) \in R^{3 \times 3}, so the resulting vector would still not be in R^2 but in R^3, and I am back at my original problem…

Apologies for the long post; I tried to look this up, but I just can't find anywhere this is explained in a way that even I can understand.

Best regards,

This means that gradients on the constrained space are never computed or even considered!

Let's go with your example, where our nominal model configuration space is parameterized by the three parameters (\theta_1, \theta_2, \theta_3) satisfying the simplex constraint. That constraint doesn't make sampling easy, so Stan will implicitly transform to an equivalent, two-dimensional space without any constraints at all. Let's call the parameters on this space (\eta_1, \eta_2).

Firstly, because this transformation isn't technically one-to-one, you have to be careful about exactly how you construct it. First you reparameterize the three simplex parameters into two unconstrained parameters and one fully constrained parameter and define a probability density function on this reparameterized space using the Jacobian density and all that jazz. Then you use the constraint to integrate the constrained parameter out of this reparameterized probability density function analytically. To see how this is done in detail for the simplex, see [1010.3436] Cruising The Simplex: Hamiltonian Monte Carlo and the Dirichlet Distribution.

In any case the result of all of this is a well-defined probability density function on the two-dimensional unconstrained space, \pi(\eta_1, \eta_2). The Stan implementation of Hamiltonian Monte Carlo targets this unconstrained density function and nothing else. That means we have two conjugate momenta, (p_1, p_2) that we sample (from a normal with adapted variances) to form a 4-dimensional phase space with coordinates (\eta_1, \eta_2, p_1, p_2) on which we simulate Hamiltonian dynamics using the gradients

( \frac{ \partial \pi }{ \partial \eta_1} (\eta_1, \eta_2), \frac{ \partial \pi }{ \partial \eta_2} (\eta_1, \eta_2) )

to generate a new point (\eta_1', \eta_2', p_1', p_2') and ultimately the new state in our Markov chain (\eta_1', \eta_2').

This is repeated over and over again to generate a Markov chain over the two-dimensional unconstrained space. The constrained space makes no appearance anywhere in the internal sampler code.
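
If it helps to see this mechanically, here is a rough sketch in plain Python (illustrative only, not the actual Stan implementation; the mass matrix, step-size adaptation, and the accept/reject step are all omitted) of a leapfrog trajectory that lives entirely on the four-dimensional phase space (\eta_1, \eta_2, p_1, p_2):

```python
import numpy as np

def grad_log_pi(eta):
    # placeholder for the gradient of the unconstrained, Jacobian-adjusted
    # log density pi(eta_1, eta_2); a standard normal stands in for it here
    return -eta

def leapfrog(eta, p, eps, n_steps):
    # Hamiltonian dynamics on (eta_1, eta_2, p_1, p_2); no constrained
    # parameters and no constrained-space gradients appear anywhere.
    eta, p = eta.copy(), p.copy()
    p = p + 0.5 * eps * grad_log_pi(eta)      # initial half step for momentum
    for _ in range(n_steps):
        eta = eta + eps * p                   # full step for position
        p = p + eps * grad_log_pi(eta)        # full step for momentum
    p = p - 0.5 * eps * grad_log_pi(eta)      # trim the extra half step
    return eta, p

rng = np.random.default_rng(0)
eta = np.zeros(2)                             # current unconstrained state
p = rng.standard_normal(2)                    # momenta drawn from N(0, I_2)
eta_new, p_new = leapfrog(eta, p, eps=0.1, n_steps=20)
# eta_new would then be accepted or rejected using the Hamiltonian, and only
# afterwards mapped back to the simplex for reporting.
```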

Instead we recover samples on the constrained space by mapping the unconstrained samples after the fact. From each unconstrained sample, (\eta_1, \eta_2), we define an equivalent constrained sample by just evaluating the deterministic constraining maps, \theta_1(\eta_1, \eta_2), \theta_2(\eta_1, \eta_2), and \theta_3(\eta_1, \eta_2).
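
As a toy illustration of that last step (using a generic stick-breaking-style map; Stan's actual simplex transform is similar in spirit but includes additional offsets, so treat this as a sketch rather than Stan's exact code):

```python
import numpy as np

def to_simplex(eta):
    # Illustrative stick-breaking map from R^2 to the 3-simplex: each
    # unconstrained coordinate decides what fraction of the remaining
    # "stick" goes to the corresponding theta component.
    theta = np.empty(3)
    stick = 1.0
    for k in range(2):
        frac = 1.0 / (1.0 + np.exp(-eta[k]))   # inverse logit
        theta[k] = stick * frac
        stick -= theta[k]
    theta[2] = stick
    return theta

theta = to_simplex(np.array([0.2, -1.3]))
print(theta, theta.sum())   # three non-negative components summing to 1
```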

No, it's not. The Jacobian matrix is 2×3. It's a “local linearization” of the transform, so its dimensions must match the dimensions of the two spaces. The part where Stan “cheats” is that the determinant is computed using only the first two columns.

Technically, if you look closely at how the reverse-mode automatic differentiation operates you’ll see that the components of the constrained-space gradient are written to the autodiff stack. But yes, there’s lots of stuff on the stack and no reason to single out that particular gradient.

Thanks a lot! This is it.

That made it clear wrt the gradients. Thank you!

There is no cheating. A probability density function does not have a well-defined transformation under anything but one-to-one maps, and so non-square Jacobians have no role. As I explained above, you first compute the pushforward probability density function along a one-to-one transformation, with a square Jacobian, and then analytically marginalize the resulting transformed probability density function using the constraint.

You can't compare the initial density function and that final marginalized density function and construct a self-consistent update rule. Nor can you do the same to the gradients, as the differentiation doesn't commute with the density transform (because the intermediate Jacobians depend on the parameters). See more below.

Sort of but also not really.

Let’s say that \theta denotes the constrained parameters, \eta the unconstrained parameters, and \phi: \eta \mapsto \theta the constraining map. If probability density functions transformed covariantly then the transformed density would just be \pi'(\eta) = \pi(\phi(\eta)). By the chain rule the gradient of that imaginary covariant density would then be

\frac{ \partial \pi' }{ \partial \eta} (\eta) = \frac{ \partial \pi }{ \partial \theta} (\phi(\eta)) \frac{ \partial \phi }{ \partial \eta} (\eta).

and it would look like the gradient transforms with a Jacobian.

But that's not how densities transform. Under one-to-one maps, densities pick up a copy of the Jacobian determinant, so we really have

\pi'(\eta) = \pi(\phi(\eta)) \left| \frac{ \partial \phi }{ \partial \eta} (\eta) \right|.

Now when we apply the chain rule we get

\frac{ \partial \pi' }{ \partial \eta} (\eta) = \frac{ \partial \pi }{ \partial \theta} (\phi(\eta)) \frac{ \partial \phi }{ \partial \eta} (\eta) \left| \frac{ \partial \phi }{ \partial \eta} (\eta) \right| + \pi(\phi(\eta)) \frac{ \partial}{\partial \eta} \left| \frac{ \partial \phi }{ \partial \eta} (\eta) \right|.

Can you find the incorrect first gradient within the second correct gradient and then argue that the first gradient is hidden somewhere in the expression graph where the correct gradient is computed? Technically yes. But it’s only an intermediate term and it does not correspond to the gradient of some well-defined, self-contained mathematical object. It’s dangerous to try to pull it out and reason about terms like this as isolated abstractions exactly because they’re not self-contained. They’re just accidents of the implementation.
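
If it helps, the two product-rule terms are easy to check numerically in one dimension. Here is a small sketch with a hypothetical map \phi(\eta) = \exp(\eta) and constrained-space density \pi(\theta) = \exp(-\theta) (both chosen only for illustration): the full expression matches a finite-difference derivative of \pi'(\eta), while the chain-rule-only term does not.

```python
import numpy as np

# Hypothetical 1-D example: constraining map theta = phi(eta) = exp(eta) and
# constrained-space density pi(theta) = exp(-theta).
phi  = np.exp
dphi = np.exp                                  # d(phi)/d(eta)
pi   = lambda theta: np.exp(-theta)
dpi  = lambda theta: -np.exp(-theta)

def pi_unconstrained(eta):
    # correctly transformed density: pi(phi(eta)) * |d(phi)/d(eta)|
    return pi(phi(eta)) * np.abs(dphi(eta))

eta, h = 0.4, 1e-6
numeric = (pi_unconstrained(eta + h) - pi_unconstrained(eta - h)) / (2 * h)

chain_only = dpi(phi(eta)) * dphi(eta)         # the "covariant" guess from above
# both product-rule terms; note d|d(phi)/d(eta)|/d(eta) = exp(eta) = dphi(eta) here
full = chain_only * np.abs(dphi(eta)) + pi(phi(eta)) * dphi(eta)

print(numeric, full)     # these agree
print(chain_only)        # this does not
```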