Does the soft constraint converge to the rigid constraint?

tl;dr:
No.

Motivations

I have often seen soft constraints recommended. The main reason is that a Bayesian will rarely prefer a point mass. Apart from that, I am now curious whether the current implementations of these two approaches in Stan are essentially equivalent in the limit.

Let me frame the problem more rigorously. Consider the parameter space \theta \in \Theta = R^d, and a transformed parameter

z=\xi(\theta),\quad z \in R^m

where \xi : R^d \to R^m, with d > m, is a smooth function. We also need some regularity condition on \xi to make it full rank. Denote by \nabla \xi the d \times m gradient matrix and by G = \nabla \xi^T \nabla \xi the Gram matrix; we require

\mathrm{rank} (\nabla \xi)=m.
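As a concrete toy example (my own illustration, not from any reference): take d = 2, m = 1 and \xi(\theta) = \theta_1^2 + \theta_2^2. Then

\nabla \xi(\theta) = (2\theta_1, 2\theta_2)^T, \quad G = \nabla \xi^T \nabla \xi = 4(\theta_1^2 + \theta_2^2),

and the rank condition holds everywhere except at the origin.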

When there is no constraint at all, the posterior distribution \nu(\theta) is proportional to \exp(-V(\theta)).

So far so good, until I start writing

transformed parameters {
  vector[m] z = xi(theta);
}
model {
  z ~ distribution(...);
}

Rigid constraint

Suppose in some situation we need a constraint on \xi(\theta). A non-Bayesian wants \xi(\theta) \approx z_0, z_0 \in R^m. z is sometimes termed a reaction coordinate in physics.

First, with a rigid constraint, we need to sample from the submanifold

\Sigma(z_0)= \{\theta\in\Theta | \xi(\theta) =z_0 \}

and the conditional distribution

\pi^{\xi}(d\theta | z_0 ) \propto \exp(-V(\theta))\,\delta_{\xi(\theta)-z_0}\, d\theta.

That is, the probability measure \nu conditioned on a fixed value z_0 of the reaction coordinate.

To clarify, we now also have the surface measure on \Sigma(z_0): d\sigma_\Sigma. It is the Lebesgue measure on \Sigma(z_0) induced by the ambient Euclidean space.
The co-area formula gives

\delta_{\xi(\theta)-z_0}\, d\theta = (\det G)^{-1/2}\, d\sigma_\Sigma.

Later we will see that the co-area formula is intrinsic, and it is not equivalent to the change-of-variables formula.
The latter reads dy = |\det \nabla f(x)| \, dx for a bijection y = f(x).
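To make the toy example above concrete: with z_0 = r^2 > 0, the submanifold \Sigma(z_0) is the circle of radius r, \det G = 4r^2 on it, and the co-area formula reads

\delta_{\xi(\theta)-z_0}\, d\theta = \frac{1}{2r}\, d\sigma_\Sigma.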

The first question is: which measure should we sample from? If you think this is a trivial question, think about transformed parameters that do not have an explicit prior in the model block: they enter the log density as if they had a uniform prior, yet we never add a Jacobian for them. Why not?

Short answer: here we are NOT doing a variable transformation at all, and we always stay in \theta space, but we do have the freedom to choose whether to stand in the Euclidean space or on the surface. Of course, the typical choice is the conditional distribution \pi^{\xi}(d\theta | z_0).

Sampling from a constrained dynamic is not supported in Stan, unless it is a simple case that we can easily reparametrize (e.g., simplex, sum-to-zero; see the sketch below), which is equivalent to explicitly solving the inverse \theta = \xi^{-1}(z_0). In general this can be done by adding a Lagrange multiplier \lambda to the Hamiltonian system (e.g., Hartmann and Schütte, 2005). The main idea is to modify the potential as V^{constraint}(\theta) = V(\theta) + \lambda (\xi(\theta) - z_0). For the very first reason I mentioned at the beginning, I know we probably do not have a strong motivation to implement it.
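For the simple reparametrizable case, a minimal Stan sketch of solving the constraint exactly (my own illustration for a sum-to-zero constraint; d is assumed declared in data and theta_free is a name I made up):

parameters {
  vector[d - 1] theta_free;
}
transformed parameters {
  // solve the constraint sum(theta) == 0 exactly for the last coordinate
  vector[d] theta = append_row(theta_free, -sum(theta_free));
}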

Soft constraint

Another strategy is to put a soft constraint. For example, we put a strong prior on \xi(\theta): \xi(\theta) \sim N(z_0, \tau^2 \mathrm{Id}_{m\times m}), that is, target += -\frac{1}{2\tau^2} \|\xi(\theta)-z_0\|^2.
In the limit \tau \to 0, the marginal density of \xi(\theta) (the image of the measure \nu under \xi) is guaranteed to converge to a Dirac measure at z_0.
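In Stan, this naive soft constraint is just the following sketch (xi is the user-defined transformation from before; z_0 and tau are assumed declared in data):

model {
  // soft constraint only, no geometric adjustment
  xi(theta) ~ normal(z_0, tau);
}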

OK, but only having the soft-constraint term target += -\frac{1}{2\tau^2} \|\xi(\theta)-z_0\|^2 is wrong, because we have not taken the transformation \xi into account.

Intuitively, no matter how small \tau is, the dynamics still see the variation around the surface \Sigma. Indeed, in the limit \tau \to 0, the distribution converges to

\exp(- V(\theta) ) d\sigma_\Sigma

which can also be rewritten as

\exp\big(- V(\theta) + \tfrac{1}{2} \log \det G \big)\, \delta_{\xi(\theta)-z_0}\, d\theta,

where G is the Gram matrix defined earlier (for a proof see, e.g., Ciccotti et al. 2008).

The term \frac{1}{2} \log \det G is known as the Fixman correction (Fixman 1974).
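Concretely, a minimal Stan sketch of the corrected soft constraint might look like this (xi and grad_xi are hypothetical user-defined functions returning \xi(\theta) and the d \times m matrix \nabla\xi; z_0, tau, d, and m are assumed declared elsewhere):

model {
  vector[m] z = xi(theta);                        // reaction coordinate
  matrix[d, m] J = grad_xi(theta);                // d x m gradient matrix of xi
  z ~ normal(z_0, tau);                           // soft constraint around z_0
  target += 0.5 * log_determinant(crossprod(J));  // Fixman correction: 1/2 * log det(J' * J)
}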

Adding a Jacobian?

From my understanding, Stan recommends adding the log absolute determinant of the Jacobian to the log density whenever we put a model on the transformed parameters z.

It is somewhat vague what that means; to most users, "Jacobian" just means the derivative. Sure, when \xi(\theta) is a bijection and \nabla \xi(\theta) is a square matrix,

\frac{1}{2} \log \det G = \log |\det \nabla \xi |

is an exact identity, since \det G = \det(\nabla \xi^T \nabla \xi) = (\det \nabla \xi)^2 in the square case. However, such a bijection is impossible here, because it would mean we put a soft constraint on all parameters, and in the limit the joint posterior degenerates to a Dirac measure determined purely by the prior.

Otherwise, \xi is a map from R^d to R^m with d > m, often much larger. The suggestion from old post 1 and old post 2 is to augment the transformed parameters, z^{*}=(z, z^{aug}), so as to make the map \theta \to z^{*} one-to-one, then calculate the Jacobian and marginalize out the auxiliary variables.

The augmented Jacobian reads

J^{aug}=\begin{bmatrix} \nabla \xi(\theta)_{1:m} & 0 \\ \nabla \xi(\theta)_{m+1:d} & I_{(d-m)\times (d-m)} \\ \end{bmatrix}

You should be able to convince yourself that \log |\det J^{aug}| \neq \frac{1}{2} \log \det G in general, no matter how similar they may look.
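To see why (a short check using the block-triangular structure): write \nabla \xi = \begin{bmatrix} A \\ B \end{bmatrix} with A = \nabla \xi(\theta)_{1:m} and B = \nabla \xi(\theta)_{m+1:d}. Then

\log |\det J^{aug}| = \log |\det A|, \quad \frac{1}{2} \log \det G = \frac{1}{2} \log \det (A^T A + B^T B),

so the augmented Jacobian ignores the rows B entirely, while the Gram matrix does not; the two coincide only in the square case d = m.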

If we were talking about a variable transformation, I would totally agree. It is like the log-normal or inverse-gamma: we put a prior on the transformed parameter, as if we generated that transformed parameter (e.g., normal), and then transform back to the original space (log-normal).
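For comparison, the change-of-variables case in Stan is the familiar log-normal sketch (a minimal illustration; y is just a parameter name I chose):

parameters {
  real<lower=0> y;
}
model {
  // generative prior on the transformed parameter log(y)
  log(y) ~ normal(0, 1);
  // change-of-variables adjustment: log |d(log y)/dy| = -log(y)
  target += -log(y);
}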

Generative vs Embedding, or change-of-variable vs conditioning

They are different.

The Stan manual writes:

A transformation samples a parameter, then transforms it, whereas a change of variables transforms a parameter, then samples it. Only the latter requires a Jacobian adjustment.

Agreed: we need a Jacobian when we know the generative distribution in the transformed-parameter space and we want to transform it back to the original space. In principle, we do not even need any extra prior on \theta.

But there is another situation, which I will call embedding. We know the distribution \nu in \theta space, but we want to find its embedded/conditional distribution on the submanifold \Sigma(z_0) induced by the transformed parameters. The section above shows that, in order to keep invariance, we need an adjustment of \frac{1}{2} \log \det G, not a Jacobian. Notice that we do not perform a change of variables at all.
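In the toy circle example from earlier (my illustration), this embedding adjustment is \frac{1}{2}\log\det G = \log\big(2\sqrt{\theta_1^2+\theta_2^2}\big).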

Prior-Prior conflict?

I am not sure I have made this clarification clearly enough. Though conceptually easy, the confusion between the Gram matrix and the Jacobian seems common, even among molecular-simulation experts (see Otter 2013 and Wong and York 2012 for a very similar argument).

I have not discussed the one-to-many transformation, i.e., m > d. In the limit case it is over-identified. But nothing prevents me from defining a regularization on all of \theta, \theta^2, \dots, with \theta^m \sim N(0, \alpha^m). In that case I know the distribution of \theta, and I am only loosely regularizing all the moments so that they do not become too wild. I do not even ask for consistency, as \tau is fixed and not close to 0. I think it is fine to leave all the constraints as they are, without further adjustment. This echoes what I pointed out earlier: we do have the freedom to choose to stand in the Euclidean space or on the surface.

We may also wonder whether it is a prior-prior conflict if we define a prior on \theta space and define another prior/embedding on the transformed parameters. Typically it is hard to detect, as the sum of multiple convex functions is still convex. But if m > d, the submanifold can be empty in the limit.

Conclusion

  1. For variable transformations (e.g., constrained parameters, simplex, …) and generative priors defined through transformed parameters (e.g., log-normal, inverse-Cauchy), the Jacobian is always correct, and the ONLY correct option. There is a natural bijection, so the Jacobian is well-defined.
  2. The purpose of embedding is to put a soft constraint on some reaction coordinate, and it does not change variables. In that case, in order to keep asymptotic consistency, the log density should be adjusted by 1/2 \log \det G, not by a Jacobian.
  3. A hard constraint is another less-recommended but possible option for general transformations.
  4. When the soft constraint itself is loose (e.g., z ~ normal(0, 2)), it only represents some approximate regularization, and leaving out the 1/2 \log \det G adjustment is also reasonable.

Some questions I am still not clear about:
the embedding interpretation and importance sampling; and when we should apply either adjustment in optimization for the MAP estimate.



It’s in discussions like these that it’s very, very important to keep in mind the difference between a probability distribution, a base (Lebesgue) measure, and a probability density function. These will be technical points but they’re important in clearing up some of the questions @yuling has raised.

A probability distribution is an abstract allocation of probability across a \sigma-algebra. Presuming that the map \xi is nice enough then there’s an immediate pushforward of any probability distribution on \mathbb{R}^{d} to a unique measure on the image of \xi, typically denoted the marginal distribution.

A probability density function is, well, a function. It’s the Radon-Nikodym derivative between a given probability distribution and a Lebesgue base measure, which allows us to compute expectations with respect to the probability distribution as certain expectations with respect to the Lebesgue base measure, which is just integration. Unfortunately, even if \xi is nice it may not be easy to find the probability density function that corresponds to the pushforward probability distribution.

The problem with identifying pushforward probability density functions is that the Lebesgue measure isn’t unique! Remember that any bijection from the real numbers into themselves, or a reparameterization, will preserve a Lebesgue measure only when it’s a linear transformation. In general the Radon-Nikodym derivative between the pushforward of a Lebesgue measure and the Lebesgue measure on the image is… the determinant of the Jacobian! Intuitively we can think of all of this mess as accounting for how differential volume changes along the transformation. For the more geometric among us, one can equivalently think of this as a consequence of how the Euclidean metric changes with these reparameterizations (the metric and the Lebesgue measure can be derived from each other for the real numbers through the Hausdorff measure, which is the |G| term that pops up all over the above calculations).
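For a concrete one-dimensional instance (a minimal sketch): under the reparameterization y = \exp(x), the pushforward of the Lebesgue measure dx satisfies dx = y^{-1} dy, so the Radon-Nikodym derivative with respect to dy is 1/y, which is exactly the -\log y Jacobian term that shows up in Stan programs.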

Now we can reparameterize the joint space, and we can reparameterize the image of \xi, but we don’t have to keep those reparameterizations consistent. We can choose whichever parameterizations we want for both so long as we are careful with the bookkeeping.

The submanifold given by the image of \xi has two immediate parameterizations and corresponding Lebesgue measures. The first is the intrinsic parameterization which is defined by the map \xi – this should correspond to the “generative” approach @yuling describes – where the Lebesgue measure on the submanifold is defined as the pushforward of the Lebesgue measure on the input space. The second is the extrinsic parameterization which is defined by the embedding of the submanifold into the input space – this should correspond to the “embedding” approach @yuling describes – which depends on how the submanifold fits into the metric structure of the input space.*

Because we have multiple Lebesgue measures on the image of \xi we can have multiple probability density functions on the submanifold that correspond to the pushforward probability distribution!

The danger in all of this is when we specify target distributions with probability density functions and get careless about the underlying base measures and come across seemingly contradictory results. The probability distributions and their transformations are all well-behaved, it’s inconsistent parameterizations, base measures, and probability densities that can cause us to drift away from those well-behaved distributions. Because we specify target distributions in Stan with densities we have to be very careful about implicitly specifying models on submanifolds. I personally find the generative approach to be extraordinarily cleaner because it avoids the problems induced by the projection from a higher-dimensional space to a lower-dimensional space.

*Geometric measure theory is nasty in some part because it focuses on the embedding picture (the other part is that it tries to generalize to non-smooth maps), whereas in my opinion everything becomes much more simple when focusing on the intrinsic geometries as we did in The geometric foundations of Hamiltonian Monte Carlo. Although the latter requires knowing a good bit of differential geometry!

Some additional notes:

Naive Lagrange multipliers don’t work all that well in practice because they introduce additional parameters that don’t have the symplectic structure we all love. Indeed holonomic constraints can be really tricky to handle in Hamiltonian systems because the constrained submanifold need not admit a symplectic structure. Explicitly symplectic integrators were introduced in the Shake algorithm and then improved in the Rattle algorithm. See “Geometric Numerical Integration” and Markov Chain Monte Carlo on Constrained Spaces.

Keep in mind that this “trick” works only when there is a reparameterization of the joint space such that the joint space decomposes into a product of the constrained submanifold and auxiliary parameters. In that case the embedding is trivial and everything aligns.


Hi Michael, thanks for the reply.

In my previous post I sometimes use density and distribution interchangeably, when the “density” is with respect to the Lebesgue measure on R^{d}. Otherwise I always specify d\theta or d\sigma_{\Sigma}.

Agree.
Another demonstration of this difference is to take a closer look at the delta function \delta. I copy the following from a textbook (Stoltz and Rousset, 2010):
We may also define \delta_\Sigma by \langle \delta_{\Sigma}, \phi \rangle = \int_{\Sigma} \phi \, d\sigma_{\Sigma}, where \phi is a test function. Then the factor \det G does not appear in the formula. The Dirac distribution \delta_\Sigma should not be confused with the notation \delta_{\xi(\theta)-z_{0}}(d\theta).

For a fair comparison, I do represent everything in d\theta and get the distribution proportional to \exp(-V(\theta) + \frac{1}{2} \log \det G)\, \delta_{\xi(\theta)-z_0}\, d\theta in the limit case. This is the density with respect to the Lebesgue measure on the Euclidean space \Theta = R^{d}. After all, we sample from \Theta, so this should be the density that we feed into target +=, right?

I don’t think there is a contradiction. I am saying the prior for a transformation (e.g., normal to log-normal) and the soft constraint are different. The latter is not generative (e.g., if I have a ~ N(0,1), b ~ N(0,1), and log(ab) ~ N(0,1) all together).

On one hand it’s somewhat irrelevant because Stan would never be able to sample from the distribution specified by that density on R^{d} as it is supported only on the constrained submanifold. If we ignore the computational difficulty then the answer is “it depends”.

The \log | G | is used if you want the target distribution to be based on a uniform measure on the constrained submanifold, and it’s ignored if you want the target distribution to be based on a uniform measure on the ambient space. Neither is right or wrong as both distributions are well-defined, it just depends on the intent of the modeler. For example, if you wanted to be generative and model something intrinsic to the constrained submanifold using parameters in the ambient space then you would include \log | G |, but a distribution just meant to induce crude regularization around the constrained submanifold might be more naturally specified extrinsically.

We’re on the same page mathematically – I just wanted to emphasize the importance of the base measures, and why the densities don’t transform so cleanly, to help clear up why there are multiple possibilities here.


This is great. Yes, it is another degree of freedom of the model.

I think it is safe for me to draw my conclusion again:

  1. When I want to keep distributional invariance, i.e., the marginal of z = \xi(\theta) to be exactly the distribution I modeled, z \sim foo (given, hypothetically, no other priors on \theta), I should add the log absolute determinant of the Jacobian.
  2. When I want to keep constraint consistency, i.e., in the limit the soft constraint converges to the hard constraint, I should add 1/2 \log \det G.
  3. And I should not feel guilty about not making any adjustment if I only model the regularization term heuristically.

Yes, but only in the context of a bijection followed by a marginalization, where some parts of the \log | J | might have to be integrated out in that marginalization, and only if such a sequence can be devised. For complex submanifolds that are subsets of the real numbers this may not be possible, and it will be ill-posed for submanifolds that have non-Euclidean topologies (like spheres and tori).

Presuming that the constrained submanifold is a subset of the real numbers then yes.

For general submanifolds the Hausdorff correction, \log | G |, will depend on the chart used in a given neighborhood of the submanifold.

It’s not about guilt so much as understanding and acknowledging the assumptions inherent in the heuristic to ensure that they are not in conflict with domain expertise or other parts of the model.

So basically “yes under certain conditions” like any good mathematical result.
