tl;dr.
No.
Motivations
I have often seen the recommendation of soft constraints. The main reason is that a Bayesian will rarely prefer a point mass. Apart from that, I am now curious whether the current implementations of these two approaches in Stan are essentially equivalent in the limiting case.
Let me frame the problem more rigorously. Consider the parameter space \theta \in \Theta = R^d, and a transformed parameter

z = \xi(\theta),

where \xi : R^d \to R^m, d > m, is a smooth function. We also need some regularity condition on \xi to make it full rank. Denote by \nabla \xi the d \times m gradient matrix and by G the Gram matrix G = \nabla \xi^T \nabla \xi; it should satisfy

\det G(\theta) > 0 for all \theta \in \Theta.
When there is no constraint at all, the posterior distribution \nu(\theta) is proportional to \exp(-V(\theta)).
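To have something concrete in mind, here is a toy example of my own (not from the references cited below): take d = 2, m = 1 and the scalar reaction coordinate

\xi(\theta) = \theta_1 \theta_2,

so that

\nabla \xi(\theta) = (\theta_2, \theta_1)^T, \qquad G(\theta) = \nabla\xi(\theta)^T \nabla\xi(\theta) = \theta_1^2 + \theta_2^2,

which is full rank away from the origin. I will reuse this example below.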
So far so good, until I start writing
transformed parameters {
  vector[m] z = xi(theta);   // xi is some smooth, user-specified map from R^d to R^m
}
model {
  z ~ distribution(...);     // a model stated directly on the transformed parameters
}
Rigid constraint
Suppose in some situation we need a constraint on \xi(\theta). A non-Bayesian wants \xi(\theta) \approx z_0, with z_0 \in R^m. In physics, z is sometimes termed the reaction coordinate.
Firstly, with a rigid constraint, we need to sample from the submanifold

\Sigma(z_0) = \{ \theta \in R^d : \xi(\theta) = z_0 \},

and the conditional distribution

\pi^{\xi}(d\theta \mid z_0) = \frac{ \exp(-V(\theta)) \, (\det G(\theta))^{-1/2} \, \sigma_{\Sigma(z_0)}(d\theta) }{ \int_{\Sigma(z_0)} \exp(-V) \, (\det G)^{-1/2} \, d\sigma_{\Sigma(z_0)} }.

That is, the probability measure \nu conditioned on a fixed value z_0 of the reaction coordinate.
To clarify, we now also have the surface measure on \Sigma(z): d\sigma_{\Sigma}. It is the Lebesgue measure on \Sigma(z) induced by the ambient Euclidean space.
The co-area formula gives

\int_{R^d} f(\theta) \, d\theta = \int_{R^m} \int_{\Sigma(z)} f(\theta) \, (\det G(\theta))^{-1/2} \, d\sigma_{\Sigma(z)}(\theta) \, dz.
Later we will see that the co-area formula is intrinsic, and it is not equivalent to the change-of-variables formula.
The latter reads dy = |\det J_x(y)| \, dx, where J_x(y) = \partial y / \partial x is the Jacobian of a bijective transformation.
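A small example of the difference, using the coordinate projection \xi(\theta) = \theta_1 with d = 2, m = 1: here \nabla\xi = (1, 0)^T, G = 1, and \Sigma(z) is the vertical line \{\theta_1 = z\}, so the co-area formula reduces to Fubini,

\int_{R^2} f(\theta) \, d\theta = \int_{R} \int_{\Sigma(z)} f(\theta) \, d\sigma_{\Sigma(z)} \, dz = \int_{R} \int_{R} f(z, \theta_2) \, d\theta_2 \, dz,

a disintegration of the ambient measure, rather than a change of variables between spaces of the same dimension.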
The first question is: which measure should we sample from? If you think it is a trivial question, think about the transformed parameters that do not have an explicit prior in the model block: they enter the log density as if they had a uniform prior, yet why do we never add a Jacobian for them?
Short answer: here we are NOT doing a variable transformation at all and we always stay in the \theta space, but we do have the freedom to choose whether to stand in the Euclidean space or on the surface. Of course, the typical choice is the conditional distribution \pi^{\xi}(d\theta \mid z_0).
Sampling from a constrained dynamics is not supported in Stan, unless it is a simple case that we can easily reparametrize (e.g., simplex, sum-to-zero), which is equivalent to explicitly solving the inverse \theta = \xi^{-1}(z_0). In general this can be done by adding a Lagrange multiplier \lambda to the Hamiltonian system (e.g., Hartmann and Schütte, 2005). The main idea is to modify the potential to V^{constraint}(\theta) = V(\theta) + \lambda^T (\xi(\theta) - z_0). For the very first reason I mentioned at the beginning, I know we probably do not have a strong motivation to implement it.
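Schematically, and only as my own paraphrase of the general idea (not the exact scheme of Hartmann and Schütte 2005), the constrained dynamics looks like

H(\theta, p) = V(\theta) + \frac{1}{2} p^T p + \lambda^T (\xi(\theta) - z_0),
\dot{\theta} = p, \qquad \dot{p} = -\nabla V(\theta) - \nabla \xi(\theta) \, \lambda,

where the Lagrange multiplier \lambda(t) \in R^m is chosen at every time so that the trajectory stays on \Sigma(z_0), i.e., \xi(\theta(t)) = z_0 and \nabla\xi(\theta(t))^T p(t) = 0.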
Soft constraint
Another strategy is to put a soft constraint. For example, we put a strong prior on \xi(\theta): \xi(\theta) \sim N(z_0, \tau^2 \mathrm{Id}_{m\times m}). That is,

target += -\frac{1}{2\tau^2} \|\xi(\theta) - z_0\|^2.

In the limit \tau \to 0, the marginal density of \xi(\theta) (the image of the measure \nu under \xi) is guaranteed to converge to a Dirac measure at z_0.
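Here is a minimal Stan sketch of the soft-constraint version, reusing the toy reaction coordinate \xi(\theta) = \theta_1\theta_2 from above and a standard normal stand-in for \exp(-V); z0 and tau are illustration inputs I made up (note tau enters as a standard deviation):

functions {
  // toy scalar reaction coordinate; in practice xi is whatever smooth map you care about
  real xi(vector theta) {
    return theta[1] * theta[2];
  }
}
data {
  real z0;            // target value of the reaction coordinate
  real<lower=0> tau;  // soft-constraint scale; the discussion below concerns tau -> 0
}
parameters {
  vector[2] theta;
}
transformed parameters {
  real z = xi(theta);
}
model {
  theta ~ normal(0, 1);   // stand-in for exp(-V(theta))
  z ~ normal(z0, tau);    // the soft constraint, with no extra adjustment term
}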
OK, having only target += -\frac{1}{2\tau^2} \|\xi(\theta) - z_0\|^2 is wrong, because it does not take the transformation into account.
Intuitively, no matter how small \tau is, the dynamics still sees the variation of \xi around the surface \Sigma. Indeed, in the limit \tau \to 0 the sampled distribution converges to

\frac{ \exp(-V(\theta)) \, (\det G(\theta))^{-1/2} \, \sigma_{\Sigma(z_0)}(d\theta) }{ \int_{\Sigma(z_0)} \exp(-V) \, (\det G)^{-1/2} \, d\sigma_{\Sigma(z_0)} },

which can also be rewritten as

\propto \exp\left( -V(\theta) - \frac{1}{2} \log \det G(\theta) \right) \sigma_{\Sigma(z_0)}(d\theta),

where G is the Gram matrix defined earlier (for a proof see, e.g., Ciccotti et al. 2008).
The term \frac{1}{2} \log \det G is known as the Fixman correction (Fixman 1974).
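In the toy example above, the Fixman term is

\frac{1}{2} \log \det G(\theta) = \frac{1}{2} \log(\theta_1^2 + \theta_2^2),

which is not constant along the level set \{\theta_1 \theta_2 = z_0\} (parametrize it by \theta_1 = t, \theta_2 = z_0/t to see that G = t^2 + z_0^2/t^2 varies with t), so the \tau \to 0 limit genuinely reweights the surface rather than simply restricting \exp(-V) to it.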
Adding a Jacobian?
From my understanding, Stan recommends adding the log absolute determinant of the Jacobian to the log density whenever we put a model on the transformed parameters z.
It is somewhat vague what this means; to most users, "Jacobian" means the derivative. Sure, when \xi(\theta) is a bijection and \nabla \xi(\theta) is a square matrix,

\frac{1}{2} \log \det G(\theta) = \log \left| \det \nabla \xi(\theta) \right|

is an exact identity. However, such a bijection is impossible here, because it would mean we put a soft constraint on all parameters, and in the limit the joint posterior degenerates to a Dirac measure purely due to the prior.
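A one-line sanity check of that identity in the scalar case (d = m = 1):

G(\theta) = \xi'(\theta)^2 \quad \Longrightarrow \quad \frac{1}{2} \log \det G(\theta) = \frac{1}{2} \log \xi'(\theta)^2 = \log |\xi'(\theta)|,

which is exactly the usual log absolute Jacobian of a monotone transformation.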
Otherwise, \xi is a map from R^d to R^m with d > m, if not much larger. The suggestion from old post 1 and old post 2 is to augment the transformed parameters, z^{*} = (z, z^{aug}), so as to make \theta \to z^{*} a one-to-one transformation, then calculate the Jacobian and marginalize out the auxiliary variables.
The augmented Jacobian reads

J^{aug}(\theta) = \frac{\partial z^{*}}{\partial \theta} = \begin{pmatrix} \nabla \xi(\theta)^T \\ \nabla \xi^{aug}(\theta)^T \end{pmatrix} \in R^{d \times d}.

You should be convinced that \log |\det J^{aug}| \neq \frac{1}{2} \log \det G in general, no matter how the augmentation is chosen.
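To see this with the toy example above, pick the (arbitrary) augmentation z^{aug} = \theta_2. Then

J^{aug}(\theta) = \begin{pmatrix} \theta_2 & \theta_1 \\ 0 & 1 \end{pmatrix}, \qquad \log |\det J^{aug}| = \log |\theta_2|,

while \frac{1}{2} \log \det G = \frac{1}{2} \log(\theta_1^2 + \theta_2^2). The two terms differ, and the first even depends on which augmentation we picked.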
If we are talking about a genuine variable transformation, I totally agree. It is like the log-normal or the inverse-gamma: we put a prior on the transformed parameter, as if we generated that transformed parameter (e.g., normal), and then transform back to the original space (log-normal).
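For concreteness, the textbook change-of-variables case in Stan (the standard log-normal construction, not specific to this post):

parameters {
  real<lower=0> y;
}
model {
  // prior stated on the transformed parameter u = log(y)
  log(y) ~ normal(0, 1);
  // Jacobian adjustment: |d/dy log(y)| = 1/y, so add -log(y) to the target
  target += -log(y);
}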
Generative vs Embedding, or change-of-variable vs conditioning
They are different.
The Stan manual writes:
A transformation samples a parameter, then transforms it, whereas a change of variables transforms a parameter, then samples it. Only the latter requires a Jacobian adjustment.
Agreed: we need a Jacobian when we know the generative distribution in the transformed-parameter space and we want to transform it back to the original space. In principle, we do not even need any extra prior on \theta.
But there is another situation, which I will call embedding. We know the distribution \nu in \theta space, but we want to find its embedded/conditional distribution on the submanifold \Sigma(z_0) induced by the transformed parameters. The section above shows that in order to keep invariance, the adjustment is \frac{1}{2} \log \det G, not a Jacobian. Notice that we do not perform a change of variables at all.
Prior-Prior conflict?
I don't think I have made this entirely clear. Though conceptually easy, the confusion between the Gram matrix and the Jacobian seems common, even among experts in molecular simulation (see Otter 2013 and Wong and York 2012 for a very similar argument).
I have not discussed the one-to-many transformation, i.e., m > d. In the limit it is over-identified. But there is nothing preventing me from defining a regularization on all of \theta, \theta^2, ..., with \theta^m \sim N(0, \alpha^m). In that case I know the distribution of \theta, and I am only loosely regularizing all the moments so that they do not become wild. I do not even ask for consistency, since \tau is fixed and not close to 0. I think it is fine to leave all such constraints as they are, without further adjustment. This echoes what I pointed out earlier: we do have the freedom to choose to stand in the Euclidean space or on the surface.
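A minimal Stan sketch of this kind of loose moment regularization, with arbitrary illustration scales (2 and 5) that are fixed and never sent to zero:

parameters {
  vector[5] theta;
}
model {
  theta ~ normal(0, 1);                    // base prior: the distribution of theta we know
  // loose regularization on higher moments, with no 0.5 * log det G adjustment;
  // Stan will typically warn about a possible missing Jacobian here, which is
  // exactly the situation discussed in this post
  square(theta) ~ normal(0, 2);
  theta .* square(theta) ~ normal(0, 5);
}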
We may also wonder whether there is a prior-prior conflict if we define a prior on the \theta space and then define another prior/embedding on the transformed parameters. Typically it is hard to detect, as the sum of multiple convex functions is still convex. But if m > d, the submanifold can be empty in the limit.
Conclusion
- For variable transformations (e.g., constrained parameters, simplexes, …) and generative priors defined on transformed parameters (e.g., log-normal, inverse-Cauchy), the Jacobian is always correct, and the ONLY correct option. There is a natural bijection, so the Jacobian is well-defined.
- The purpose of embedding is to put a soft constraint on some reaction coordinate, and it does not change variables. In that case, in order to keep asymptotic consistency, the log density should be adjusted by \frac{1}{2} \log \det G, not by a Jacobian.
- A hard constraint is another, less recommended but possible, option for general transformations.
- When the soft constraint itself is loose (e.g., z ~ normal(0, 2)), it only represents some approximate regularization, and omitting the \frac{1}{2} \log \det G adjustment is also reasonable.
Some questions I am still not clear about:
the embedding interpretation and importance sampling; and when we should include either adjustment when optimizing for the MAP.
[edit: fixed typo]