Understanding non-bijective variable transformations

I understand how variable transformations work when the mapping is a bijection (correct with log determinant of the Jacobian, using a change of variables argument one can derive with basic calculus and linear algebra), but I would like to grasp the theory and practicalities for transformations which are not bijections, since the latter are not always practical (singularities, etc).

My understanding is that the only built-in non-bijective transformation Stan has is the unit vector transformation, but from reading the manual I could not get a general understanding.

I don’t have a background in differential geometry, so suggestions of intro texts / lecture notes / slides that would help me understand this for the purpose of Bayesian inference would be greatly appreciated.

Perhaps the best introductory text for this stuff (not specific to non-bijective transformations though) is @betanalpha’s Probability Theory (For Scientists and Engineers).

The easiest route to understanding this stuff is to think of the prior statements in the Stan model not as “x is drawn from some distribution (like a normal distribution)” but as “increment the log probability by some function (like a normal log probability density function evaluated at x)”. Viewed in this way, we see that the prior is, at its most general, some arbitrary function over the unconstrained parameters. Our goal is to build a prior that adequately captures some domain expertise and prior uncertainty. When our domain expertise is naturally expressed over a transformed parameter, then a natural way to construct the prior function is to use a probability density function evaluated at the transformed parameter, plus a Jacobian adjustment. However, if the transformation isn’t bijective, then this nifty trick doesn’t work, and we need to think in more detail about how to construct a prior function according to our domain expertise.

Some specific discussion of this issue in the context of centered vectors (i.e. vectors with a hard sum-to-zero constraint can be found in the Stan Users Guide here 1.7 Parameterizing centered vectors | Stan User’s Guide

2 Likes

@jsocolar: I don’t see anything about this in either source you linked.

From the texts I looked at I suspect that the generalized adjustment is the determinant of the Riemannian metric. If this is correct, I am looking for a text that explains enough diff geometry to understand this at the “user level”.

(Sorry if the question is trivial to someone who has a background in differential geometry, I don’t).

Sorry if I’m being slow here; you don’t see anything about what exactly?

The first link is a good treatment of the probability theory to bring people like me (I also have no background in differential geometry) up-to-speed. The link from the Stan Users Guide has some fairly extensive discussion about how to achieve priors whose pushforwards for the margins of the transformed parameter have some desired form (as long as that form is Gaussian).

Edit: it’s quite possible that your question is more detailed or technical in a way that’s going over my head; my apologies if so!

Edit2: Note also that there are two subtly different things that you might want to achieve with a Jacobian adjustment. Taking the centered vector case as an example, denote the non-bijective transformation as F. For an arbitrary probability density function \pi, then we might want to ensure that, for some adjustment A that depends only on F and \theta, incrementing the target by A + \log(\pi(F(\theta))) will yield a prior pushforward density on F(\theta) corresponding to \pi. The trick here is that this getting a pushforward proportional to \pi on the image of F is not the same thing as getting margins of the prior pushforward distribution on the image of F equal to the margins of \pi on the domain of \pi. That’s what all the discussion in the linked part of the Stan Users Guide is about: if we want to specify a prior via the expected margins on the transformed parameter, then how do we achieve that? This is hard in general and Jacobian adjustments cannot help us. Sometimes it’s impossible (for example, say we declare a 2-dimensional simplex and ask for prior pushforwards over its elements where one of the margins is not identical to the pushrward on one minus the other element).

Probably @betanalpha will have something interesting to add here!

1 Like

@tpapp check out add Givens-based orthonormal matrix transform and inverse plus Jacobian · Issue #2590 · stan-dev/math · GitHub. This stuff is hard!

1 Like

@tpapp sorry for the flurry of edits and messages. Just want to add one more thing:
If the transformation F is one-to-one but is not onto, then you can do a bijective transformation with a Jacobian adjustment such that you can then project down to the image of F via a linear transformation. I’m reasonably certain that if you then increment the target by \log(\pi(F(\theta))) and the appropriate Jacobian adjustment, you will end up with a pushforward on the image of F proportional to \pi. But again, the margins of this pushforward will be proportional to the margins of \pi evaluated over the image of F, not the margins of \pi evaluated over the domain of \pi.

This doesn’t cover the case where F is not one-to-one.

I appreciate that you are trying to help, but that’s the case I am interested in — please kindly read the post.

In any case, I think if I want this purely for MCMC, and I can extend the transformation to a bijection, adjusting with the (log) determinant of its Jacobian is sufficient. Consider

I(f) = \int_D f(x) p(x) dx \approx \frac1N \sum_{i = 1}^N f(x_i)

with D \subseteq \mathbb{R}^n, where I know how to obtain points x_i \sim p (to wit, MCMC, usual caveats apply about convergence of this approximation).

Now suppose I have some transformation g: \mathbb{R}^m \to D \subseteq \mathbb{R}^n, m \ge n. Extend this to \tilde{g}: \mathbb{R}^m \to D \times \mathbb{R}^{m - n} so that

g_i(y) = \tilde{g}_i(y) \qquad \forall y \in \mathbb{R}^m, 1 \le i \le n

and \tilde{g} is (almost everywhere) a bijection. Similarly introduce \tilde{f}, \tilde{p} that take a vector, and evaluate f or p respectively on the first n coordinates. Then I am really after

I(f) = \int_\mathbb{R^m} \tilde{f}(\tilde{g}(y)) \tilde{p}(\tilde{g}(y)) \det(\tilde{g}_y(y)) dy \approx \frac1N \sum_{i= 1} f(g(y_i)) \det(\tilde{g}_y(y_i))

where the samples y_i come from MCMC under p.

The practical bottom line is that if I can extend my transformation to a bijection (a.s.), I just use the Jacobian of that for adjustment, and everything should work out. I can even use the bijection that gives me the nicest Jacobian for the algebra, it should not matter, and then I just code it up in Stan and things work (provided my transformation has no pathologies relevant for my application, as usual).

The only example you gave in your post of a non-bijective transform was the simplex transform, which is one-to-one.

1 Like

Nope, I mentioned the unit vector transformation, which isn’t bijective.

Again, please kindly read what you are responding to. It’s perfectly acceptable not to contribute if you can’t, but adding noise to a discussion is definitely not helping.

Apologies for misremembering that you had written about a simplex transform (which also isn’t bijective, but is one-to-one) rather than a unit-vector transform.

1 Like

Before commenting I just wanted to say that most of @jsocolar’s comments have been reasonably on target here, especially given the vagueness/generality of the question and the lack of anyone else commenting, and consequently this response comes across far more hostile than it needs to be.

In theory there’s only one way to transform probability distributions – through pushfoward operations. Let X be the input space equipped with a probability distribution \pi, let Y be the output space, and let \phi be a transformation between the two, \phi : X \rightarrow Y.

If \phi is a measurable transformation then for all well-behaved sets in the output space, B \subset Y, the pullback sets, \phi^{-1}(B) \subset X, will be well-behaved sets in the input space. The details of what “well-behaved” means and how the pullback sets are defined is are discussed in Section 1.4 and 2.2 of the aforementioned piece on probability theory, Probability Theory (For Scientists and Engineers).
From here on in I’m going to presume that all of the sets are unambiguously “well-behaved”.

In this case the probability distribution \pi defined on X induces a pushforward distribution \phi_{*} \pi on Y by the probability allocations

\mathbb{P}_{\phi_{*} \pi}[B] = \mathbb{P}_{\pi}[\phi^{-1}(B)]

for any well-behaved set B \subset Y.

Equivalently the pushforward distribution can be defined in terms of expectation values – the pushforward distribution is the unique distribution that satisfies

\mathbb{E}_{\phi_{*} \pi}[g] = \mathbb{E}_{\pi}[g \circ \phi]

for all functions g : Y \rightarrow \mathbb{R} with well-defined expectation values. This latter form will typically be more useful when trying to work out how various representations of a probability distribution transform.

The dimensions of X and Y don’t have to be the same for \phi to be measurable. For example any well-behaved projection function_ from a higher-dimensional space X into a lower-dimensional space Y \subset X will be measurable. When X is a product space, X = Z_{1} \times Z_{2} and \phi is the projection function X \rightarrow Z_{1} or X \rightarrow Z_{2} then the pushforward distribution is also known as a marginal distribution. That said people are often sloppy about the term “marginal” and use it to refer to any pushfoward distribution from a higher-dimensional space to a lower-dimensional space. If \phi is a bijection between two real spaces of equal dimension then it is typically referred to as a reparameterization.

This is all a bit abstract, but that abstraction allows us to unambiguously understand how representations of probability distributions like probability density functions and samples transform.

For example let X and Y be real spaces, each with a fixed parameterization/coordinate system, and hence a fixed uniform volume measure which allows us to define probability density functions. Then the pushforward condition becomes

\begin{align*} \mathbb{E}_{\phi_{*} \pi}[g] &= \mathbb{E}_{\pi}[g \circ \phi] \\ \int_{Y} \mathrm{d} y \, \phi_{*} \pi(y) \, g(y) &= \int_{X} \mathrm{d} x \, \pi(x) \, g \circ \phi(x). \\ \int_{Y} \mathrm{d} y \, \phi_{*} \pi(y) \, g(y) &= \int_{Y} \mathrm{d} y \left[ \int_{X} \mathrm{d} x \, \pi(x) \, \delta(y - \phi(x)) \right] g(y), \end{align*}

where \delta denotes the Dirac delta function. Consequently the pushforward density function can be defined by

\phi_{*} \pi(y) = \int_{X} \mathrm{d} x \, \pi(x) \, \delta(y - \phi(x)).

In general this integral is not traceable, although in a few special cases it reduces to more familiar forms. For example when \phi is a product space projection function \phi: X = Z_{1} \times Z_{2} \rightarrow Z_{1} then

\begin{align*} \phi_{*} \pi(z_{1}) &= \int_{Z_{1} \times Z_{2}} \mathrm{d} z_{1} \mathrm{d} z_{2} \, \pi(z_{1}, z_{2}) \, \delta(z_{1} - \phi(z_{1}, z_{2})) \\ &= \int_{Z_{2}} \mathrm{d} z_{2} \, \pi(z_{1}, z_{2}); \end{align*}

in other words we “integrate out the nuisance variable”. Similarly if \phi is a reparameterization then we can use the properties of the delta function to derive

\begin{align*} \phi_{*} \pi(y) &= \int_{X} \mathrm{d} x \, \pi(x) \, \delta(y - \phi(x)) \\ &= \int_{X} \mathrm{d} x \, \pi(x)) | J(x) | \, \delta(\phi^{-1}(y) - x) \\ &= \int_{X} \mathrm{d} x \, \pi(\phi^{-1}(y))) | J(\phi^{-1}(y)) |, \end{align*}

which is the usual “change of variables” equation.

What about samples? Here the pushforwad condition implies

\begin{align*} \mathbb{E}_{\phi_{*} \pi}[g] &= \mathbb{E}_{\pi}[g \circ \phi] \\ \lim_{n \rightarrow \infty} \frac{1}{N} \sum_{n = 1}^{N} g(y_{n}) &= \lim_{n \rightarrow \infty} \frac{1}{N} \sum_{n = 1}^{N} g \circ \phi (x_{n}). \end{align*}

In other words if

(x_{1}, \ldots, x_{N}, \ldots)

is a sample from X then

(\phi(x_{1}), \ldots, \phi(x_{N}), \ldots)

is a sample from Y! This is extremely straightforward to implement in practice, and indeed it’s one of the reasons why sampling methods are so uniquely well-suited to exploring marginal/pushforward distributions.

Now we can reflect on some of Stan’s variables types from this context.

Once we take into account the summation constraint a D-dimensional simplex is actually only (D - 1)-dimensional and so we can transform to a (D - 1)-dimensional space where the variables are more decoupled and work out the corresponding pushforward density function with a Jacobian determinant.

The space of D-dimensional unit vectors is not a real space and hence can’t be globally parameterized in terms of real numbers/variables. It can, however, be defined as the image of a map from a higher-dimensional real space, in particular an embedding map from \mathbb{R}^{D + 1} to a sphere with any fixed radius. Stan’s unit_vector type is actually implicitly defined as a pushforward from \mathbb{R}^{D + 1}. When you add a unit_vector declaration to the parameters block the compiler automatically increments a probability density function over \mathbb{R}^{D + 1} – I believe that it’s a product of zero-centered normal density functions – that pushes forward to a uniform probability density function over a sphere with radius R = 1.

When one runs Stan it generates Markov chain samples over \mathbb{R}^{D + 1} which can then be automatically mapped to samples from the D-dimensional unit sphere. That said once likelihood functions are introduced the geometry of the posterior density function over the latent \mathbb{R}^{D + 1} space can be surprisingly weird, especially when D is large, so this isn’t always a great solution.

I’m not sure to what “singularities” refers. So long as the transformation \phi : X \rightarrow Y is sufficiently measurable then the pushforward operation will be well-defined, and \phi has to be pretty messed up for it to not be measurable.

Anyway if that didn’t address all of the questions then let me know.

6 Likes

Which is zero. Zero hostile is what it needs to be.

2 Likes

@jsocolar, I apologize if my tone was grouchy (agreeing with @maxbiostat that the only valid level is zero). I felt that my question as not understood, and I realize that you were trying to help.

Generally, I am looking for a practical solution to the following situation: the user has a model where some parameters reside in a manifold that for which there is no hardcoded transformation in Stan, and the bijections are infeasible or have undesirable properties. To explore the parameter space with NUTS, one needs a transformation from \mathbb{R}^n, how can the user conveniently implement this correctly?

One can of course always derive solutions from first principles (map, integrate to CDF, obtain the multivariate derivative). This can be tedious, though. One nice special case is (a.s.) bijections, for which we have a theorem that tells us to adjust by the abs determinant of the Jacobian.

The question is if there is something similar for non-bijections? As I wrote above, if one can extend mappings to bijections with helper/nuisance variables then that works, so effectively I have an answer — one can always crank through the Jacobian in the worst case. Just curious if there are other techniques.

1 Like

What do you mean by the bijections? If I understand properly, you already correctly point out that one can basically “pad” out the transform so that bijection is preserved. Of course, as you also point out this is not exactly great because you still get a high dimensional space to explore, and one for which geometry might end up being nastier. But the key here is that, at least in principle, the padding is yours to choose.

I appreciate that you are trying to explore other alternatives should they exist (I also don’t know of any other techniques), but I do believe one can get a lot of mileage out of carefully considering alternative ways of doing the “matching”.

Maybe @yuling has some insight.

For some problems one can come up with (almost) bijections that have undesirable properties. Eg for the unit sphere, you can use spherical coordinates; but you run into the issues mentioned above.

I think it depends on how the transformed parameter is used. My understanding is that is that practically if we want to put a prior on a transformed parameter z=f(\theta) in which \theta is the sampling parameter and f is a smooth m to n mapping for m>n, then we can add target += -1/2 log det G, for G= (\nabla_{\theta} f)^T\nabla_{\theta} f .

2 Likes

Some references, although I not sure how helpful they will be in terms of getting a sampler up and running:

Au, Chi, and Judy Tam. “Transforming Variables Using the Dirac Generalized Function.” The American Statistician 53, no. 3 (1999): 270–72.

Khuri, André I. “Applications of Dirac’s Delta Function in Statistics.” International Journal of Mathematical Education in Science and Technology 35, no. 2 (March 2004): 185–95. https://doi.org/10.1080/00207390310001638313.

2 Likes

Thanks! I was familiar with the application of Dirac’s \delta to \mathbb{R}^n \to \mathbb{R} transformations, but not the multivariate extension in Khuri (2004), eq (13).

Thank you. I can see that using the Jacobian determinant for bijections is a special case of this. Does this result have a name so that I could read up on it?

Generally there is not a(n immediate) practical solution.

Firstly we need to be more precise. Manifolds are very general spaces and cover every possible application of Hamiltonian Monte Carlo; in particular all real spaces are manifolds.

Consequently there are a few ways to interpret the problem:

  1. A user wants to define a model over a manifold not isomorphic to some real space.
  2. A user wants to define a model over a space implicitly defined by some real space \mathbb{R}^{N} and a sufficiently nice constraint function c: \mathbb{R}^{N} \rightarrow \mathbb{R}. For example the implicit space is defined by the kernel of the constraint function, c^{-1}(0). That implicit space might be non-Euclidean or not.

(1) is not possible directly.

Recall that Stan provides implementation of Hamiltonian Monte Carlo for real spaces. In particular Stan requires that the target distribution is defined by a probability density function, which requires global coordinates that span the entire ambient space. This immediately restricts the application of Stan to certain topological spaces – spheres, torii, Stiefel manifolds, and the like are all inadmissible because of their topological properties ( none of these spaces admit global coordinates, i.e. they are not isomorphic to \mathbb{R}^{N} for any N).

Because of its geometric foundations Hamiltonian Monte Carlo can be implemented on any smooth manifold, including those not isomorphic to a real space. The implementations are more complicated (building symplectic integrators over these spaces requires being able to calculate evolution along geodesic curves, states have to be saved in local charts, etc), but perhaps a bigger limitation is that valid expectation values, and hence natural posterior summaries, become much more subtle (no natural moments, cumulants, quantiles). For these reasons and more Stan has maintained that limited scope to real ambient spaces.

What is possible?

Well if the target distribution strongly concentrates in some subset of the ambient space then it might be well-approximated by a target distribution that is exactly zero outside of that subset. If that patch of the ambient space is isomorphic to some real space then the approximate target distribution can be implemented in Stan. In differential geometric language this means that the target distribution concentrates within a single chart, so that the model can be implemented within that one chart (the chart being the formal name for those “almost bijections” mentioned in a few places in the thread).

Alternatively it might be possible to embed the ambient space into a higher-dimensional space that is isomorphic to a real space (mathematically this is always possible, but it might not be obvious how to do it in practice). If the model can be lifted to that higher-dimensional real space then it can be implemented there directly. This is how the unit vector type is implemented in Stan.

(2) is almost the opposite problem where we start on a high-dimensional space.

While not implemented in Stan there are Hamiltonian Monte Carlo implementations that can be defined for embedded spaces implicitly defined by a constraint function; see for example Markov Chain Monte Carlo on Constrained Spaces which uses the RATTLE symplectic integrator to generate transitions confined to the embedded space. The practical performance of these methods, however, will depend on the complexity of the embedded space.

If the implicit space is isomorphic to a real space then it will be possible to parameterize the embedded space directly. If the model is defined over the (higher-dimensional) embedding space, however, then one has to figure out how to project the higher-dimensional density function down to a lower-dimensional density function which goes back to those nasty integrals I mentioned above.

If the implicit space is not isomorphic to a real space then we’re back to the problems introduced in (1).

3 Likes

Thank you for your detailed answer.

To clarify the original question: I would like to learn more about the practical part of implementing the solution above. A reading list for a non-mathematician who has taken “user level” courses in multivariate analysis, linear algebra, and measure-theoretic probability would be very much appreciated. I understand that the subfield of mathematics that would answer this for me is (probably) differential geometry, but it is a vast field which I find difficult to navigate.