Why do transformations need to be invertible in the change of variables in probability theory?

Hi everyone,

Today I have been asking myself about something that, although not directly related to Stan, I think is a topic of interest for Stan users. It is something that, for some reason, is never explained (or at least I have not found the explanation) but rather always assumed. I think this is a great forum to find the answer.

The question is simple. When we have a change of variables, F: X \rightarrow Y, why do we require that the transformation F be a bijection? I guess the answer is that we must require each element of the set X to be identified with exactly one element of the set Y, and vice versa. In this case, the absolute value of the determinant of the Jacobian accounts for the change in volume that F produces:

p_Y(y) = p_X(F^{-1}(y))\left|\det \frac{\partial F^{-1}}{\partial y} \right|

The reason I ask this, and I apologize if it is a stupid or trivial question, is that I have seen that the general construction in the formula above comes from a more general result, integration by substitution (Integration by substitution - Wikipedia), where the condition required is only that F is injective, not bijective. In other words, the rule we apply in probability theory seems to hold for more general transformations F beyond the bijective ones, and I was wondering why, in the specific case of probability theory, we require our transformation F to be bijective.
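For concreteness, here is how I understand the formula in the scalar case. A quick numerical sketch of my own (assuming F(x) = e^x, X standard normal, and NumPy/SciPy available):

```python
import numpy as np
from scipy import stats

# Draw samples of X from a standard normal and push them through F(x) = exp(x).
rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)
y = np.exp(x)

# Change-of-variables density: p_Y(y) = p_X(F^{-1}(y)) |dF^{-1}/dy| = p_X(log y) / y.
grid = np.linspace(0.05, 6.0, 400)
p_y = stats.norm.pdf(np.log(grid)) / grid

# The histogram of the pushforward samples should match the analytic density.
hist, edges = np.histogram(y, bins=200, range=(0.05, 6.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(np.interp(centers, grid, p_y) - hist)))  # small, up to Monte Carlo noise
```

The samples and the density computed with the formula agree, as expected for this bijective F.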

Thanks for the answer

2 Likes

An injective function with its codomain restricted to its image is bijective, is it not?

As a first guess, for the integration you do not care about the part of the codomain that is not in the image? “We” also only care about the image of the function?

2 Likes

Yes, I think that if the codomain of an injective function is its image then the function will be a bijection. But I think that what I wrote above holds without this observation, so I am not sure about your point.

I think that whether you integrate or not does not really make a difference. There is an example in the link I sent, under the subtitle “Application to probability theory”.

Suppose the function is injective but not bijective. Now we want a Jacobian adjustment that works for an arbitrary prior that we might place on Y. If f isn’t bijective, then there’s a good chance that we will pick a prior that places nonzero probability mass on an element of Y that cannot be mapped back to X. If, on the other hand, we declare Y with an appropriate constraint to ensure no prior mass over the parts that don’t map to X, then f is a bijection.

Edit:
Put slightly differently, the restriction that we need is that our prior doesn’t put any probability density on elements of Y not in the image of f (this is related to the integration that @Funko_Unko is talking about). What’s the convenient way to express this restriction? Well, let’s require that f be bijective. If we expand the codomain of f beyond its image, then we just need to immediately crop those extra parts of the codomain back out via the prior.

Note that for any choice of f that is injective but not bijective, with an appropriate prior that puts no probability density outside the image, we can without loss of generality restrict the codomain to the image of f. So we don’t lose anything important by stating the requirement as ‘bijective.’

Note further that for a host of computational reasons (ranging from initialization to floating-point precision near the boundary) it is good practice in Stan to declare parameters with constraints whenever constraints are implied by the prior. So we get better computation by explicitly restricting the codomain to coincide with the image anyway.

2 Likes

Thanks both for the answers. So based on both of them we can conclude that the reason is that each element of the set X must be identified with a unique element of the set Y, without having to express that restriction through prior probabilities, but rather directly by construction.

1 Like

@betanalpha

It needs to be bijective because you are going to use the inverse of the function F. Let’s compute a probability:

P(F(X) < x) = P( X < F^{-1} (x) )

The only measure space you really know is the one on X, not the one on F(X), so every time you need to compute probabilities for F(X) you need to do that step (written here for a monotone increasing F). And therefore you need a bijective function.

This is actually a theorem: the change-of-variables theorem for densities of transformed random variables.

Actually, if you are not trying to do a nice global change of variables, F need not be invertible, you could just use its preimage.

Say F(x)=x^2; you can still compute P(F(X)<1) from p(x).
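For instance (a quick numerical check, assuming X is standard normal so that X^2 is chi-square with one degree of freedom):

```python
import numpy as np
from scipy import stats

# P(F(X) < 1) computed through the preimage F^{-1}([0, 1)) = (-1, 1).
p_preimage = stats.norm.cdf(1.0) - stats.norm.cdf(-1.0)

# The same probability from the known pushforward distribution of X^2.
p_pushforward = stats.chi2.cdf(1.0, df=1)

print(p_preimage, p_pushforward)  # both ~0.6827
```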

I guess the reason why you want a bijection between X and Y is exactly that you do not want to lose any information you have on X or Y, and hence you need this one-to-one mapping.

Concerning the bijection vs. injection, exp: R \to R is clearly an injection but not a bijection, but we can just restrict it to exp: R \to R^+, so there’s not really an issue there, or is there?

Edit:

After some more careful checks, I have realized that, as noted also by @asael_am, we can perform a change of variables and the only requirement is that the function that performs the change, h(), is measurable. However, only when h() is either a bijection or an injection can one use the equation I placed at the beginning of my post, and there are several ways to arrive at it: one is integration by substitution, but also writing P(h(X) \leq y) = P(X \leq h^{-1}(y)) and then differentiating to obtain the density gives the result.
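Spelling out that second route for a strictly increasing scalar h (the absolute value in the final formula covers the decreasing case):

\begin{align*} F_Y(y) &= P(h(X) \leq y) = P(X \leq h^{-1}(y)) = F_X(h^{-1}(y)) \\ p_Y(y) &= \frac{\mathrm{d}}{\mathrm{d}y} F_X(h^{-1}(y)) = p_X(h^{-1}(y)) \left| \frac{\mathrm{d} h^{-1}}{\mathrm{d}y}(y) \right| \end{align*}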

However, beyond this fact, I think it is interesting to know what pushes us to use bijections and not injections. I guess, as already stated, it is because we want each element of X to be uniquely identified with an element of Y. But beyond this, why not just an injection? Are there any other reasons? What could be the implications?

Thanks again

Every injective function corresponds to a bijective function whose codomain is restricted to the image. So the only question here is about how we think about the codomain.

The purpose of doing a Jacobian adjustment is to obtain inference based on some prior density function expressed over the codomain. For example, given some univariate function f(x) whose codomain is the entire real line, suppose that I have prior knowledge that f(x) is normally distributed, and so I write target += normal_lpdf(f(x) | 0, 1). The purpose of the Jacobian adjustment is to ensure that the prior density for f(x) is actually the standard normal.

If f(x) = e^x, then I cannot achieve my desired prior density for f(x)! So the Jacobian adjustment hasn’t worked! Instead of e^x \sim Normal(0,1), it yields e^x \sim RTHN(0, 1), where RTHN is the Right-hand Tail of a Half Normal. This is inconsistent with my domain knowledge and is not the prior that I intended!

If the codomain of f were restricted to the positive reals, then I would have known from the beginning that I couldn’t expect f(x) to be normally distributed, and I would have known that in writing target += normal_lpdf(f(x) | 0, 1) plus a Jacobian adjustment, that I was obtaining a half-normal prior rather than a standard normal.
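To see this numerically, here is a small sketch in Python rather than Stan (assuming f(x) = e^x, the statement target += normal_lpdf(exp(x) | 0, 1), and the Jacobian adjustment target += x for y = exp(x)):

```python
import numpy as np
from scipy import stats

# Unnormalized log target over the unconstrained x, mimicking
#   target += normal_lpdf(exp(x) | 0, 1);
#   target += x;  // log |d exp(x)/dx|, the Jacobian adjustment
def log_target(x):
    return stats.norm.logpdf(np.exp(x)) + x

# Implied density of y = exp(x): change variables back with |dx/dy| = 1/y.
y = np.linspace(1e-3, 4.0, 4000)
p_y = np.exp(log_target(np.log(y))) / y
p_y /= np.sum(p_y) * (y[1] - y[0])  # normalize numerically on the grid

# Compare against the half-normal density 2 * N(y; 0, 1) on y > 0.
print(np.max(np.abs(p_y - 2.0 * stats.norm.pdf(y))))  # small, up to truncation/grid error
```

The prior implied for exp(x) is the positive half of a normal, not the standard normal that the normal_lpdf statement might suggest at first glance.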

Let’s back up a little bit because all of the talk about injectivity, bijectivity, images, and codomains is missing some important points. The problem with trying to understand the change of variables formula and its limitations is that it requires a deep dive into probability theory. I’ll try to do that here, but it’ll take a while. If any of the below concepts are confusing to anyone reading along then I recommend taking a look at my probability theory case study, Probability Theory (For Scientists and Engineers).

Before talking about maps let’s make sure we’re on the same page with the basics. A probability space consists of an ambient space X, endowed with a \sigma-algebra \mathcal{X} consisting of “nice” subsets of X, and a probability distribution \pi that maps elements of the \sigma-algebra into probabilities in a way that’s compatible with countable unions, intersections, and complements.

Now let’s consider another space Y equipped with its own \sigma-algebra \mathcal{Y} along with a map F: X \rightarrow Y.

Nominally F just maps points in X to points in Y, but this point-wise mapping can also induce maps from objects defined on X to objects defined on Y. For example, by breaking a subset A \subset X into points, mapping them to Y, and then collecting those output points in another subset F(A) \subset Y, the original map F induces a map from subsets on X to subsets on Y. This kind of induced map in the same direction as F is called a pushforward along F.

At the same time F might also induce maps from objects defined on Y to objects defined on X. If F isn’t bijective then we can’t define an inverse point-wise map F^{-1} : Y \rightarrow X, but we can still define a map from subsets B \subset Y to subsets F^{-1}(B) \subset X. This kind of induced map in the opposite direction to F is called a pullback along F.

So the point-wise map F induces both a pushforward and pullback map between subsets on X and Y. These induced maps, however, will not in general respect the \sigma-algebras. In particular if A \in \mathcal{X} then the output of the pushforward map F(A) need not be in \mathcal{Y}, and vice versa for the pullback map.

If the pullback map is compatible with the \sigma-algebras, so that for every B \in \mathcal{Y} we have F^{-1}(B) \in \mathcal{X}, then we can define another induced pushforward map, this time between probability distributions. Every probability distribution \pi defined on X defines a pushforward probability distribution F_{*} \pi on Y via the probabilities

\mathbb{P}_{F_{*} \pi}[B] = \mathbb{P}_{\pi}[ F^{-1}(B) ].

Again we need F^{-1}(B) to be in \mathcal{X} otherwise the initial probability distribution won’t know how to assign a probability to the pullback subset.

Measurable functions/maps/transformations are just the maps satisfying the compatibility requirement that allows us to define pushforward probability distributions. In other words measurable maps are the only maps that allow us to translate probability distributions from one space to another.
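As a toy illustration of the pushforward construction (a finite, hypothetical example where the power sets serve as the \sigma-algebras, so every map is measurable):

```python
# Pushforward of a distribution on X = {1, 2, 3, 4} along a non-injective map to Y = {"a", "b"}.
pi_X = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}   # probability distribution on X
F = {1: "a", 2: "a", 3: "b", 4: "b"}      # point-wise map F: X -> Y

def pushforward_prob(B):
    """P_{F*pi}[B] = P_pi[F^{-1}(B)]: sum the X-probabilities over the preimage of B."""
    preimage = [x for x, y in F.items() if y in B]
    return sum(pi_X[x] for x in preimage)

print(pushforward_prob({"a"}))            # 0.1 + 0.2 = 0.3
print(pushforward_prob({"a", "b"}))       # 1.0, the whole space
```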

Note that at this point no other requirement has been made on the structure of X, Y, and F. X and Y don’t have to have the same dimensions, F doesn’t have to be bijective or even injective so long as it satisfies the \sigma-algebra consistency property.

If the dimension of Y is less than the dimension of X then a measurable surjection F : X \rightarrow Y is commonly known as a projection map, and pushforward distributions are known as marginal distributions.
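For instance (a hypothetical numerical sketch with a correlated bivariate normal on X = \mathbb{R}^2 and the coordinate projection F(x_1, x_2) = x_1):

```python
import numpy as np

# Bivariate normal on X = R^2 and the projection F(x1, x2) = x1 onto Y = R.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 2.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)

# Pushing samples forward along the projection just keeps the first coordinate.
y = samples[:, 0]

# The pushforward (marginal) distribution is Normal(0, sqrt(cov[0, 0])); compare moments.
print(y.mean(), y.std())  # approximately 0 and 1
```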

If the dimension of X and Y are the same and both F and F^{-1} are measurable then a bijection F: X \rightarrow Y is commonly known as a reparameterization.

(Side note: codomains are irrelevant here as the \sigma-algebras and probability distributions of interest are all defined over the entire domain).

The key difference between these two types of maps is that projections lose information while reparameterizations do not. If F is a reparameterization then we can start at \pi on X, pushforward to F_{*} \pi on Y, then pushforward along F^{-1} to recover the original distribution,

(F^{-1})_{*} F_{*} \pi = \pi.

This is not true of projection functions – we can map \pi on X to F_{*} \pi on Y but there’s no way to recover \pi from that pushforward distribution.

Okay, so now we’re finally ready to talk about probability density functions. Probability density functions are functions that quantify the difference between two measures. Mathematically we denote the density function of \pi_{2} with respect to \pi_{1} as

\pi_{21}(x) = \frac{ \mathrm{d} \pi_{2} }{ \mathrm{d} \pi_{1} } (x).

Most often we correct some standard “uniform” distribution on the ambient space to the probability distribution of interest. If X is a real space then that uniform distribution is the Lebesgue measure, \mathcal{L}. In other words the probability density function of \pi is actually the probability density function of \pi relative to the Lebesgue measure,

\pi(x) = \frac{ \mathrm{d} \pi }{ \mathrm{d} \mathcal{L} } (x).

Using the above machinery we can in some cases work out how to construct pushforward probability density functions. The basic idea is to take a distribution on X, push it forward along F to F_{*} \pi on Y and then construct the density of each with respect to the uniform measures on X and Y respectively. In other words

\pi(x) = \frac{ \mathrm{d} \pi }{ \mathrm{d} \mathcal{L}_{X} } (x) \mapsto \pi(y) = \frac{ \mathrm{d} F_{*} \pi }{ \mathrm{d} \mathcal{L}_{Y} } (y).

Notice that we pushforward \pi along F but we define the densities with respect to the uniform distributions on X and Y respectively. We don’t transform the uniform distribution on X to some distribution on Y because that pushforward distribution will in general no longer be uniform! Indeed when F: X \rightarrow Y is a measurable bijection the amount by which F warps the initial uniform distribution is just the Jacobian determinant!

Mathematically when F is a bijection we can write

\begin{align*} \pi(y) &= \frac{ \mathrm{d} F_{*} \pi }{ \mathrm{d} \mathcal{L}_{Y} } (y) \\ &= \frac{ \mathrm{d} F_{*} \pi }{ \mathrm{d} F_{*} \mathcal{L}_{X} } (y) \cdot \frac{ \mathrm{d} F_{*} \mathcal{L}_{X} }{ \mathrm{d} \mathcal{L}_{Y} } (y) \\ &= \pi(F^{-1}(y)) \cdot | J |(y) \end{align*}

which is exactly the usual “change of variables” formula that’s pulled out of thin air.
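As a concrete instance, take X = \mathbb{R}, Y = (0, \infty), F(x) = e^x, and \pi a standard normal on X. Then F^{-1}(y) = \log y and |J|(y) = 1/y, so

\pi(y) = \pi(\log y) \cdot \frac{1}{y} = \frac{1}{y \sqrt{2 \pi}} \exp\left( -\frac{(\log y)^2}{2} \right),

which is the standard lognormal density.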

When F is a surjection the density of the pushforward uniform distribution from X relative to the uniform distribution on Y, \mathrm{d} F_{*} \mathcal{L}_{X} / \mathrm{d} \mathcal{L}_{Y}, is singular and so the usual change of variables formula cannot be applied. In these cases working out the pushforward probability density functions, or the marginal density functions, is much, much harder and usually cannot be done analytically.
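As an illustration of how hard this gets, here is a sketch with hypothetical choices: F(x_1, x_2) = x_1 + x_2 with X_1 \sim \text{Normal}(0, 1) and X_2 \sim \text{Lognormal}(0, 1), a surjection from a two-dimensional space to a one-dimensional one whose pushforward density has no closed form and must be computed by integrating over each preimage fiber:

```python
import numpy as np
from scipy import stats, integrate

# Pushforward density of Y = X1 + X2 with X1 ~ Normal(0, 1), X2 ~ Lognormal(0, 1):
# integrate the joint density over the preimage fiber {(y - t, t) : t > 0}.
def marginal_pdf(y):
    integrand = lambda t: stats.norm.pdf(y - t) * stats.lognorm.pdf(t, s=1.0)
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value

# Monte Carlo check of the quadrature at y = 2.
rng = np.random.default_rng(0)
y_samples = rng.standard_normal(500_000) + rng.lognormal(0.0, 1.0, 500_000)
hist, edges = np.histogram(y_samples, bins=400, range=(-5.0, 15.0), density=True)
bin_of_two = np.searchsorted(edges, 2.0) - 1
print(marginal_pdf(2.0), hist[bin_of_two])  # quadrature and Monte Carlo roughly agree
```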

7 Likes

Once the probability distribution is parameterized yielding a probability density function over \mathcal{Y}, is this not precisely equivalent to requiring that the codomain be restricted to the image?

1 Like

No, not necessarily. The requirement is not just that F^{-1}(B) \subset X but rather that F^{-1}(B) is an element of the particular \sigma-algebra \mathcal{X} defined on X. One cannot talk about probability distributions, let alone the transformation of probability distributions, using the structure of the ambient space alone; one always has to consider the specific \sigma-algebra that accompanies that space.

I didn’t go into this above, but in most cases there is a natural \sigma-algebra to consider based on the topology of the ambient space, known as the Borel \sigma-algebra. When using the Borel \sigma-algebras most maps that preserve topological structure will automatically be measurable. For example in this case continuous, surjective maps are measurable.

That said, on the real numbers one technically has to consider not just the Borel but also the Lebesgue \sigma-algebra, which is ever so slightly different.

3 Likes

You're a freaking legend! Thanks for this reply.

2 Likes