Let's back up a little bit, because all of the talk about injectivity, bijectivity, images, and codomains is missing some important points. The problem with trying to understand the change of variables formula and its limitations is that it requires a deep dive into probability theory. I'll try to do that here, but it'll take a while. If any of the concepts below are confusing then I recommend taking a look at my probability theory case study, Probability Theory (For Scientists and Engineers).
Before talking about maps let's make sure we're on the same page with the basics. A probability space consists of an ambient space X, endowed with a \sigma-algebra \mathcal{X} consisting of "nice" subsets of X, and a probability distribution \pi that maps elements of the \sigma-algebra to probabilities in a way that's compatible with countable unions, intersections, and complements.
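To make this concrete, here's a minimal sketch of a finite probability space in Python (my own toy example, not from the case study): the outcomes of a fair die, with the power set as the \sigma-algebra and the uniform distribution assigning each subset its share of the total mass.

```python
from fractions import Fraction
from itertools import combinations

X = frozenset({1, 2, 3, 4, 5, 6})

def powerset(s):
    """Every subset of s: the discrete sigma-algebra over a finite space."""
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def pi(A):
    """Uniform probability distribution: each outcome carries mass 1/6."""
    return Fraction(len(A), len(X))

# pi maps subsets to probabilities compatibly with unions,
# intersections, and complements.
evens, low = frozenset({2, 4, 6}), frozenset({1, 2})
assert pi(evens | low) == pi(evens) + pi(low) - pi(evens & low)
assert pi(X - evens) == 1 - pi(evens)
assert len(powerset(X)) == 2 ** len(X)  # 64 measurable subsets
```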
Now let's consider another space Y equipped with its own \sigma-algebra \mathcal{Y}, along with a map F: X \rightarrow Y.
Nominally F just maps points in X to points in Y, but this point-wise mapping can also induce maps from objects defined on X to objects defined on Y. For example, by breaking a subset A \subset X into points, mapping those points to Y, and then collecting the outputs into another subset F(A) \subset Y, the original map F induces a map from subsets of X to subsets of Y. This kind of induced map in the same direction as F is called a pushforward along F.
At the same time F might also induce maps from objects defined on Y to objects defined on X. If F isn't bijective then we can't define an inverse point-wise map F^{-1} : Y \rightarrow X, but we can still define a map from subsets B \subset Y to subsets F^{-1}(B) \subset X. This kind of induced map in the opposite direction of F is called a pullback along F.
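Here's a small hypothetical sketch of the two induced set maps; the squaring map is deliberately non-injective so the pullback of a pushforward can be strictly larger than the original subset.

```python
def F(x):
    return x * x   # non-injective: distinct points can collide

def pushforward(A):
    """F(A): map each point of A forward and collect the outputs."""
    return {F(x) for x in A}

def pullback(B, domain):
    """F^{-1}(B): every point of the domain that lands in B."""
    return {x for x in domain if F(x) in B}

X = {-2, -1, 0, 1, 2}
A = {-1, 2}
print(pushforward(A))        # {1, 4}
print(pullback({1, 4}, X))   # {-2, -1, 1, 2}, larger than A since F isn't injective
```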
So the point-wise map F induces both a pushforward and pullback map between subsets on X and Y. These induced maps, however, will not in general respect the \sigma-algebras. In particular if A \in \mathcal{X} then the output of the pushforward map F(A) need not be in \mathcal{Y}, and vice versa for the pullback map.
If the pullback map is compatible with the \sigma-algebras, so that for every B \in \mathcal{Y} we have F^{-1}(B) \in \mathcal{X}, then we can define another induced pushforward map, this time between probability distributions. Every probability distribution \pi defined on X defines a pushforward probability distribution F_{*} \pi on Y via the probabilities
\mathbb{P}_{F_{*} \pi}[B] = \mathbb{P}_{\pi}[ F^{-1}(B) ].
Again we need F^{-1}(B) to be in \mathcal{X}, otherwise the initial probability distribution won't know how to assign a probability to the pullback subset.
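Here's a hedged sketch of that defining formula on a finite space, where every subset is automatically measurable; the particular map F and the uniform \pi are hypothetical choices just for illustration.

```python
from fractions import Fraction

X = {0, 1, 2, 3, 4, 5}
pi = {x: Fraction(1, 6) for x in X}   # uniform distribution on X

def F(x):
    return x % 3                      # non-injective map onto Y = {0, 1, 2}

def pushforward_prob(B):
    """P_{F_* pi}[B] = P_pi[F^{-1}(B)]: pull B back, then measure it."""
    preimage = {x for x in X if F(x) in B}
    return sum(pi[x] for x in preimage)

print(pushforward_prob({0}))      # 1/3, since F^{-1}({0}) = {0, 3}
print(pushforward_prob({0, 1}))   # 2/3, since F^{-1}({0, 1}) = {0, 1, 3, 4}
```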
Measurable functions/maps/transformations are just the maps satisfying the compatibility requirement that allows us to define pushforward probability distributions. In other words measurable maps are the only maps that allow us to translate probability distributions from one space to another.
Note that at this point no other requirement has been made on the structure of X, Y, and F. X and Y don't have to have the same dimensions, and F doesn't have to be bijective or even injective, so long as it satisfies the \sigma-algebra consistency property.
If the dimension of Y is less than the dimension of X then a measurable surjection F : X \rightarrow Y is commonly known as a projection map, and the corresponding pushforward distributions are known as marginal distributions.
If the dimensions of X and Y are the same, and both F and F^{-1} are measurable, then a bijection F: X \rightarrow Y is commonly known as a reparameterization.
(Side note: codomains are irrelevant here, as the \sigma-algebras and probability distributions of interest are all defined over the entire domain.)
The key difference between these two types of maps is that projections lose information while reparameterizations do not. If F is a reparameterization then we can start at \pi on X, push forward to F_{*} \pi on Y, and then push forward along F^{-1} to recover the original distribution,
(F^{-1})_{*} F_{*} \pi = \pi.
This is not true of projection functions: we can map \pi on X to F_{*} \pi on Y, but there's no way to recover \pi from that pushforward distribution.
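A quick Monte Carlo sketch of this asymmetry (the specific maps are my own illustrative choices): a reparameterization can be undone sample by sample, while a projection discards a coordinate that no map can restore.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(10_000, 2))   # samples from pi on X = R^2

# Reparameterization: a measurable bijection of R^2 onto itself.
def F(x):
    return 2.0 * x + 1.0

def F_inv(y):
    return (y - 1.0) / 2.0

y = F(x)
assert np.allclose(F_inv(y), x)    # (F^{-1})_* F_* pi recovers pi exactly

# Projection: R^2 -> R, keeping only the first coordinate.
y_marginal = x[:, 0]               # samples from the marginal distribution
# No map can take y_marginal back to x: the second coordinate is gone.
```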
Okay, so now we're finally ready to talk about probability density functions. Probability density functions quantify the difference between two measures: the density of one measure with respect to another tells us how much we have to reweight the latter, point by point, to recover the former. Mathematically we denote the density function of \pi_{2} with respect to \pi_{1} as
\pi_{21}(x) = \frac{ \mathrm{d} \pi_{2} }{ \mathrm{d} \pi_{1} } (x).
Most often we correct some standard "uniform" distribution on the ambient space to the probability distribution of interest. If X is a real space then that uniform distribution is the Lebesgue measure, \mathcal{L}. In other words the probability density function of \pi is actually the probability density function of \pi relative to the Lebesgue measure,
\pi(x) = \frac{ \mathrm{d} \pi }{ \mathrm{d} \mathcal{L} } (x).
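One way to see this numerically, assuming a standard normal \pi for concreteness: the probability that \pi assigns to a small interval, divided by the Lebesgue measure (length) of that interval, converges to the familiar pdf.

```python
from scipy.stats import norm

x, eps = 0.7, 1e-6
# Probability assigned to [x, x + eps] divided by its Lebesgue measure
# (its length) approximates the Radon-Nikodym derivative at x.
ratio = (norm.cdf(x + eps) - norm.cdf(x)) / eps
print(ratio, norm.pdf(x))   # the two values agree to about six digits
```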
Using the above machinery we can, in some cases, work out how to construct pushforward probability density functions. The basic idea is to take a distribution \pi on X, push it forward along F to F_{*} \pi on Y, and then construct the density of each with respect to the uniform measures on X and Y respectively. In other words
\pi(x) = \frac{ \mathrm{d} \pi }{ \mathrm{d} \mathcal{L}_{X} } (x) \mapsto \pi(y) = \frac{ \mathrm{d} F_{*} \pi }{ \mathrm{d} \mathcal{L}_{Y} } (y).
Notice that we push \pi forward along F but we define the densities with respect to the uniform distributions on X and Y respectively. We don't transform the uniform distribution on X into some distribution on Y because that pushforward distribution will in general no longer be uniform! Indeed when F: X \rightarrow Y is a measurable bijection the amount by which F warps the initial uniform distribution is exactly the Jacobian determinant!
Mathematically when F is a bijection we can write
\begin{align*}
\pi(y)
&= \frac{ \mathrm{d} F_{*} \pi }{ \mathrm{d} \mathcal{L}_{Y} } (y)
\\
&= \frac{ \mathrm{d} F_{*} \pi }{ \mathrm{d} F_{*} \mathcal{L}_{X} } (y)
\cdot \frac{ \mathrm{d} F_{*} \mathcal{L}_{X} }{ \mathrm{d} \mathcal{L}_{Y} } (y)
\\
&= \pi(F^{-1}(y)) \cdot | J |(y)
\end{align*}
which is exactly the usual "change of variables" formula that's otherwise pulled out of thin air. Here | J |(y) denotes the absolute value of the Jacobian determinant of F^{-1} evaluated at y.
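We can check this formula numerically. As an illustrative assumption take F(x) = \exp(x), a measurable bijection from the real line to the positive reals, with \pi standard normal; the pushforward is then the standard lognormal distribution.

```python
import numpy as np
from scipy.stats import norm, lognorm

y = 2.3
F_inv = np.log(y)              # F^{-1}(y)
jac = 1.0 / y                  # |J|(y) = |d F^{-1} / d y|
print(norm.pdf(F_inv) * jac)   # pi(F^{-1}(y)) * |J|(y)
print(lognorm.pdf(y, s=1.0))   # the known pushforward density: identical
```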
When F is a surjection then the density of the pushforward uniform distribution from X relative to the uniform distribution on Y, \mathrm{d} F_{*} \mathcal{L}_{X} / \mathrm{d} \mathcal{L}_{Y}, is singular and so the usual change of variables formula cannot be applied. In these cases working out the pushforward probability density functions, or marginal density functions, is much, much harder and usually cannot be done analytically.
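In practice one common fallback, sketched here with a hypothetical map of my own choosing, is to estimate the marginal density from pushforward samples, for example with a normalized histogram.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100_000, 2))   # samples from pi on X = R^2

def F(x):
    return x[:, 0] + x[:, 1] ** 3   # a many-to-one map down to Y = R

y = F(x)                            # samples from the pushforward F_* pi
# A histogram normalized to unit area approximates the marginal density
# d F_* pi / d Leb_Y, which has no convenient closed form here.
density, edges = np.histogram(y, bins=100, density=True)
```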