Zero-inflated probability distribution

I took a look at this

hurdle models wrap a lot of really subtle probability theory, which is why they can be so confusing once you look a little deeper.

I’m trying to dig a little deeper with some formalism, and indeed I am confused.

So I’m thinking about representing the observational space as

X \times Y

where X := \{0, 1\}, Y := \mathbb{R}. The measure, specified on product sets \{x\} \times B, is \nu(\{x\} \times B) := \begin{cases} (1 - p)\, \mathbb{1}[0 \in B] & \text{if } x = 0\\ p \int_{B} f_{y}(y)\, dy & \text{if } x = 1 \end{cases}

So x is a latent variable indicating whether we are in the inflated-zero component, and y is our measurement with some conventional density f_{y}. To define the total density via the Radon-Nikodym theorem, we use the following dominating measure

\mu(A) := c \circ \omega_{x}\bigl(A \cap \{y = 0\}\bigr) + \mathcal{L} \circ \omega_{y}\bigl(A \cap \{x = 1, y \neq 0\}\bigr)
with c the counting measure on X, \mathcal{L} the Lebesgue measure on Y, and \omega_{x}, \omega_{y} the projections onto X and Y respectively. In words: \mu is counting measure on the y = 0 slice and Lebesgue measure along the x = 1, y \neq 0 slice.

This motivates the density

f(x, y) := \mathbb{1}[x = 0, y = 0]\, (1 - p) + \mathbb{1}[x = 1, y \neq 0]\, p\, f_{y}(y)

which integrates against our dominating measure to recover the original probability measure and projects down to the likelihood defined in the original thread.
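
To convince myself the bookkeeping works, here is a minimal numerical sketch (taking f_{y} to be a standard normal purely for illustration, and p = 0.7; the single point y = 0 has Lebesgue measure zero so any density on \mathbb{R} would do): integrating f against \mu amounts to summing the atoms on the y = 0 slice and doing an ordinary integral over the y \neq 0 slice.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = 0.7  # assumed inflation parameter, for illustration only

def f(x, y):
    """Density with respect to the dominating measure mu; f_y is a standard normal here."""
    if x == 0 and y == 0.0:
        return 1.0 - p
    if x == 1 and y != 0.0:
        return p * norm.pdf(y)
    return 0.0  # the remaining points carry no probability

# Counting-measure part of mu: the atoms on the y = 0 slice ...
atoms = f(0, 0.0) + f(1, 0.0)
# ... plus the Lebesgue part along the x = 1, y != 0 slice.
continuous = quad(lambda y: f(1, y), -np.inf, 0.0)[0] + quad(lambda y: f(1, y), 0.0, np.inf)[0]

print(atoms + continuous)  # ~1.0, so f really is a probability density with respect to mu
```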

Problem is, while the counting measure and the Lebesgue measure are usually the obvious choices of dominating measure for discrete and continuous variables, I pulled this dominating measure out of my hat. That raises a few questions for me:

  1. Are probability densities in the measure theoretic sense only well-defined up to a choice of dominating measure?
  2. If they do depend on the dominating measure, is there some canonical way of choosing one?
  3. If there isn’t, how does all this work with probabilistic computation? Why is this particular density “correct” for computation with Stan as opposed to others?

I’m self-taught in statistics, so please excuse any naive errors. Corrections encouraged, even on minutiae! Your criticisms only increase my power level.

1 Like

These are very, very important questions!

Exactly!

Yes! As you note, probability density functions are defined by the Radon-Nikodym theorem, which always involves two measures. In practice the symmetry of those two arguments is often broken by taking one of those measures to be a “base” measure (or dominating measure), but that’s a consequence of how we use the Radon-Nikodym theorem, not the theorem itself.

Unfortunately the base measure is often taken for granted, which is why it’s so easily forgotten. People say “probability density function of a distribution \pi” instead of “probability density function of a distribution \pi with respect to the base measure \nu”. But it’s this choice of base measure that determines for example how probability density functions transform; see for example my comment at Why transformations need to be invertible in change of variable in probability theory? - #11 by betanalpha.

As for a canonical choice of base measure: not without imposing additional constraints on any given problem.

Given a probability distribution \pi there will in general be an infinite number of measures \nu that dominate \pi and hence can serve as valid base measures. In some cases the structure of the ambient space over which \pi is defined motivates particular measures, but those measures are not unique in their ability to help represent \pi.
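
To make that non-uniqueness concrete with a toy case of my own choosing: let \pi be the unit exponential distribution on (0, \infty). With respect to the Lebesgue measure \mathcal{L} its density is the familiar one, but with respect to the (equally valid) dominating measure \nu(\mathrm{d}x) = 2 e^{-2x} \, \mathrm{d}x it looks completely different,

\frac{\mathrm{d} \pi}{\mathrm{d} \mathcal{L}}(x) = e^{-x}, \qquad \frac{\mathrm{d} \pi}{\mathrm{d} \nu}(x) = \frac{e^{-x}}{2 e^{-2x}} = \tfrac{1}{2} e^{x}.

Integrating each density against its own base measure recovers exactly the same probabilities and expectation values.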

For example the structure of discrete spaces naturally motivates the counting measure. Probability density functions with respect to the counting measure are also known as probability mass functions. The counting measure has the nice property of being uniform over the countable elements of any discrete space, and in some sense it’s the notion of uniformity that elevates it above any other measure. This uniformity is also invariant to 1-1 transformations – on a discrete space any 1-1 transformation can be decomposed into permutations, and permutations don’t affect the counting measure.
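
A tiny sketch of that invariance (with an arbitrary three-point probability mass function, nothing special about it): relabeling the points just shuffles the probability masses around, with no correction factor, because every point carries counting measure one both before and after.

```python
# An arbitrary pmf on {0, 1, 2} and a bijective relabeling of the points.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}
relabel = {0: 2, 1: 0, 2: 1}  # any 1-1 map of a finite set is a permutation

# Pushforward pmf: each probability mass simply moves with its label.
pushforward = {relabel[k]: mass for k, mass in pmf.items()}

# No Jacobian-style factor appears; the multiset of masses is unchanged.
print(sorted(pmf.values()) == sorted(pushforward.values()))  # True
```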

On real spaces the equivalent notion of uniform is given by the Lebesgue measure, and indeed when someone says “probability density function” they almost always mean “probability density function with respect to the Lebesgue measure” whether they are aware of it or not!

Unfortunately the Lebesgue measure doesn’t transform as nicely as the counting measure. Under a general 1-1 transformation or “reparameterization” \phi: X \rightarrow Y the Lebesgue measure on X does not transform into the Lebesgue measure on Y! This is because the Lebesgue measure is defined by the local metric structure of a real space and that structure isn’t preserved under non-linear transformations; another way to think about it is that there isn’t one real space but rather an infinite number of them, each with their own Lebesgue measure.

Consequently when we transform from X to Y we have two different probability density functions that we might consider! There’s the Radon-Nikodym derivative of the transformed distribution \phi_{*} \pi with respect to the Lebesgue measure on X transformed to a measure on Y, \phi_{*} \mathcal{L}_{X},

\pi_{1}(y) = \frac{ \mathrm{d} \phi_{*} \pi }{ \mathrm{d} \phi_{*} \mathcal{L}_{X} } (y),

and the Radon-Nikodym derivative of the transformed distribution \phi_{*} \pi with respect to the Lebesgue measure on Y,

\pi_{2}(y) = \frac{ \mathrm{d} \phi_{*} \pi }{ \mathrm{d} \mathcal{L}_{Y} } (y).

Because \phi_{*} \mathcal{L}_{X} and \mathcal{L}_{Y} are in general different measures these are different functions! In fact the difference between these functions is exactly the “Jacobian determinant” that shows up in discussions of “change of variables”.
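
A concrete example, with the distribution and map chosen arbitrarily just to expose the bookkeeping: take \pi to be a standard normal distribution on X = \mathbb{R} and \phi(x) = e^{x}, so that Y = (0, \infty). Then

\pi_{1}(y) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(\log y)^{2}}{2} \right), \qquad \pi_{2}(y) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(\log y)^{2}}{2} \right) \cdot \frac{1}{y}.

\pi_{2} is the familiar lognormal density while \pi_{1} is just the normal density re-expressed in the new coordinate, and the factor 1/y = \left| \mathrm{d} \phi^{-1} / \mathrm{d} y \right| separating them is exactly that Jacobian determinant.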

It’s only by explicitly recognizing which base measure is being considered that we can make sure that we’re doing this bookkeeping correctly. Otherwise we’re just trying to pattern match poorly-motivated rules!

Your third question is why I yell so much about these seemingly irrelevant mathematical technicalities.

Probabilistic computational algorithms estimate expectation values of a given probability distribution. Probability density functions with respect to various base measures can be used to evaluate these expectation values with integrals (sometimes known as the “Law of the Unconscious Statistician” but statisticians are weird) but the density functions themselves don’t define them. In particular the expectation values will always be the same no matter which base measure we choose.

In other words any well-defined probabilistic computational method should depend only on the invariant properties of a probability distribution! Any dependence on a particular probability density function is an undesirable artifact of the method or its implementation!

With this insight it becomes much easier to analyze probabilistic computational methods. For example Monte Carlo and Markov chain Monte Carlo use samples drawn from a probability distribution which have nothing to do with any base measure or probability density function! Once the samples have been generated Monte Carlo and Markov chain Monte Carlo estimators are completely independent of densities. On the other hand how the sample generation is implemented can very well depend on a particular chosen base measure, and that dependence manifests in awkward tuning problems for the method.
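
Here is a small sketch of that invariance, using the mean of a lognormal distribution as an arbitrary stand-in target: the Monte Carlo estimator only ever touches samples, so it returns the same answer whether we draw on the constrained space directly or draw on an unconstrained space and push the samples forward, even though the two parameterizations come with different density functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Target expectation: E[theta] for theta ~ LogNormal(0, 1), which equals exp(1/2).

# Parameterization 1: draw theta directly on the constrained space (0, inf).
theta = rng.lognormal(mean=0.0, sigma=1.0, size=n)

# Parameterization 2: draw z ~ Normal(0, 1) on the unconstrained space and push forward.
theta_pushed = np.exp(rng.normal(size=n))

# Both estimators agree up to Monte Carlo error, even though the densities of theta and z
# with respect to their respective Lebesgue base measures differ by a Jacobian factor.
print(theta.mean(), theta_pushed.mean(), np.exp(0.5))
```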

Ultimately probability density functions (with respect to a particular base measure!) are useful ways to represent probability distributions in practice, but only at the cost of introducing irrelevant structure that we have to make sure does not corrupt our probabilistic computations.

I emphasize many of these points in my writing – check out in particular Probability Theory (For Scientists and Engineers), Rumble in the Ensemble, and Probabilistic Computation if you haven’t had a chance yet.

2 Likes

I see, so the inferences come out the same for any density that preserves our desired probability measure in the integration. So Stan doesn’t need to know what our reference measure is to do the sampling against an arbitrary density (though I understand it has some trouble with discrete parameters, so I guess it doesn’t handle the counting measure well)? That’s cool; I’ll have to do some more reading, probably of your papers and guides, to understand how that works. Thanks!

Correct. That’s why Stan can handle constraints by transforming to an unconstrained space, sampling, and then transforming the samples back to the constrained space. So long as the transformations are accounted for correctly the inferences are exactly the same.
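
A sketch of what “accounted for correctly” means at the level of densities, with an Exponential(1) distribution standing in for whatever constrained target is of interest: re-expressing a density on (0, \infty) on the unconstrained log scale picks up a log-Jacobian term, and with that term included the pushforward distribution, and hence every expectation value, is unchanged.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import expon

# Log density of the constrained target (sigma > 0) with respect to Lebesgue measure on (0, inf).
def log_density_constrained(sigma):
    return expon.logpdf(sigma)

# The same distribution re-expressed on the unconstrained space via sigma = exp(u).
# The "+ u" term is log |d sigma / d u|, the Jacobian adjustment.
def log_density_unconstrained(u):
    return log_density_constrained(np.exp(u)) + u

# Both densities integrate to one over their respective spaces.
print(quad(lambda s: np.exp(log_density_constrained(s)), 0.0, np.inf)[0])   # ~1.0
print(quad(lambda u: np.exp(log_density_unconstrained(u)), -30.0, 5.0)[0])  # ~1.0
```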

Ah, but a Stan program does specify the reference measure! It’s the Lebesgue measure over the real space defined in the parameters block.

You might hear people say that if you don’t add prior statements then Stan “defaults to a uniform prior”. That’s not really correct. What’s really happening is that a Stan program starts with the Lebesgue measure over the space defined in the parameters block and then each statement adds densities with respect to that measure. Not adding a density is equivalent to leaving that base measure undisturbed.
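
A rough Python analogue of that picture, not Stan itself but a sketch of the same accumulation (the Normal prior and likelihood are arbitrary stand-ins): the model is a function that starts from log density zero, i.e. the log density of the Lebesgue base measure with respect to itself, and each statement increments it. Omitting the prior increment leaves the base measure undisturbed, which is what gets loosely described as the “default uniform prior”.

```python
import numpy as np
from scipy.stats import norm

def log_target(theta, y):
    """Sketch of a model block for data y ~ Normal(theta, 1) with a Normal(0, 5) prior."""
    lp = 0.0                                # start from the Lebesgue base measure
    lp += norm.logpdf(theta, 0.0, 5.0)      # prior statement: a density w.r.t. that base measure
    lp += norm.logpdf(y, theta, 1.0).sum()  # likelihood statements
    return lp                               # log posterior density, up to a constant

y = np.array([0.3, -1.2, 0.8])
print(log_target(0.0, y))
```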

This is another great example for why we can go only so far with heuristic understanding.

Most algorithms can actually be framed completely measure-theoretically, which means without referencing densities at all. Densities instead show up in the implementation of those algorithms.

For example in theory Hamiltonian Monte Carlo works with an abstract target probability distribution and can be written without having to specify a base measure and density for the target distribution at all. In practice, however, we almost always implement the Hamiltonian exploration using density functions for both the target distribution and the distribution of momenta which defines the “kinetic energy” (for Hamiltonian systems a base measure on the parameter space automatically defines a base measure for the momenta). The interaction between those two densities for a given base measure then defines the performance of the algorithm.
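
For concreteness, here is a bare-bones sketch of that implementation pattern, with a fixed step size, fixed path length, and a unit-normal momentum distribution (so nothing like Stan’s adaptive sampler): the target and the momenta enter only through log density functions and their gradients, all taken with respect to the Lebesgue base measure on the unconstrained parameter space.

```python
import numpy as np

rng = np.random.default_rng(1)

def hmc_step(q, log_density, grad_log_density, eps=0.1, n_leapfrog=20):
    """One Hamiltonian Monte Carlo transition for a target specified by its log density."""
    p = rng.normal(size=q.shape)  # draw unit-normal momenta (the "kinetic energy" distribution)
    q_new, p_new = q.copy(), p.copy()
    # Leapfrog integration of the Hamiltonian dynamics.
    p_new = p_new + 0.5 * eps * grad_log_density(q_new)
    for _ in range(n_leapfrog - 1):
        q_new = q_new + eps * p_new
        p_new = p_new + eps * grad_log_density(q_new)
    q_new = q_new + eps * p_new
    p_new = p_new + 0.5 * eps * grad_log_density(q_new)
    # Metropolis correction: only differences of log densities appear.
    log_accept = (log_density(q_new) - 0.5 * p_new @ p_new) - (log_density(q) - 0.5 * p @ p)
    return q_new if np.log(rng.uniform()) < log_accept else q

# Example: a standard normal target in two dimensions.
log_density = lambda q: -0.5 * q @ q
grad_log_density = lambda q: -q
draws = [np.zeros(2)]
for _ in range(5000):
    draws.append(hmc_step(draws[-1], log_density, grad_log_density))
print(np.mean(draws, axis=0))  # ~ [0, 0]
```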

Ah, so a prior density function is just the Radon-Nikodym derivative between a measure that represents our prior beliefs about a parameter and the Lebesgue measure. Since the formal statement of Bayes’ theorem is

\frac{d \mu(\theta | x)}{d \mu(\theta)} = \frac{1}{c}\frac{d \mu(x | \theta)}{d \lambda} \propto\frac{d \mu(x | \theta)}{d \lambda}

we integrate against the prior measure to get a posterior expectation

E[f(\theta)] = \frac{1}{c} \int_{\mathbb{R}} f(\theta)\, \frac{d \mu(x | \theta)}{d \lambda}\, d \mu(\theta)

which, by the measure-theoretic analog of the change of variables formula, becomes

E[f(\theta)] = \frac{1}{c} \int_{\mathbb{R}} f(\theta)\, \frac{d \mu(x | \theta)}{d \lambda}\, \frac{d\mu(\theta)}{d \mathcal{L}}\, d \mathcal{L}

so the prior density can be thought of as a sort of Jacobian for a change of coordinates into \mathbb{R}^{n}.

1 Like

Kind of sort of.

You’re correct that the formal statement of Bayes’ Theorem refers to Radon-Nikodym derivatives. For example the Radon-Nikodym derivative between the prior distribution and the posterior distribution is the likelihood ratio

\frac{ \mathrm{d} \pi_{y} }{ \mathrm{d} \pi } (\theta) = \frac{ \pi( \tilde{y} \mid \theta) }{ \pi(\tilde{y} ) }.

That likelihood ratio is also equal to the Radon-Nikodym derivatives of the data generating processes in the observational model relative to the prior predictive distribution. Really Bayes’ Theorem just tells us how all of these Radon-Nikodym derivatives are related, which lets us translate between all of the marginal and conditional measures on a product space. See for example Conditional Probability Theory (For Scientists and Engineers).
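
Spelling the bookkeeping out once as an expectation: any posterior expectation can be computed by integrating against the prior with that Radon-Nikodym derivative as a weight,

\mathbb{E}_{\pi_{y}}[ f(\theta) ] = \int f(\theta) \, \frac{ \mathrm{d} \pi_{y} }{ \mathrm{d} \pi }(\theta) \, \pi(\mathrm{d}\theta) = \frac{1}{ \pi(\tilde{y}) } \int f(\theta) \, \pi(\tilde{y} \mid \theta) \, \pi(\mathrm{d}\theta),

with no reference measure or Lebesgue density anywhere in sight; those enter only once we choose to represent the prior \pi itself with a density.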

But the second step that relates the prior distribution to a reference measure and corresponding prior density doesn’t really have anything to do with Bayes’ Theorem per se. It’s just a convenient technique that we can employ to facilitate practical implementations of these abstract distributions.

Finally, to be super technical, while some density functions can be considered as the Jacobian of some transformation, not every density function can be. In particular I’m pretty sure that there’s no map from \mathbb{R}^{n} \rightarrow \mathbb{R}^{n} that pushes forward a Lebesgue measure to some \sigma-finite measure (i.e. a normalizable measure that we can scale into a probability distribution) and vice versa. In other words if no such map exists then there’s no way to transform a uniform measure on the reals into a probability distribution and vice versa, and Radon-Nikodym derivatives between a probability distribution and a uniform measure have no Jacobian equivalent.

2 Likes

“σ-finite” does not mean “normalizable,” it means a countable union of sets of finite measure.
But yes, not every density is a Jacobian.

Sorry, I meant finite and not just \sigma-finite.

Yep. It means the ambient space is a countable union of sets of finite measure.

Just to hammer the point home, Lebesgue on \mathbb{R} is \sigma-finite, but not finite, for instance.