Conditioning a random variable on the sum of random variables

Hello there!

I recently faced a problem with sampling and wanted to share it with you guys. Maybe someone can help me out! Suppose we have two independent random variables X \sim F_{\alpha} and Y \sim F_{\beta} with known \alpha and \beta. The problem is to sample from X | X + Y, that is: what is the distribution of X given the sum X + Y when both marginal distributions are known?

Well, since we do not have a closed-form expression for the density of this conditional distribution in general, I think it would be nice to propose a general solution, if possible. It is easy to write a Stan program that does this. Below I show the code for the case where X and Y come from normal distributions: we put a prior on X, and the likelihood is X + Y | X, which has a known distribution. Happily, we can see that the Stan sampler works appropriately, because we can compare it with the analytical form of the conditional distribution. In the image below, the blue histogram represents samples from the true conditional distribution, while the blue line is the Stan posterior result. They agree!
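
For the record, the analytical form in this normal case follows from the standard results for jointly normal variables: with X and Y independent,

X \mid X + Y = s \sim \text{normal}\left( \mu_{1} + \frac{\sigma_{1}^{2}}{\sigma_{1}^{2} + \sigma_{2}^{2}} (s - \mu_{1} - \mu_{2}), \ \sqrt{\frac{\sigma_{1}^{2} \sigma_{2}^{2}}{\sigma_{1}^{2} + \sigma_{2}^{2}}} \right),

where the second argument is the standard deviation, following Stan's parameterization.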

[image: samples from the true conditional distribution (blue histogram) vs. the Stan posterior (blue line), normal case]

Code: normal distribution
data {
    real s;                        // observed value of the sum s = x + y
    real mu1;                      // mean of x
    real mu2;                      // mean of y
    real<lower=0> sigma1;          // standard deviation of x
    real<lower=0> sigma2;          // standard deviation of y
}
parameters {
    real x;
}
model {
    x ~ normal(mu1, sigma1);       // "prior": the marginal distribution of x
    s ~ normal(mu2 + x, sigma2);   // "likelihood": x + y | x ~ normal(mu2 + x, sigma2)
}

However, if we take another very simple case, problems start to appear. Suppose X and Y have gamma distributions with a shared rate parameter. We can again derive an analytical expression for the conditional distribution for comparison. Unfortunately, Stan does not work well when the data come from gamma distributions with small shape parameters! With \alpha_x = 0.2, \alpha_y = 0.2 and \beta = 0.2, the image below shows the result. Notice that the samples from the true conditional distribution (the histogram) concentrate near the bounds! Divergences appear mainly next to the boundary, but not exclusively. This is not so surprising, since the curvature changes too much and a large part of the mass is near the boundary. If we suppose that the random variables come from beta distributions, we do not have closed-form expressions to compare against, but the divergences show that this problem persists and is even worse!

[image: samples from the true conditional distribution (histogram) vs. the Stan result, gamma case, with mass concentrated near the bounds]

Below we see the divergences.

[image: posterior draws with divergent transitions, mostly near the boundary]

Code: gamma distribution
data {
    real s;                        // observed value of the sum s = x + y
    real<lower=0> alpha1;          // shape of x
    real<lower=0> alpha2;          // shape of y
    real<lower=0> beta1;           // rate of x
    real<lower=0> beta2;           // rate of y
}
parameters {
    real<lower=0, upper=s> x;      // x must lie between 0 and the observed sum
}
model {
    target += gamma_lpdf(x | alpha1, beta1);       // marginal density of x
    target += gamma_lpdf(s - x | alpha2, beta2);   // density of y evaluated at s - x
}
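
For completeness, the analytical conditional distribution in the gamma case with a shared rate \beta is the standard result obtained by conditioning the joint density on the sum:

\pi(x \mid X + Y = s) \propto x^{\alpha_x - 1} (s - x)^{\alpha_y - 1}, \qquad 0 < x < s,

that is, X / s \mid X + Y = s \sim \text{Beta}(\alpha_x, \alpha_y), so samples from the true conditional can be generated by drawing Z \sim \text{Beta}(\alpha_x, \alpha_y) and setting X = s Z.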

I notice that the problem appears when the shape parameters are small; when they are not, everything works fine. Moreover, I remember that sampling from gamma or beta distributions with very small parameters is hard in general: sometimes R returns draws of exactly 0 or 1, which should not happen (presumably because the draws underflow in floating point).

I would be very thankful if someone could help me with this!


@Bob_Carpenter @betanalpha @nhuurre @jsocolar

The problem here is really nothing more than the fact that sometimes conditional distributions are more complicated than the joint distribution!

Firstly, let me note that the Stan programs here are correct, but the interpretation is a bit off. In particular, conditioning on a fixed constraint is a general probabilistic operation and is fundamentally different from Bayesian inference. Given a joint density function \pi(x_{1}, x_{2}), we condition on the constraint f(x_{1}, x_{2}) = s by restricting the ambient space to those values of x_{1} and x_{2} that satisfy the constraint. Geometrically this corresponds to “slicing” the joint density function along the set of points that satisfy the constraint; see for example Figure 17 in https://betanalpha.github.io/assets/case_studies/conditional_probability.pdf.

Mathematically the conditional density function is given, up to normalization, by partially evaluating the joint density function on the constraint,

\pi(x_{1}, x_{2} \mid f = s) \propto \pi(x_{1}, \tilde{x}_{2})

where

f(x_{1}, \tilde{x}_{2}) = s.

For example if

f(x_{1}, x_{2}) = x_{2}

then we get

\pi(x_{1}, x_{2} \mid f = s) \propto \pi(x_{1}, s)

which is more commonly written as \pi(x_{1} \mid x_{2} = s). Or in the case of this thread we have

f(x_{1}, x_{2}) = x_{1} + x_{2}

which gives

\pi(x_{1}, x_{2} \mid f = s) \propto \pi(x_{1}, s - x_{1}).
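
For instance, with the independent gamma densities from the first post this becomes

\pi(x_{1}, x_{2} \mid f = s) \propto \text{gamma}(x_{1} \mid \alpha_{1}, \beta_{1}) \, \text{gamma}(s - x_{1} \mid \alpha_{2}, \beta_{2}),

which is exactly what the two target increments in the gamma Stan program compute on the log scale.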

This is sometimes confused with the “soft” constraint approach, where people implement the constraint with a pseudo-likelihood function of the form

\mathcal{L}(x_{1}, x_{2}) = \text{normal}( f(x_{1}, x_{2}) \mid s, \sigma)

and then taking \sigma to be small. The resulting posterior will concentrate around the set f(x_{1}, x_{2}) = s but will not satisfy it exactly, and the posterior behavior as \sigma \rightarrow 0 can be problematic. Moreover as this pseudo-likelihood doesn’t arise from a statistical model this process can’t really be interpreted as an implementation of Bayes’ Theorem or learning from data.
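
To make the distinction concrete, here is a minimal sketch of such a soft-constraint program, assuming standard normal marginals for x_{1} and x_{2} purely for illustration; the data value sigma_soft plays the role of \sigma above and is not part of the original programs.

data {
    real s;                          // target value for the constraint x1 + x2 = s
    real<lower=0> sigma_soft;        // small tolerance of the soft constraint
}
parameters {
    real x1;                         // both components remain free parameters here
    real x2;
}
model {
    x1 ~ normal(0, 1);               // illustrative marginal for x1
    x2 ~ normal(0, 1);               // illustrative marginal for x2
    s ~ normal(x1 + x2, sigma_soft); // pseudo-likelihood concentrating near x1 + x2 = s
}

As sigma_soft shrinks the posterior concentrates in an ever narrower ridge around the line x_{1} + x_{2} = s, which is one reason the \sigma \rightarrow 0 limit can be problematic.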

Okay, so with this in mind we can see that @lucasmoschen’s Stan programs correctly implement the probabilistic conditioning. What, however, is going on with the gamma version?

The problem is that when the shape parameters of the gamma family are small (less than one) the density functions concentrate at zero. The product of two independent density functions then concentrates along the respective axes. Slicing along the line x_{2} = s - x_{1} for large enough s then results in a multimodal density function that concentrates at the lower and upper boundaries corresponding to each axis.
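
Indeed, with a shared rate the sliced density here is

\pi(x_{1} \mid s) \propto x_{1}^{\alpha_{x} - 1} (s - x_{1})^{\alpha_{y} - 1},

and with \alpha_{x} = \alpha_{y} = 0.2 both exponents are negative, so the density is unbounded at both x_{1} = 0 and x_{1} = s. This is just a scaled, U-shaped \text{Beta}(0.2, 0.2) with most of its mass pressed up against the two boundaries.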

This explains the shape of the conditional distribution that you’re seeing, and also why Stan is having problems. Stan’s Hamiltonian Monte Carlo sampler needs very small step sizes to resolve the rapidly changing shape near the boundaries, but the transition between the two boundaries requires much larger step sizes. Stan tries to compromise with an intermediate step size, which isn’t good enough to resolve the boundaries, resulting in those divergences. There might also be numerical problems with the transformation that Stan uses to respect the interval constraint.
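
For reference, for a parameter declared as real<lower=0, upper=s> Stan samples an unconstrained variable u and maps it back with a scaled inverse logit,

x = s \, \mathrm{logit}^{-1}(u),

adding the corresponding log Jacobian to the target density, so the two boundary spikes at x = 0 and x = s are pushed out to u = -\infty and u = +\infty on the unconstrained scale that the sampler actually explores.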

To summarize: the method proposed by @lucasmoschen is correct, although it is a general probabilistic operation and has nothing to do with Bayesian inference. The problems with the gamma densities arise from the fact that the exact conditional density function is nasty. Stan’s sampler is doing its best to explore, and producing a decent approximation, but it’s not perfect.
