Estimating compound distributions without numerical integration

This post is a follow-up on the case study by Michael Betancourt on fitting Cauchy distributions. In it he present several alternative parameterizations of the Cauchy, the first of which I want to follow up on here. Specifically, he states that it can be reparameterized as a continuous mixture (or compound) distribution:

I would really like to understand this step more. Specifically, how it is possible to sample from this integral over tau without actually evaluating the integral? Is this a general way to evaluate compound distributions in Stan?

The reason I am asking is because I want to estimate a compound distribution of a von Mises with a Gamma (i.e., simply replace the Normal with a von Mises). This is the model presented in van den Berg et al. (2012)
(equation from the supplemental materials).


I guess this does not work for the von Mises, because the von Mises is not a member of the location-scale family, but maybe there is a way.

Marginalization is an implicit consequence of Monte Carlo integration.

We use (Markov chain) Monte Carlo to approximation posterior expectations through samples,

\hat{f}{N} = \frac{1}{N} \sum{n = 1}^{N} f(x)
E_pi [ f ] = \int dx pi(x) f(x).

Now let x be two dimensional, x = (x1, x2), but the function whose expectation we’re taking depend only on x1. Then

\hat{f}{N} = \frac{1}{N} \sum{n = 1}^{N} f(x_1)
E_pi [ f ] = \int dx_1 dx_2 pi(x_1, x_2) f(x_1).

But now we can marginalize x_2 out of the exact integral to give

\hat{f}{N} = \frac{1}{N} \sum{n = 1}^{N} f(x_1)
E_pi [ f ] = \int dx_1 pi(x_1) f(x_1).

In other words, if you run a Markov chain over the joint space (x_1, x_2) and then use samples from only one component to estimate expectations then we are implicitly marginalizing over the other component! This means that you can compute convolutions like the scale mixture in the case study by building a Stan program for the joint model and let the integration happen implicitly.

1 Like

Thanks for your response. I tried to format it more nicely to understand it easier. I hope this is what you intended to say. I will have to think a little bit about what this means for my case (if I have more questions, I will formulate them more clearly).

The β€œn = 1” and β€œN” are subscripts and superscripts for the summation, respectively. It may help to write out the marginalization explicitly like

= \int dx1 dx2 pi(x1, x2) f(x1)
= \int dx1 [ \int dx2 pi(x1, x2) ] f(x1)
= \int dx1 [ pi(x1) ] f(x1)
= \int dx1 pi(x1) f(x1)

Thanks, I think I got it now. The thing that was difficult to understand for me, was the idea of approximating the integral you are interested in when simultaneously sampling from the posterior. In other words, the step of also only approximating the integral one is interested in in the same manner one approximates the posterior.

One more thing. I also discussed this issue with Quentin Gronau (@Quentin) and he send me the following code suggesting that maybe the alternative implementation to the Cauchy requires s^2 and instead of s. This becomes apparent when changing s from 1 to e.g., .1

n <- 1e5
m <- 0.3
s <- 1

xa <- rnorm(n)

# 1
xb <- rgamma(n, 1/2, s/2)
x <- m + xa / sqrt(xb)
hist(x[abs(x) < m + 15*s], probability = TRUE, breaks = 50)
plot(function(x) dcauchy(x, m, s), add = TRUE, xlim = c(m - 15*s, m + 15*s))

# 2
xb2 <- rgamma(n, 1/2, (s^2)/2)
x2 <- m + xa / sqrt(xb2)
hist(x2[abs(x2) < m + 15*s], probability = TRUE, breaks = 50)
plot(function(x) dcauchy(x, m, s), add = TRUE, xlim = c(m - 15*s, m + 15*s))

Yup, that was a typo on my end. Thanks for pointing it out.

I’m updating the case study now.