Need help with the SBC uniformity proof

This is from Appendix B of the SBC paper.
I am a little confused about the following; could anyone please explain in more detail?

  1. the change-of-variables part: should it be p or P?
  2. could P(\tilde{f}(p)) = p be explained more mathematically or intuitively?

According to the paper, \tilde{f} is the evaluation of the random variable at the prior sample, with f_n = f(\theta_n) the evaluation of the random variable at the posterior samples.


2 Likes

If @betanalpha can’t explain this then nobody can :)

@hyunji.moon The result here is just the fact that the quantile function is the inverse of the cumulative distribution function and vice versa. It’s also why the null model for the ranks is in general uniform. Really the only insight in SBC is that the average posterior is equal to the prior, so that the quantile function of one is the inverse of the cumulative distribution function of the other.
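
A quick numerical illustration of that inverse relationship (a sketch in Python, not from the paper, using Exponential(1) because its CDF and quantile function have closed forms):

import numpy as np

# Sketch: the CDF maps draws from a distribution to Uniform(0, 1) draws, and the
# quantile function maps uniform draws back; here F(x) = 1 - exp(-x) for Exponential(1).
rng = np.random.default_rng(1)

theta = rng.exponential(size=100_000)
u = 1 - np.exp(-theta)                      # CDF applied to the draws
print(np.quantile(u, [0.25, 0.5, 0.75]))    # ~ [0.25, 0.5, 0.75], i.e. uniform

u2 = rng.uniform(size=100_000)
theta2 = -np.log(1 - u2)                    # quantile function applied to uniform draws
print(theta2.mean(), theta2.var())          # ~ 1 and ~ 1, the Exponential(1) moments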

Note also that technically the proof doesn’t actually cover SBC as we describe it in the paper. The proof is valid if the prior sample and posterior samples are generated independently, but the procedure we recommend draws the posterior samples conditioned on the prior sample. Empirically it still works, but the proof is much more complicated and few have spent much time trying to extend it (which probably involves using the exchangeability of the posterior samples).

I had to stop and think about this for a minute because you don’t recommend using the prior sample \tilde{\theta} to inform the posterior samples \theta_{i}, right? The theoretical dependency is indirect, only through the posterior being conditional on the simulated data \tilde{y} which in turn is conditional on \tilde{\theta}?
I think such indirect dependency becomes irrelevant if you think in terms of the joint distribution \pi\left(\theta,y\right) without worrying about splitting it into the prior and likelihood. It doesn’t seem like the general proof should be complicated.
Unless I’m missing something you can just argue that because

\pi\left(\theta\right)\pi\left(y\middle|\theta\right)=\pi\left(y\right)\pi\left(\theta\middle|y\right)=\pi\left(\theta,y\right)

then the SBC algorithm

  1. draw prior sample \tilde{\theta}\sim\pi\left(\theta\right)
  2. draw simulated data set \tilde{y}\sim\pi\left(y\middle|\tilde{\theta}\right)
  3. draw posterior samples \theta_{1},\dots,\theta_{L}\sim\pi\left(\theta\middle|\tilde{y}\right)

samples from the same joint distribution \left(\tilde{\theta},\tilde{y},\theta_{1},\dots,\theta_{L}\right)\sim\pi\left(\tilde{\theta},\tilde{y}\right)\prod_{l=1}^{L}\pi\left(\theta_{l}\middle|\tilde{y}\right) as

  1. draw prior predictive sample \tilde{y}\sim\pi\left(y\right)
  2. draw posterior sample \tilde{\theta}\sim\pi\left(\theta\middle|\tilde{y}\right)
  3. draw posterior samples \theta_{1},\dots,\theta_{L}\sim\pi\left(\theta\middle|\tilde{y}\right)

where it’s now obvious that \tilde{\theta} is exchangeable with each of \theta_{1},\dots,\theta_{L} and the rank statistic must be uniformly distributed.
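
As a quick sanity check (my own sketch with a conjugate Beta-Bernoulli model and a single observation, chosen only because the posterior has a closed form), the two orderings above do produce the same joint distribution for (\tilde{\theta}, \theta_{1}):

import numpy as np

# Sketch: compare ordering 1 (prior -> data -> posterior) with ordering 2
# (prior predictive -> two independent posterior draws) for theta ~ Beta(1, 1),
# y | theta ~ Bernoulli(theta), so that pi(theta | y) = Beta(1 + y, 2 - y).
rng = np.random.default_rng(0)
n = 200_000

# Ordering 1
t1 = rng.beta(1, 1, size=n)
y1 = rng.binomial(1, t1)
p1 = rng.beta(1 + y1, 2 - y1)

# Ordering 2
y2 = rng.binomial(1, 0.5, size=n)    # prior predictive of a single Bernoulli observation
t2 = rng.beta(1 + y2, 2 - y2)
p2 = rng.beta(1 + y2, 2 - y2)

# Same correlation and same P(theta_tilde < theta_1) up to Monte Carlo error
print(np.corrcoef(t1, p1)[0, 1], np.corrcoef(t2, p2)[0, 1])
print(np.mean(t1 < p1), np.mean(t2 < p2))    # both ~ 0.5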

3 Likes

Unfortunately that’s not sufficient. In SBC the expectations that we’re computing are actually with respect to the joint distribution

\pi(\theta', y, \theta) = \pi(\theta' \mid y) \pi(y \mid \theta) \pi(\theta).

As you note, the influence of \theta on \theta' is only indirect, through the intermediate data y, but that indirect influence is still enough to prevent the nested integrals from factoring in a clean way and allowing a straightforward calculation.

Another way of stating it is that the implicit function whose expectation we’re taking to get the ranks depends on both \theta and \theta' and hence the full conditional structure of \pi(\theta', y, \theta).
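
For concreteness (my paraphrase, using the notation that appears later in this thread), the quantity whose distribution we are after is the rank

r(\tilde{\theta}, \theta_{1}, \dots, \theta_{L}) = \sum_{l=1}^{L} \mathbb{I}[\theta_{l} < \tilde{\theta}],

which is a function of both the prior sample and the posterior samples, so its expectation has to be taken with respect to the full joint \pi(\theta', y, \theta) above.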

1 Like

Here’s more detail in case anyone has suggestions: sbc_calcs.pdf (99.0 KB)

3 Likes

Side note, I hate the conventional notation that overloads the symbols so that every distribution is called \pi\left(\cdot\right). Let’s go with probability density functions \pi_{\mathrm{prior}}\left(\theta\right), \pi_{\mathrm{sample}}\left(y|\theta\right) and \pi_{\mathrm{post}}\left(\theta|y\right).

The equality f\left(u\right)=u does not hold even in the most elementary example.
Suppose \theta\sim\mathit{Uniform}\left(0,1\right) and y\sim\mathit{Bernoulli}\left(\theta\right). The density functions are

\pi_{\mathrm{prior}}\left(\theta\right)=1 \\ \pi_{\mathrm{sample}}\left(y|\theta\right)=\begin{cases} 1-\theta & y=0\\ \theta & y=1 \end{cases} \\ \pi_{\mathrm{post}}\left(\theta|y\right)=\begin{cases} 2-2\theta & y=0\\ 2\theta & y=1 \end{cases}

The prior quantile function \tilde{\theta}_{0}\left(u\right)=u is simply the identity and the y integral becomes a sum so that

\begin{equation} \begin{split} f\left(u\right) & = \int_{-\infty}^{\tilde{\theta}_{0}\left(u\right)}\mathrm{d}\theta\,\sum_{y}\,\pi_{\mathrm{post}}\left(\theta|y\right)\pi_{\mathrm{sample}}\left(y|\tilde{\theta}_{0}\left(u\right)\right)\\ & = \int_{0}^{u}\mathrm{d}\theta\,\left(\pi_{\mathrm{post}}\left(\theta|0\right)\pi_{\mathrm{sample}}\left(0|u\right)+\pi_{\mathrm{post}}\left(\theta|1\right)\pi_{\mathrm{sample}}\left(1|u\right)\right)\\ & = \int_{0}^{u}\mathrm{d}\theta\,\left(\left(2-2\theta\right)\left(1-u\right)+2\theta u\right) \\ & = \left(2u-u^{2}\right)\left(1-u\right)+u^{2}u\\ & \neq u \end{split} \end{equation}
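
Here’s a quick Monte Carlo check of that closed form (my own sketch; the posterior for this Uniform-Bernoulli model is Beta(1 + y, 2 - y)):

import numpy as np

# Sketch: estimate f(u) = P(theta' <= u) when y ~ Bernoulli(u) and theta' ~ pi(theta | y),
# and compare against the closed form (2u - u^2)(1 - u) + u^3 derived above.
rng = np.random.default_rng(0)

for u in (0.25, 0.5, 0.75):
    y = rng.binomial(1, u, size=200_000)
    theta_post = rng.beta(1 + y, 2 - y)          # conjugate posterior given one observation
    mc = np.mean(theta_post <= u)
    closed = (2 * u - u**2) * (1 - u) + u**3
    print(u, round(mc, 3), round(closed, 3))     # f(u) != u except at u = 0.5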

Regardless, I don’t see why the full distribution wouldn’t factor in a clean way. The obstruction arises when integrating out \tilde{y}; but what if you condition on \tilde{y} instead?

The full joint density is

\pi_{\mathrm{prior}}\left(\tilde{\theta}_{0}\right)\pi_{\mathrm{sample}}\left(\tilde{y}|\tilde{\theta}_{0}\right)\prod_{l}\pi_{\mathrm{post}}\left(\theta_{l}|\tilde{y}\right)

and Bayes’ theorem tells us that

\pi_{\mathrm{prior}}\left(\tilde{\theta}_{0}\right)\pi_{\mathrm{sample}}\left(\tilde{y}|\tilde{\theta}_{0}\right)=\pi_{\mathrm{post}}\left(\tilde{\theta}_{0}|\tilde{y}\right)L\left(\tilde{y}\right)

where

L\left(y\right)=\int\mathrm{d}\theta\,\pi_{\mathrm{prior}}\left(\theta\right)\pi_{\mathrm{sample}}\left(y|\theta\right)

so that the joint density is in fact equal to

L\left(\tilde{y}\right)\pi_{\mathrm{post}}\left(\tilde{\theta}_{0}|\tilde{y}\right)\prod_{l}\pi_{\mathrm{post}}\left(\theta_{l}|\tilde{y}\right)

and conditional on \tilde{y} all thetas are IID.
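
A small empirical check of that last statement (my sketch, again with the Uniform-Bernoulli example): conditioning on a fixed \tilde{y}, the rank of \tilde{\theta}_{0} among L iid posterior draws is uniform.

import numpy as np

# Sketch: condition on y_tilde = 1, so every theta is an iid draw from Beta(2, 1),
# and check that the rank of theta_0 among theta_1, ..., theta_L is uniform on {0, ..., L}.
rng = np.random.default_rng(0)
L, n = 9, 100_000
a, b = 2, 1                                      # posterior Beta(1 + y, 2 - y) with y = 1
theta_0 = rng.beta(a, b, size=n)
theta_l = rng.beta(a, b, size=(n, L))
ranks = (theta_l < theta_0[:, None]).sum(axis=1)
print(np.bincount(ranks, minlength=L + 1) / n)   # each entry ~ 1 / (L + 1) = 0.1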

1 Like

Again, manipulating the joint density isn’t sufficient without considering the appropriate expectation. In particular, because the order statistic conditions on a prior sample, one has to be careful to respect that conditioning; otherwise the integrals remain intractable.

In looking back on this again I realized that I was trying to prove the wrong thing. The marginal conditional

\pi(\theta' \mid \theta) = \int \mathrm{d} y \, \pi(\theta', y \mid \theta)

corresponds to simulating a single model configuration from the prior, then multiple observations, and then multiple posterior samples before computing the rank, which is not how we construct the SBC algorithm in the paper. What we actually do is simulate a single model configuration and then a single observation before generating multiple posterior samples and then computing the rank. This subtle difference implies that the probabilities in the order statistic are actually conditional on the data as well,

\pi(\theta' \mid \theta, y)

with an extra integral over y on the outside.

This conditional structure of the joint then allows us to simplify \pi(\theta' \mid \theta, y) = \pi(\theta' \mid y), from which the proof proceeds basically as intended.

I’m attaching what I believe is a proper proof for anyone interested: sbc_calcs_proper.pdf (96.8 KB)

6 Likes

If anyone has the time, I’m super curious to see the result of an empirical study where the data are simulated multiple times (and only one posterior sample drawn from each posterior),
i.e.

\begin{align*} \tilde{\theta} &\sim \pi(\theta) \\ \tilde{y}_{n} &\sim \pi(y \mid \tilde{\theta}) \\ \tilde{\theta}'_{n} &\sim \pi(\theta \mid \tilde{y}_{n}) \\ r &= \sharp [ \tilde{\theta} < \tilde{\theta}'_{n} ] \end{align*}

Are the ranks uniform, implying that there should be some way to get the old proof to work, or are they in fact non-uniform?

1 Like

# One prior draw, ten simulated observations, one posterior draw per observation;
# the rank is the number of posterior draws that exceed the prior draw.
from numpy import arange
from numpy.random import beta, binomial
from matplotlib.pyplot import hist, show

r = []
for _ in range(10_000):
  t = beta(1, 1)                  # prior draw, theta ~ Uniform(0, 1)
  y = binomial(1, t, size=10)     # ten independent Bernoulli(theta) observations
  tt = beta(1 + y, 2 - y)         # one conjugate posterior draw per observation
  r.append(sum(t < tt))           # count of posterior draws above the prior draw

hist(r, bins=arange(12) - 0.5)    # ranks 0 through 10
show()

[sbc: histogram of the ranks produced by the code above]

4 Likes

Perfect. I spent some time after dinner to put together an R demonstration for anyone interested.

c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

m <- 0
s <- 0.5
s2 <- s * s

sigma <- 1
sigma2 <- sigma * sigma

N <- 100000 # Number of replications
L <- 25     # Number of posterior samples
B <- 25     # Number of rank bins

# Correct SBC
ranks <- rep(0, N)
for (n in 1:N) {
  theta <- rnorm(1, m, s)
  y <- rnorm(1, theta, sigma)
  post_m <- (y * s2 + m * sigma2) / (s2 + sigma2)
  post_s <- sqrt( s2 * sigma2 / (s2 + sigma2) )
  post_theta <- rnorm(L, post_m, post_s)
  ranks[n] <- sum(post_theta < theta)
}

sbc_hist <- hist(ranks, breaks=seq(0, L + 1, 1) - 0.5, plot=FALSE)
plot(sbc_hist, main="SBC (Correct)", xlab="Prior Rank", yaxt='n', ylab="")

low <- qbinom(0.005, N, 1 / (L + 1))
mid <- qbinom(0.5, N, 1 / (L + 1))
high <- qbinom(0.995, N, 1 / (L + 1))
bar_x <- c(-1, L + 1, L + 0.5, L + 1, -1, -0.5, -1)
bar_y <- c(high, high, mid, low, low, mid, high)

polygon(bar_x, bar_y, col=c("#DDDDDD"), border=NA)
segments(x0=-0.5, x1=(L + 0.5), y0=mid, y1=mid, col=c("#999999"), lwd=2)

plot(sbc_hist, col=c_dark, border=c_dark_highlight, add=T)

# Incorrect SBC
ranks <- rep(0, N)
for (n in 1:N) {
  theta <- rnorm(1, m, s)
  y <- rnorm(L, theta, sigma)
  post_m <- (y * s2 + m * sigma2) / (s2 + sigma2)
  post_s <- sqrt( s2 * sigma2 / (s2 + sigma2) )
  post_theta <- rnorm(L, post_m, post_s)
  ranks[n] <- sum(post_theta < theta)
}

sbc_hist <- hist(ranks, breaks=seq(0, L + 1, 1) - 0.5, plot=FALSE)
plot(sbc_hist, main="SBC (Incorrect)", xlab="Prior Rank", yaxt='n', ylab="")

low <- qbinom(0.005, N, 1 / (L + 1))
mid <- qbinom(0.5, N, 1 / (L + 1))
high <- qbinom(0.995, N, 1 / (L + 1))
bar_x <- c(-1, L + 1, L + 0.5, L + 1, -1, -0.5, -1)
bar_y <- c(high, high, mid, low, low, mid, high)

polygon(bar_x, bar_y, col=c("#DDDDDD"), border=NA)
segments(x0=-0.5, x1=(L + 0.5), y0=mid, y1=mid, col=c("#999999"), lwd=2)

plot(sbc_hist, col=c_dark, border=c_dark_highlight, add=T)

3 Likes

I had some questions on symmetry and integrating out y.

  1. Is there an intuitive explanation of how the integration over y happens during SBC? From the proof above, I understand the role of the y integration mathematically, as below. However, since y is fixed during the ranking, it is hard to see how the variability of y gets folded into the integration. I thought a verbal explanation of u(y), or an extension of @nhuurre 's L(y) here, could help, since it is the result of marginalizing \theta_{0} out.

probability of the rank, \pi(r):

\begin{align*} \pi(r) &= \frac{L!}{r!(L-r)!} \int \mathrm{d}y \, \pi(y) \int \mathrm{d}\theta_{0} \, \pi\left(\theta_{0} \mid y\right) \left[\int_{-\infty}^{\theta_{0}} \mathrm{d}\theta_{l} \, \pi\left(\theta_{l} \mid y\right)\right]^{r} \left[1 - \int_{-\infty}^{\theta_{0}} \mathrm{d}\theta_{l} \, \pi\left(\theta_{l} \mid y\right)\right]^{L-r} \\ &= \frac{L!}{r!(L-r)!} \int \mathrm{d}y \, \pi(y) \int_{0}^{1} \mathrm{d}u \, u^{r} \left(1 - u\right)^{L-r} \\ &= \frac{1}{L+1} \int \mathrm{d}y \, \pi(y) = \frac{1}{L+1} \end{align*}

where u(y) = \int_{-\infty}^{\theta_{0}} \mathrm{d}\theta \, \pi(\theta \mid y), and the last step uses the Beta-function integral \int_{0}^{1} \mathrm{d}u \, u^{r}(1-u)^{L-r} = \frac{r!\,(L-r)!}{(L+1)!}.

  2. If we marginalize y out of the joint \pi(\theta', y, \theta), is \pi(\theta', \theta) symmetric in \theta' and \theta? Without the connecting mechanism y this might not hold; if so, how can I recover the symmetry in mathematical form? This is also alluded to in Section 3 of the paper and in the background section of @andrewgelman 's blog.

My tentative assumption is that if (C) holds for every distribution g(\theta) of \theta, we could reach the conclusion (A) through (B). From empirical experiments, we know that (C) holds for reasonable distributions (e.g. informative priors).

\pi(\theta', \theta) is symmetric - (A)
\pi(\theta' \mid \theta) = \int \mathrm{d}y \; \pi(\theta' \mid y) \, \pi(y \mid \theta) is centered on \theta - (B)
g(\theta') = \int \mathrm{d}\theta \, \pi(\theta' \mid \theta) \, g(\theta) - (C)

I especially wish to double-check equation (B): as @betanalpha mentions below, that is not what is happening in the empirical SBC procedure. However, what is actually happening might not matter as long as recovery of the original distribution is guaranteed by the symmetry argument.

2 Likes

Without commenting in detail on the questions below, let me just say that I think some of the most interesting research questions in SBC have to do with the practical version where we don’t actually sample many draws of theta from the prior and y from the data model (i.e., sample (theta, y) from the joint distribution), but rather pick one or two reasonable values of theta and then sample one dataset y for each of these sampled thetas. Or, in a hierarchical model in which theta = (phi, alpha), where phi are hyperparameters and alpha are intermediate parameters, maybe we pick one or two values of phi and then, for each, take one draw of alpha | phi and then one draw of y | alpha, phi. Here we’re not averaging over the joint distribution; we’re relying on some sort of internal replication in the model. The point is that we catch problems using this strategy all the time, so it would be good to better understand the conditions under which it is useful.

3 Likes

Recall that sampling is just a way to approximate expectation/integration. Anytime you have a method that samples from a distribution you are approximating an expectation with respect to that distribution. In SBC we utilize a sample from the full Bayesian model

\tilde{\theta}, \tilde{y} \sim \pi(\theta, y)

which is typically generated through ancestral sampling of the conditional decomposition \pi(\theta, y) = \pi(y \mid \theta) \, \pi(\theta),

\tilde{\theta} \sim \pi(\theta) \\ \tilde{y} \sim \pi(y \mid \tilde{\theta}).

That latter step is what “implements” the integration over y.
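
A minimal sketch of that point (my own example, reusing the Beta-Bernoulli model from earlier in the thread): ancestral sampling gives draws from the full joint, so a simple average over those draws approximates the nested integral over y.

import numpy as np

# Sketch: theta ~ pi(theta), y ~ pi(y | theta), theta' ~ pi(theta' | y); averaging an
# indicator over these draws approximates an expectation that formally integrates over y.
rng = np.random.default_rng(0)
n = 500_000
theta = rng.beta(1, 1, size=n)          # theta_tilde ~ pi(theta)
y = rng.binomial(1, theta)              # y_tilde ~ pi(y | theta_tilde)
theta_p = rng.beta(1 + y, 2 - y)        # theta' ~ pi(theta' | y_tilde)
print(np.mean(theta_p < theta))         # ~ 0.5, as the self-consistency condition implies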

It’s all a consequence of the particular conditional dependence structure of the joint distribution. In particular we can use Bayes’ Theorem to write the joint density \pi(\theta', \theta) as

\begin{align*} \pi(\theta', \theta) &= \int \mathrm{d} y \, \pi(\theta' \mid y) \, \pi(y, \theta) \\ &= \int \mathrm{d} y \, \pi(\theta' \mid y) \, \pi(y \mid \theta) \, \pi(\theta) \\ &= \pi(\theta) \int \mathrm{d} y \, \pi(\theta' \mid y) \, \pi(y \mid \theta) \end{align*}

or

\begin{align*} \pi(\theta', \theta) &= \int \mathrm{d} y \, \pi(\theta' \mid y) \, \pi(y, \theta) \\ &= \int \mathrm{d} y \, \frac{ \pi(y \mid \theta') \, \pi(\theta') }{ \pi(y) } \, \pi(\theta \mid y) \, \pi(y) \\ &= \int \mathrm{d} y \, \pi(y \mid \theta') \, \pi(\theta') \, \pi(\theta \mid y) \\ &= \pi(\theta') \int \mathrm{d} y \, \pi(\theta \mid y) \, \pi(y \mid \theta'). \end{align*}

Because \pi(\theta') = \pi(\theta) (that’s the self-consistency condition on which SBC is based), the second decomposition is just the first one with \theta and \theta' swapped, which gives

\begin{align*} \pi(\theta', \theta) &= \pi(\theta) \int \mathrm{d} y \, \pi(\theta' \mid y) \, \pi(y \mid \theta) \\ &= \pi(\theta') \int \mathrm{d} y \, \pi(\theta \mid y) \, \pi(y \mid \theta') \\ &= \pi(\theta, \theta'). \end{align*}
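
Here’s a small empirical check of that symmetry (my sketch, reusing the conjugate normal model from the R demonstration above): for an asymmetric test function f, the two expectations E[f(\theta', \theta)] and E[f(\theta, \theta')] should agree up to Monte Carlo error.

import numpy as np

# Sketch: theta ~ N(m, s^2), y ~ N(theta, sigma^2), theta' ~ pi(theta | y) (conjugate),
# then compare E[theta' * exp(theta)] with E[theta * exp(theta')].
rng = np.random.default_rng(0)
n = 500_000
m, s, sigma = 0.0, 0.5, 1.0
theta = rng.normal(m, s, size=n)
y = rng.normal(theta, sigma)
post_m = (y * s**2 + m * sigma**2) / (s**2 + sigma**2)
post_s = np.sqrt(s**2 * sigma**2 / (s**2 + sigma**2))
theta_p = rng.normal(post_m, post_s)
print(np.mean(theta_p * np.exp(theta)), np.mean(theta * np.exp(theta_p)))   # ~ equal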

I don’t understand what you’re trying to get at here. (C) can’t be true because it’s not even a well-defined expectation value.

1 Like

Would there be any way to tell whether such a flag is real or a false positive? From the following quote from the Bayesian workflow paper, I gather that instances of a truth-point benchmark flagging a problem that is not actually there may be improbable but are still possible.

“simulation-based calibration and truth-point benchmarking are two ends of a spectrum. Roughly, a single truth-point benchmark will possibly flag gross problems, but it does not guarantee anything.”