How can we explain why SBC works?

We (Daniel Schad, Bruno Nicenboim & Shravan Vasishth) have a manuscript on arXiv ([2203.02361] Data aggregation can lead to biased inferences in Bayesian linear mixed models and Bayesian ANOVA: A simulation study) that uses SBC for Bayes factors (Schad et al., 2022, Psychological Methods) to show that data aggregation can lead to biased inferences in Bayesian LMMs/ANOVA. In the manuscript, we provide a brief intro to SBC (see below).

A reviewer commented on this: "I think that the statement on L248 is unclear (on why Equation 6 shows that the distributions of M and M' are identical). In the first paper version, the explanation was similar to that in Talts, Betancourt, Simpson, Vehtari, and Gelman (2018). But also there it is not said why this works, only that it does. Actually, the explanation provided in Cook, Gelman, and Rubin (2006) is the clearest I could find. Could you just add a few more words that completely explain this (to people for whom the SBC is new, like me)?"

One issue here is that a recent manuscript (Modrak et al., 2022) suggested that the “justification of SBC by Talts et al. (2018)” was not correct, and proposed a different basis for SBC (Modrak et al., 2022; see below for our adaptation of this to SBC for Bayes factors). On the other hand, we understood Michael Betancourt (personal communication) as suggesting that the original formulation is not incorrect.

Moreover, we don’t fully understand why the math provides a foundation for SBC (i.e., why equation 6 below shows that the distributions of M and M’ are identical).

Thus, overall, we are unsure about how to respond to this issue; i.e., how to deal with the different suggestions for foundations of SBC (Talts et al., 2018 vs. Modrak et al., 2022), and, importantly, how to explain why this works?

Any help with this would be highly welcome!

Thanks much

First, we write down the joint distribution of prior and posterior model probabilities as well as the data [see @modrak2022simulation for posterior parameter inference]:

p(y, \mathcal{M}', \mathcal{M}) = p(\mathcal{M}' \mid y) p(y \mid \mathcal{M}) p(\mathcal{M}) Equation (5)

Moreover, we have stated above that p(\mathcal{M} \mid y) = \frac{p(y \mid \mathcal{M}) \times p(\mathcal{M})}{p(y)}, which can be reformulated as p(\mathcal{M} \mid y) p(y) = p(y \mid \mathcal{M}) \times p(\mathcal{M}). This implies that

p(y, \mathcal{M}', \mathcal{M}) = p(\mathcal{M}' \mid y) p(\mathcal{M} \mid y) p(y) Equation (6)

Equation 6 shows that, given a specific data set y, the distributions of \mathcal{M} and \mathcal{M}' are identical. That is, in SBC, if the simulation of the data and the estimation of posterior model probabilities (i.e., of Bayes factors) are accurate, i.e., without bias, then the prior model probabilities should, on average across simulated data sets, match the posterior model probabilities. If the average posterior model probabilities are too large/small, this indicates that the Bayes factors are estimated as too large/small, exhibiting a liberal/conservative bias.
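To make this check concrete, here is a minimal sketch (my own toy setup, not the manuscript's models) with a pair of models whose marginal likelihoods are available in closed form: \mathcal{M}_1 fixes \mu = 0 and \mathcal{M}_2 puts \mu \sim N(0, 1), with y_i \sim N(\mu, 1). Averaging the posterior probability of \mathcal{M}_2 over data sets simulated from the prior should recover its prior probability of 0.5; a systematic deviation would indicate biased Bayes factor estimates:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, n_sims = 10, 2000

def post_prob_m2(y):
    # marginal likelihood under M1 (mu fixed at 0): y ~ N(0, I)
    ml1 = multivariate_normal.pdf(y, mean=np.zeros(n), cov=np.eye(n))
    # marginal likelihood under M2 (mu ~ N(0, 1) integrated out): y ~ N(0, I + J)
    ml2 = multivariate_normal.pdf(y, mean=np.zeros(n), cov=np.eye(n) + np.ones((n, n)))
    return ml2 / (ml1 + ml2)  # equal prior model probabilities of 0.5

probs = []
for _ in range(n_sims):
    m2 = rng.integers(2)               # draw the model index from its prior
    mu = rng.normal() if m2 else 0.0   # draw mu under the chosen model
    y = rng.normal(mu, 1.0, size=n)    # simulate a data set
    probs.append(post_prob_m2(y))

avg = float(np.mean(probs))
print(avg)  # should be close to the prior model probability of 0.5
```

In an exact computation like this the average matches the prior probability up to Monte Carlo error; with estimated Bayes factors, a deviation would flag estimation bias.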

@bnicenboim, @vasishth , @martinmodrak, @paul.buerkner, @betanalpha, @seantalts


I agree with the reviewer that the Cook et al. paper is the clearest for this.

I’m going to put this back in BDA notation so it’s easier for me.

Suppose we have a joint model p(\theta, y) = p(\theta) \cdot p(y \mid \theta) with parameters \theta \in \mathbb{R}^D and data y (of any shape). The chain rule gives us the first two of the following results; the rest follow from them.

  1. If \theta^{\textrm{sim}} \sim p(\theta) and y^{\textrm{sim}} \sim p(y \mid \theta^{\textrm{sim}}), then (\theta^{\textrm{sim}}, y^{\textrm{sim}}) \sim p(\theta, y).

  2. If \theta^{(m)} \sim p(\theta \mid y^{\textrm{sim}}) for m \in 1:M, then (\theta^{(m)}, y^{\textrm{sim}}) \sim p(\theta, y) for m \in 1:M.

  3. From (1) and (2), it follows that the \theta^{(m)} and \theta^{\textrm{sim}} are identically distributed according to p(\theta \mid y^{\textrm{sim}}).

  4. From (3), it follows that \theta^{(m)}_d and \theta^{\textrm{sim}}_d are identically distributed for all dimensions d \in 1:D.

  5. From (4), it follows that the rank of each \theta^{\textrm{sim}}_d among the \theta^{(m)}_d should be uniformly distributed.

This assumes that the draws are all i.i.d. What Talts et al. found is that autocorrelation among the \theta^{(m)} disrupts (5), which is why they recommend thinning to roughly independent draws. If the sampler producing \theta^{(m)} \sim p(\theta \mid y^{\textrm{sim}}) is broken, in the sense that it is not actually taking draws from this distribution, then the tests will fail.
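Steps (1) through (5) can be sketched numerically (my own toy example, not from any of the cited papers) with a conjugate normal-normal model, where exact i.i.d. posterior draws are available and the ranks come out uniform by construction:

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs, M, n_sims = 5, 99, 1000
ranks = []

for _ in range(n_sims):
    theta_sim = rng.normal(0.0, 1.0)                # theta^sim ~ p(theta) = N(0, 1)
    y_sim = rng.normal(theta_sim, 1.0, size=n_obs)  # y^sim ~ p(y | theta^sim)
    # exact conjugate posterior: theta | y ~ N(n * ybar / (n + 1), 1 / (n + 1))
    post_mean = n_obs * y_sim.mean() / (n_obs + 1)
    post_sd = np.sqrt(1.0 / (n_obs + 1))
    draws = rng.normal(post_mean, post_sd, size=M)  # M independent theta^(m)
    ranks.append(int(np.sum(draws < theta_sim)))    # rank in 0..M

rank_mean = float(np.mean(ranks))
print(rank_mean)  # roughly M / 2 = 49.5 if everything is calibrated
```

Replacing the exact posterior draws with a deliberately broken sampler (e.g., halving post_sd) skews the rank histogram toward the extremes, which is exactly what the SBC plots detect.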


Talts et al. is not “incorrect” in any strict sense - the preprint contains true statements about Bayesian models and their validation. What is potentially misleading is that the statements appear to form a coherent line of reasoning from first principles to SBC, which they do not. Specifically, Talts et al. first invoke the data-averaged posterior (their Eq. 1), then in Section 3 correctly claim that the method of Geweke (2004) uses the data-averaged posterior, then correctly claim that Cook, Gelman and Rubin (2006, CGR) differs from Geweke’s method and that SBC remedies some problems with the CGR method, and then prove that SBC works. A reader may thus be led to believe that CGR (and by extension SBC) also relies on the data-averaged posterior, despite using a different check to validate the criterion. However, the criterion is in fact different, as we show in our SBC preprint with specific examples (this is mildly alluded to by CGR as well, but I don’t think they are very explicit about the differences).

I remember being somewhat puzzled about that myself, but I think this is because the reasoning is just so straightforward that I had been trying to see something deeper behind it. I think the confusion disappears if (analogously to the notation in our SBC preprint) we do not overload the same symbol for the prior, posterior, and marginal densities. If we do that, we get:

p_{\text{SBC}}(y, \mathcal{M}', \mathcal{M}) = p_{\text{post}}(\mathcal{M}' \mid y) p_{\text{post}}(\mathcal{M} \mid y) p_\text{marg}(y)

Now it is clear that, to calculate the density of both \mathcal{M} and \mathcal{M}' conditional on y, we use the exact same computation - and two variables that have the same density at all points are identically distributed (since the densities are taken w.r.t. the same measure). So what I think needs emphasizing in your description is that the identity is conditional on y. SBC however cannot check this identity directly; it only checks that the conditional identity holds on average (over y).

It is also true that the distributions are identical marginally (i.e., after integrating over y), but that leads to the data-averaged posterior and the checks of Geweke (2004) and others. Marginal identity is implied by conditional identity for all y, but the reverse implication does not hold. However, since no practical method can directly check the equality of distributions, the relationship between SBC and checks based on the data-averaged posterior is a bit more complex: for a fixed, finite set of test quantities, passing SBC does not imply that the data-averaged posterior is also correct.

So one way to see the difference is that the marginal identity (“data-averaged posterior equals prior”) is a weak property which can however be checked quite directly. The conditional identity is a strong property, but it cannot be checked directly - SBC is only an approximate/weak check of this stronger property.
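The marginal identity can indeed be checked quite directly in simulation: pooling one posterior draw per simulated data set should recover the prior. A toy sketch (my own illustrative conjugate normal-normal setup, not from the preprints) of this Geweke-style data-averaged check:

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_sims = 5, 5000
averaged = []

for _ in range(n_sims):
    theta = rng.normal(0.0, 1.0)            # theta ~ prior N(0, 1)
    y = rng.normal(theta, 1.0, size=n_obs)  # y ~ p(y | theta)
    # exact conjugate posterior: theta | y ~ N(n * ybar / (n + 1), 1 / (n + 1))
    post_mean = n_obs * y.mean() / (n_obs + 1)
    post_sd = np.sqrt(1.0 / (n_obs + 1))
    averaged.append(rng.normal(post_mean, post_sd))  # one draw from p(theta | y)

averaged = np.array(averaged)
m, v = float(averaged.mean()), float(averaged.var())
print(m, v)  # pooled draws should look like the prior: mean near 0, variance near 1
```

Note this pooled check can pass even when the sampler is conditionally miscalibrated, which is why the conditional identity that SBC targets is the stronger property.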

Does that resolve the confusion?

I’ll also note that we recently updated our preprint, although the changes don’t really matter for the discussion.


I think the easiest explanation of SBC is that, given a simulated y, the theta that generated y and the posterior draws of theta are IID copies from p(theta | y), which enables any hypothesis-testing strategy you would like, such as a histogram-based comparison or classification.
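One way to operationalize the "IID copies" view as a classification check is a classifier two-sample test: if a classifier cannot distinguish (theta_generating, y) pairs from (theta_posterior_draw, y) pairs, the two copies are exchangeable. A hypothetical sketch (my own conjugate normal setup; sklearn's LogisticRegression stands in for any classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n_obs, n_sims = 5, 2000
X, labels = [], []

for _ in range(n_sims):
    theta_sim = rng.normal(0.0, 1.0)             # theta that generated the data
    y = rng.normal(theta_sim, 1.0, size=n_obs)
    post_mean = n_obs * y.mean() / (n_obs + 1)   # exact conjugate posterior
    post_sd = np.sqrt(1.0 / (n_obs + 1))
    theta_draw = rng.normal(post_mean, post_sd)  # one exact posterior draw
    # label which copy each (theta, summary-of-y) pair came from
    X.append([theta_sim, y.mean()]); labels.append(0)
    X.append([theta_draw, y.mean()]); labels.append(1)

clf = LogisticRegression()
acc = float(cross_val_score(clf, np.array(X), np.array(labels), cv=5).mean())
print(acc)  # near 0.5: the two copies are indistinguishable given y
```

Cross-validated accuracy meaningfully above chance would indicate the posterior draws are not exchangeable with the generating theta, i.e., a calibration failure.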


Another way to look at it: the equation shows two distributions that were constructed to be identical.

We start with a joint distribution p(y,\theta). For any arbitrary random variable X we can write a joint distribution p(y,\theta,x).

If the distribution of X is independent of Y and \Theta, this factorizes as p(y,\theta,x)=p(y,\theta)p(x). In the more general case it is p(y,\theta,x)=p(y,\theta)p(x \mid y,\theta). Now suppose the distribution of X depends only on Y; then p(y,\theta,x)=p(y,\theta)p(x \mid y).

We can write p(y,\theta) as p(y)p(\theta \mid y). We end up with p(y,\theta,x)=p(y)p(\theta \mid y)p(x \mid y).

If the distribution p(x \mid y) is not arbitrary, because it was chosen to have the same form as p(\theta \mid y), then… it has the same form as p(\theta \mid y).