SBC with histogram/ecdf vs treating credible intervals as confidence intervals

Ok, this is not really about Stan, but I guess that there are enough SBC experts around to answer me :)

(Take into account that maybe my question is super naive.)

Cook, Gelman, and Rubin (2006) write the following

The basic idea is that if we draw parameters from their prior distribution and draw data from their sampling distribution given the drawn parameters, and then perform Bayesian inference correctly,
the resulting posterior inferences will be, on average, correct. For example, 50% and 95% posterior intervals will contain the true parameter values with probability 0.5 and 0.95,

This is to verify that the credible intervals are actually confidence intervals as well, isn’t it?
Regardless of that, I was wondering the following.

Say that I run 100 simulations à la SBC. I sample 100 times:
\tilde{\mu} \sim p(\mu)
\tilde{\sigma} \sim p(\sigma)
\tilde{D} \sim p(D| \tilde{\mu},\tilde{\sigma})

I calculate the rank statistic for each simulation, and also the 50% credible interval as below, where value stores the ground truth of mu and sigma for each simulation.

> df_sbc %>% print(n=10)
# A tibble: 200 × 7
     sim var    rank   X25.   X50.  X75.  value
   <dbl> <chr> <dbl>  <dbl>  <dbl> <dbl>  <dbl>
 1     1 mu      435 -0.167 0.520  1.26   0.975
 2     1 sigma   403  7.55  8.02   8.56   8.20 
 3     2 mu      156 -0.309 0.0260 0.368 -0.322
 4     2 sigma   215  3.10  3.34   3.61   3.17 
 5     3 mu      652  4.11  4.59   5.03   5.90 
 6     3 sigma   429  4.12  4.44   4.79   4.62 
 7     4 mu       36  2.42  2.61   2.81   2.15 
 8     4 sigma   550  1.66  1.80   1.96   2.04 
 9     5 mu       80 -0.272 0.0115 0.293 -0.456
10     5 sigma   281  2.39  2.59   2.80   2.53 
# … with 190 more rows
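The rank computation described above can be sketched with a toy conjugate-normal model. This is my own minimal illustration in Python/NumPy (not the model from the thread — I fix sigma to keep the posterior exact), just to make the rank statistic concrete:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes, not the ones from the post
n_sims, n_obs, n_post = 1000, 20, 99

ranks = []
for _ in range(n_sims):
    mu_true = rng.normal(0.0, 1.0)            # mu ~ p(mu) = N(0, 1)
    y = rng.normal(mu_true, 1.0, size=n_obs)  # D ~ p(D | mu), sigma known = 1
    # Exact conjugate posterior for a N(0, 1) prior and N(mu, 1) likelihood
    post_var = 1.0 / (1.0 + n_obs)
    post_mean = post_var * y.sum()
    draws = rng.normal(post_mean, np.sqrt(post_var), size=n_post)
    # Rank statistic: number of posterior draws below the true value
    ranks.append(int((draws < mu_true).sum()))

ranks = np.asarray(ranks)
# With a correct posterior, the ranks are uniform on {0, ..., n_post},
# so the mean normalized rank should be close to 0.5
print(ranks.mean() / n_post)
```

With a correctly specified model, a histogram of these ranks should look flat, which is exactly what the SBC diagnostics check.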

Why is it an improvement to look at the histogram and ECDF of the ranks, rather than to use the 50% intervals and just check whether value is inside the 50% interval 50% of the time, below it 25% of the time, and above it 25% of the time?

I can even use qbinom to get a 95% interval for each count:

> qbinom(c(0.025,0.975), size = 100, prob = .5)
[1] 40 60
> qbinom(c(0.025,0.975), size = 100 , prob = .25)
[1] 17 34
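For what it's worth, the same binomial quantiles can be reproduced outside R; here is my translation using `scipy.stats.binom` (the `ppf` method is SciPy's quantile function, analogous to R's `qbinom`):

```python
# Same binomial tolerance bounds as the qbinom calls above (my translation)
from scipy.stats import binom

# 95% interval for the count inside a 50% credible interval, out of 100 sims
print(binom.ppf([0.025, 0.975], n=100, p=0.5))   # matches qbinom: [40. 60.]
# 95% interval for the count below (or above) it
print(binom.ppf([0.025, 0.975], n=100, p=0.25))  # matches qbinom: [17. 34.]
```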

Is it because this is less sensitive to bad calibration than the histogram/ECDF plot? Do I need more simulations here to conclude something? Or am I missing something?


I think so (hopefully SBC experts will chime in to verify), but for me the ECDF diagnostic visual is more elegant in the sense that it eliminates the decision of which intervals to check. Why would we think the inside/outside ratio of the central 50% interval is particularly diagnostic? It could be that that ratio comes out as expected, but some other interval that we haven’t chosen shows pathology (e.g. maybe the 12.5% that fall below the interval all fall right up near that boundary, leaving the full tail unexplored).


A couple reasons:

  1. The statement “posterior ranks are uniformly distributed” is (almost) equivalent to “all X% credible intervals contain the true value X% of the time”, i.e. you test all the intervals at once.
  2. With the ECDF plots, you not only test all intervals at once, but you also get a test that accounts for the non-independence in coverage of individual intervals and allows small deviations to “add up”, so you can notice a failure even when each individual interval would look good.
  3. Central posterior intervals are particularly poorly suited for diagnostics, because “all central intervals are calibrated” does not imply “all intervals are calibrated”. A simple counterexample: imagine the ranks are uniformly distributed between 0 and 50%. This will result in all central intervals having the expected coverage, despite the true value always being smaller than the posterior median. In contrast, if all “leftmost” intervals are well calibrated, then all intervals have to be calibrated.
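The counterexample in point 3 is easy to check numerically. Here is my own Monte Carlo sketch (Python/NumPy, names are mine), where the normalized rank plays the role of the posterior CDF evaluated at the true value:

```python
import numpy as np

rng = np.random.default_rng(0)
# Pathological normalized ranks: uniform on [0, 0.5] instead of [0, 1]
u = rng.uniform(0.0, 0.5, size=100_000)

# "Truth inside the central 50% interval" <=> normalized rank in (0.25, 0.75)
central = np.mean((u > 0.25) & (u < 0.75))
# "Truth inside the leftmost 50% interval" <=> normalized rank below 0.5
leftmost = np.mean(u < 0.5)

print(central)   # close to 0.50: the central interval looks perfectly fine
print(leftmost)  # close to 1.00: the leftmost interval exposes the problem
```

The central 50% interval shows nominal coverage even though the truth is always below the posterior median; the leftmost 50% interval covers essentially always, which is the miscalibration the ECDF/rank plots would flag.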

Still, looking at coverage is IMHO useful and is implemented in the SBC package via empirical_coverage and plot_coverage. I think some of the advantages/limitations of coverage are shown at SBC rank visualizations • SBC

Does that explain the issue?


Mostly yes, but to be honest, the following thing is not clear at all to me.

“A simple counterexample: imagine ranks are uniformly distributed between 0 and 50% . This will result in all central intervals having expected coverage, despite true values always being smaller than posterior median”

One limitation of the plots in SBC rank visualizations • SBC is that they are based on the 95% interval, and they compare the proportion inside and outside. At least for me, looking separately at what falls to the left and to the right of a central interval seems very intuitive. (And 50% intervals are nice because there is enough probability mass outside of them as well.)


To elaborate a bit: Suppose we do a bunch of simulations and fit those with perfectly OK models that produce the correct posterior distributions. Now we take one of the simulations and flip the posterior samples around the true value of one (or multiple) parameters, i.e. new_samples = true_value - (old_samples - true_value). Some things to note:

  1. Except for some very special cases, new_samples now represent an incorrect posterior distribution
  2. Rank of true_value in new_samples is max_rank - rank_in_old_samples
  3. true_value lies within X% central interval in new_samples if and only if it lies within X% central interval in old_samples

So we can do this “flip” transform to any subset of the posterior distributions and it will preserve the coverage of all central intervals, despite the posteriors being wildly incorrect. The situation I discussed previously is a special case of this: if you “flip” only the posteriors that have true_value > median(old_samples), your ranks will be uniformly distributed between 0 and max_rank / 2.

Note however, that this “flip” doesn’t generally preserve the coverage of “leftmost” intervals and thus ECDF (and rank plot) will typically let you notice this. To fool ECDF/rank histogram with flipping, you need to flip the posteriors of all simulations at once - in such a case, the coverage of all intervals will be correct, despite the posterior being incorrect.
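Properties 2 and 3 of the “flip” transform can be verified numerically. A small Python sketch with made-up toy numbers (my own illustration, not code from the SBC package):

```python
import numpy as np

rng = np.random.default_rng(2)

# One simulated "fit": posterior draws and the true parameter value (toy numbers)
true_value = 0.3
old = rng.normal(true_value + 0.2, 1.0, size=999)  # some posterior draws
new = true_value - (old - true_value)              # the "flip" transform

# Property 2: the rank of the truth flips to max_rank - old_rank
old_rank = (old < true_value).sum()
new_rank = (new < true_value).sum()
assert new_rank == len(old) - old_rank

# Property 3: membership in the central X% interval is preserved for any X
for x in (0.5, 0.8, 0.95):
    lo, hi = (1 - x) / 2, (1 + x) / 2
    in_old = np.quantile(old, lo) < true_value < np.quantile(old, hi)
    in_new = np.quantile(new, lo) < true_value < np.quantile(new, hi)
    assert in_old == in_new

print("flip preserves central intervals but mirrors the rank")
```

The central intervals of the flipped draws are exact mirror images around the true value, which is why their coverage cannot change, while the rank jumps to the opposite tail.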

I am not sure I understand the issue here. Just to be clear, there are two types of “intervals” in the rank plots. On one hand, there is a summary of the coverage of posterior intervals from the fits performed: in the ECDF plot, the ECDF can be thought of as “leftmost” intervals of various widths; in the rank histogram, each bin represents a narrow interval of fixed width in sequence; the coverage plot by default uses central intervals of various widths. On the other hand, there is an uncertainty interval (by default 95%) for the coverage of the posterior intervals: in the ECDF plot and rank histogram this interval corresponds to the expected variability under a uniform distribution, while in the coverage plot it corresponds to the remaining uncertainty about the true coverage. So all the plots are based on many posterior intervals, while showing a single uncertainty interval.

If I understand your initial suggestion, you want to look at several central posterior intervals and then compute a single uncertainty interval as well (note that your suggested way to compute the uncertainty is the same as the one used for the rank histogram plot at SBC/plot.R at master · hyunjimoon/SBC · GitHub). If that’s correct, then I don’t think it could work better than looking at all intervals at once as the plots do.

Hope that clarifies more than confuses.


Very interesting example, thanks!

Sorry, what I said was super confusing. I wanted to say that plot_coverage cannot distinguish whether the model underestimates or overestimates, right? But that information is available: one can also check the number of times the ground truth falls to the left of the central interval versus to the right of it. I just have no idea how to make a nice plot out of that. In any case, thanks for your explanation, it’s clear to me now :)


That’s kind of the point of the ECDF. At every point, look at how much is on the left (the rest is on the right) and compare that to how much we would expect to be on the left.
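As a sketch of that idea (my own toy code, not the SBC package implementation): compare the ECDF of the normalized ranks against the uniform CDF and look at the largest gap between the two.

```python
import numpy as np

rng = np.random.default_rng(3)

def ecdf_max_gap(norm_ranks, grid=np.linspace(0, 1, 101)):
    """Largest gap between the ECDF of normalized ranks and the uniform CDF."""
    ecdf = np.mean(norm_ranks[:, None] <= grid[None, :], axis=0)
    return np.max(np.abs(ecdf - grid))

good = rng.uniform(0, 1, size=5000)   # ranks from a well-calibrated model
bad = rng.uniform(0, 0.5, size=5000)  # the pathological half-uniform ranks

print(ecdf_max_gap(good))  # small: ECDF hugs the diagonal
print(ecdf_max_gap(bad))   # large (around 0.5): clear miscalibration
```

The SBC plots essentially do this comparison at every rank value at once, with simultaneous confidence bands instead of a single summary number.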