I am confused about 2 and 3, but as this post points out, for a fixed FP rate an underpowered test decreases TP and therefore the FDR increases. So my logic was that 2 is false while 3 is true. However, I have read that SBC has a large false positive rate as well…
I am also curious about other cases in which SBC can be underpowered. Thanks!
One case where SBC could give a misleading result is a small number of prior samples. We cannot reject the null hypothesis (= uniform ranks) because one cannot tell whether ranks of 1, 5, 10 out of 10 are uniform or not. So a wrong model combined with a small sample size would contribute to the false positive rate of an SBC test.
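To make this concrete, here is a minimal sketch (using a generic Kolmogorov–Smirnov uniformity test from SciPy purely as an illustration, not as part of SBC itself):

```python
# Minimal sketch: with only a handful of rank statistics, a generic
# uniformity test cannot reject much of anything, so a miscalibrated
# model can easily pass unnoticed.
import numpy as np
from scipy.stats import kstest

max_rank = 10

def uniformity_pvalue(ranks):
    # Map the discrete ranks 1..max_rank to (0, 1) and run a
    # Kolmogorov-Smirnov test against the standard uniform distribution.
    u = (np.asarray(ranks) - 0.5) / max_rank
    return kstest(u, "uniform").pvalue

print(uniformity_pvalue([1, 5, 10]))  # large p-value: looks "uniform enough"
print(uniformity_pvalue([4, 5, 6]))   # also far from significant, despite the clumping
```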
Without an explicit test for uniformity, concepts like false and true discovery rates, significance, and power aren't immediately relevant to Simulation-Based Calibration. In particular, the Simulation-Based Calibration™ method as defined in the original paper explicitly avoided any particular test because they all introduced their own limitations, limitations that the visualizations were largely able to avoid.
Null-hypothesis significance testing is best interpreted as rejecting the null hypothesis in favor of a particular alternative hypothesis. In the Simulation-Based Calibration setting that means tests that reject the uniform-rank hypothesis in favor of an explicit deviation from uniformity. Many generic uniformity tests are underpowered because the alternative hypothesis is too generic and hence not all that sensitive. The only way to increase power is to design a test that considers a more narrow alternative hypothesis, for example just “smiles” or “frowns” or concentration toward large ranks or small ranks. While this might increase the power it also limits the particular non-uniformities to which the test is sensitive.
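As a rough illustration only (the particular statistic here is a toy choice, not a recommendation): a one-sided test on the mean rank is sensitive to concentration toward large ranks but blind to symmetric deviations like smiles.

```python
# Illustrative "narrow" test: a one-sided z-test on the mean rank.
# It gains power against ranks piling up at the high end but is, by
# construction, blind to symmetric deviations such as smiles.
import numpy as np
from scipy.stats import norm

def upper_concentration_pvalue(ranks, max_rank):
    n = len(ranks)
    mean0 = max_rank / 2.0                      # mean of a uniform on 0..max_rank
    var0 = ((max_rank + 1) ** 2 - 1) / 12.0     # variance of that discrete uniform
    z = (np.mean(ranks) - mean0) / np.sqrt(var0 / n)
    return 1.0 - norm.cdf(z)                    # small => ranks concentrate high

rng = np.random.default_rng(0)

skewed = rng.integers(60, 101, size=200)        # ranks piled near the top of 0..100
print(upper_concentration_pvalue(skewed, 100))  # tiny p-value: deviation detected

smile = np.concatenate([rng.integers(0, 11, 100), rng.integers(90, 101, 100)])
print(upper_concentration_pvalue(smile, 100))   # around 0.5: the test misses the smile
```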
If one does want to confront the difficulties of trying to systematically test for uniformity, then the behavior of that test, in particular false and true positive rates and the like, will depend on the particular details of the chosen test. It’s hard, if not impossible, to say much of anything in general.
Is this a big problem? I mean, if we have a set of behaviours we want to test against and design the tests to have good power, is the lack of flexibility really a hindrance? After all, these are ranks we are talking about: a low-dimensional projection of the posterior draws. I would hazard that you can’t have a rich set of behaviours, and thus it’s best to concentrate on testing against a select few.
The behaviors that manifest in the rank plots aren’t all that limited by the restriction to a one-dimensional pushforward, and the histogram visualization is able to exhibit any systematic deviant behavior for the analyst, not just those that one might expect.
Relying on a suite of tests for predefined deviations not only ignores any other deviant behaviors but also introduces a very subtle multiple comparisons problem. In other words, even if the tests are high-powered individually, they may not be as well behaved when all are run at the same time, and because the systematic behaviors are nontrivially correlated, working out the multiple comparisons behavior is nigh impossible.
I know you and your coauthors left quantitative testing as an area for future work in the SBC paper, and recommended just visualizing the histogram. For those of us foolish enough to attempt a quantitative measure, are there any cutting-room-floor suggestions?
I’d been playing with a Dirichlet over histogram bin proportions: set the prior alpha for each bin to 1, increment each bin’s alpha by its observed count of rank statistics, which gives you a posterior over the bin proportions, and then you can calculate the probability of various pathologies (smiles, frowns, etc.).
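A minimal sketch of what I mean, where the particular “smile” criterion is just one illustrative way to encode a pathology:

```python
# Dirichlet over histogram bin proportions: flat Dirichlet(1, ..., 1) prior,
# add the observed rank counts, then estimate the posterior probability of a
# "smile" (excess mass in the two outer bins) by Monte Carlo.
import numpy as np

def posterior_prob_smile(rank_counts, n_draws=10_000, seed=0):
    rng = np.random.default_rng(seed)
    alpha = 1.0 + np.asarray(rank_counts, dtype=float)  # prior alpha = 1 plus counts
    theta = rng.dirichlet(alpha, size=n_draws)           # posterior draws of bin proportions

    n_bins = theta.shape[1]
    outer_mass = theta[:, [0, -1]].sum(axis=1)           # mass in the two extreme bins
    # "Smile": the outer bins jointly carry more than their uniform share.
    return np.mean(outer_mass > 2.0 / n_bins)

counts = [40, 22, 18, 20, 19, 21, 17, 23, 21, 39]        # a hypothetical rank histogram
print(posterior_prob_smile(counts))                       # high probability of a smile
```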
I take your point about the limitations of quantitative tests and the advantages of visual inspection, but for some of us the choice is quantitative test or no test, and I’d rather have an automated test for several known, common pathologies than no tests at all.
Keep in mind that all this analysis can do is convert a frequentist testing paradigm into a Bayesian testing paradigm, but the fundamental limitations of testing (i.e. reducing inferences to explicit decisions) still arise. In particular one has to consider how to turn the Dirichlet posterior into probabilities for the various pathologies (which can be computationally challenging and very sensitive to seemingly irrelevant prior assumptions) and then calibrate those probabilities into actual false positive and true positive rates (which can be very different from the posterior probabilities and requires some assumption of an underlying model).
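To sketch what that calibration step involves (the decision rule and the 0.9 threshold below are purely illustrative assumptions): simulate rank histograms from a model that really is calibrated, apply the thresholded posterior-probability rule, and record how often it fires. That empirical rate, not the posterior probability itself, is the false positive rate of the procedure, and it has to be checked directly.

```python
# Illustrative calibration of a Bayesian "smile" test: generate truly uniform
# ranks many times, apply a posterior-probability decision rule, and count how
# often the rule flags a pathology that isn't there.
import numpy as np

def flags_smile(counts, rng, threshold=0.9, n_draws=2_000):
    # Flag a "smile" if the posterior probability that the two outer bins
    # carry more than their uniform share exceeds the threshold.
    theta = rng.dirichlet(1.0 + counts, size=n_draws)
    prob = np.mean(theta[:, [0, -1]].sum(axis=1) > 2.0 / counts.shape[0])
    return prob > threshold

rng = np.random.default_rng(1)
n_bins, n_sims, n_ranks = 10, 500, 200

false_positives = 0
for _ in range(n_sims):
    ranks = rng.integers(0, n_bins, size=n_ranks)               # a calibrated model
    counts = np.bincount(ranks, minlength=n_bins).astype(float)
    false_positives += flags_smile(counts, rng)

print(false_positives / n_sims)  # empirical false positive rate of this decision rule
```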
I’m not saying that testing is impossible, just that it’s difficult and very time-consuming to evaluate the practical performance of a testing procedure in circumstances like these (in many cases much more time-consuming than just identifying a reasonably small subset of meaningful variables and looking at the corresponding histograms). Putting that work in and deriving an evaluation based on appropriate assumptions is awesome when feasible! Evaluations based on default assumptions, however, typically lead to disappointing performance, with each analysis disappointing in some new way.