# SBC doesn't guard against ignoring (some) data/parameters

So I’ve been recommending simulation-based calibration (SBC) as the go-to method for validating your models, and I try to use it in my practice as much as I can. In the past few days I noticed there is a class of bugs that doesn’t get caught by SBC: the model ignoring data. Similarly, ignoring a parameter in the likelihood part of the model may produce only very small deviations in the SBC plots. Maybe that is obvious to some, but it wasn’t obvious to me, so I’m sharing the experience.

Remember that the main idea behind SBC is that when you repeatedly simulate data from a model’s prior and fit the model to the simulated data, you end up recovering the prior. If your model accidentally ignores all the data, it will recover the prior and pass SBC checks very easily. The same holds if your model ignores only part of the data but is otherwise correct - for example, I had an indexing error that made my model ignore the last datapoint in each group of observations, and it passed SBC with flying colors.
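To make this concrete, here is a minimal sketch in Python - not the model from this post, but a toy conjugate normal model with hypothetical names - showing that a “fit” which ignores the data entirely still yields the same uniform SBC ranks a correct fit yields:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_draws, n_obs = 1000, 100, 10

def sbc_ranks(fit):
    """Rank of the true parameter among posterior draws, one per SBC step."""
    ranks = []
    for _ in range(n_sims):
        theta = rng.normal()                 # draw parameter from prior N(0, 1)
        y = rng.normal(theta, 1.0, n_obs)    # simulate data from the model
        draws = fit(y)                       # "posterior" draws from the fit
        ranks.append(np.sum(draws < theta))
    return np.array(ranks)

def correct_fit(y):
    # exact conjugate posterior: N(sum(y) / (n + 1), 1 / (n + 1))
    v = 1.0 / (n_obs + 1)
    return rng.normal(np.sum(y) * v, np.sqrt(v), n_draws)

def buggy_fit(y):
    # bug: ignores y completely and just samples the prior
    return rng.normal(0.0, 1.0, n_draws)

ranks_ok = sbc_ranks(correct_fit)
ranks_bug = sbc_ranks(buggy_fit)
# both rank histograms come out close to uniform on 0..n_draws, so the
# SBC histogram alone cannot tell the broken fit from the correct one
```

Both rank distributions look uniform, which is exactly the failure mode described above.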

To a lesser extent, SBC plots can look mostly fine when you ignore a parameter (in my case, one term in the linear predictor of the model). This manifested as a very slight skew in the histogram for the standard deviation term, but other than that, the SBC plots looked nice.

I was able to notice both the “ignore all data” case and the “ignore one parameter” case easily because, in addition to the SBC histograms, I always plot a scatter of the true value vs. the posterior mean/median. This gives a sense of the precision with which the model estimates each parameter, so when the model ignores data or parameters, it shows up as a lack of correlation in this plot.

Here is an SBC plot + scatter from an OK model (sorry for showing only 50 SBC steps, but it takes time to run and I have work to do :-) ):

And here are the same plots when the beta[1] parameter has been left out of the model likelihood:

You’ll notice that the SBC plots look roughly the same but the scatter shows that the posterior median is not influenced by the true value in the second case…

Also, the data-ignoring problems become apparent when doing posterior predictive checks, so yay for PP checks!
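As an illustration of why PP checks catch this, a toy sketch (again a hypothetical normal model, not the actual model from this post): with a fit that ignored the data, the observed test statistic lands far out in the tail of its replicated distribution.

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs, n_draws = 50, 200

theta_true = 2.5
y = rng.normal(theta_true, 1.0, n_obs)       # "observed" data

# "posterior" draws from a model that ignored y: just the prior N(0, 1)
bad_draws = rng.normal(0.0, 1.0, n_draws)
# one replicated dataset per posterior draw
y_rep = rng.normal(bad_draws[:, None], 1.0, (n_draws, n_obs))

# PP check on a simple test statistic (the mean): the observed mean sits
# far in the tail of the replicated means, exposing the ignored data
p_value = np.mean(y_rep.mean(axis=1) >= y.mean())
```

The tail probability comes out tiny here, which is the PP check flagging the bug that SBC missed.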

Hope that helps somebody :-)

It’s “obvious” in the sense that SBC also correctly calibrates models with missing data: there is no practical difference between a model that mistakenly ignores a datapoint and a model fit to a dataset that never contained that datapoint in the first place. I don’t know if it’s “obvious” in the sense that someone has thought of it before.

The theoretical justification for SBC is that if you draw \theta from the prior p(\theta) and then x from the conditional p(x|\theta), you get exactly the same joint distribution over (\theta, x) as you would if you first drew x from the marginal distribution p(x)=\int p(x|\theta)d\theta and then \theta from the posterior p(\theta|x). That is the definition of the posterior.
So in SBC you make one draw from the marginal data distribution, get one true posterior draw for free, and then use your favorite algorithm to draw a hundred more approximate posterior samples. If the algorithm works, the true posterior draw is IID (independent and identically distributed) with the approximate draws.
Note that–although the original SBC paper didn’t emphasize this–you can use an arbitrary ranking function that depends on both parameters and data, and the rank distribution is still expected to be uniform.
A data-dependent test function is more difficult to “fake”.
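A sketch of that last point (toy normal model, hypothetical names): ranking the raw parameter lets a fit that just returns prior draws pass, but ranking a data-dependent test function - here the log-likelihood - exposes it.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_draws, n_obs = 1000, 100, 10

ranks_theta, ranks_loglik = [], []
for _ in range(n_sims):
    theta = rng.normal()                      # prior draw N(0, 1)
    y = rng.normal(theta, 1.0, n_obs)         # simulated data
    draws = rng.normal(0.0, 1.0, n_draws)     # buggy fit: prior draws again

    # rank of the parameter itself: uniform, the bug goes unnoticed
    ranks_theta.append(np.sum(draws < theta))

    # data-dependent test function: the (unnormalized) log-likelihood
    ll = lambda t: -0.5 * np.sum((y - t) ** 2)
    ranks_loglik.append(np.sum(np.array([ll(d) for d in draws]) < ll(theta)))

ranks_theta = np.array(ranks_theta)
ranks_loglik = np.array(ranks_loglik)
# ranks_theta looks uniform (mean ~ n_draws / 2), but ranks_loglik piles
# up near n_draws: the true parameter fits the data better than prior
# draws do, exposing the fake posterior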

It’s always hard to tell what’s obvious in retrospect, but in general, test harnesses only test what they’re given. So if you test the wrong function and get an all-clear, there’s not much the test can do about not getting the right function to test.

To riff off of this comment: in software engineering we’d typically produce tests as an effort separate from the code being tested, and often under a different paradigm from the one in which the code was written; often the tests are produced before the code, and often by a separate person. All of it serves to decorrelate human error between the spec/tests and the code.

Perhaps something similar may port well to a model testing workflow too?

It’s not so simple, because there are two components that are tested against each other. You have the (assumed to be correct) RNG that draws prior predictions, and the algorithm under test that tries to recover the parameters of the simulation. @martinmodrak reports a situation where the prior predictive part is completely correct and the attempted posterior inference has no relation to the simulated data, but SBC still sees no problem. It’s not like we have a buggy inv_logit that always returns zero passing autodiff tests. It’s like we have a working logit and a test framework that is designed to use the identity inv_logit(logit(x)) == x, but a buggy inv_logit that always returns zero still passes the test.

Thanks everybody for the input. I would just like to say that I am not trying to make a big point - it is just an observation that I hope can be useful to somebody. Just one more thing to watch out for. And maybe plotting the precision of your estimates in some way would be a useful addition to SBC plots.

Do you have code to produce those scatterplots? It seems like all the data needed is available from the SBC process. Each time you run a model, you generate true parameter values, which are typically stored in the pars_ vector. If there were a corresponding pars vector for the sampled parameters, it would be a small addition to the sbc function to report the difference between the true parameter and the median sampled parameter.
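I can’t speak for the original code, but here is a rough Python sketch of the idea on a toy conjugate normal model; in an rstan workflow the true values and posterior draws would come from the SBC runs rather than being simulated inline as they are here:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims, n_draws, n_obs = 200, 100, 10

true_vals, post_medians = [], []
for _ in range(n_sims):
    theta = rng.normal()                  # prior draw = the "true" value
    y = rng.normal(theta, 1.0, n_obs)
    v = 1.0 / (n_obs + 1)                 # exact conjugate posterior
    draws = rng.normal(np.sum(y) * v, np.sqrt(v), n_draws)
    true_vals.append(theta)
    post_medians.append(np.median(draws))

true_vals = np.array(true_vals)
post_medians = np.array(post_medians)
r = np.corrcoef(true_vals, post_medians)[0, 1]
# a healthy fit gives r close to 1; a fit that ignores the data gives
# r ~ 0.  To plot: plt.scatter(true_vals, post_medians) plus a y = x line
```

The correlation coefficient alone already summarizes what the scatterplot shows visually.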

Simulation-based calibration doesn’t check correctness; it just checks consistency. The method has no way to determine whether or not you accidentally defined the wrong observational model, so long as that observational model is used consistently.

In order to identify a mistake like this you have to move beyond the confines of simulation-based calibration into more general calibrations; for example, the z-score/shrinkage plot in https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#23_model_sensitivity would give you some indication of having used the wrong observational model when fitting your posteriors.
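For a rough sense of what that diagnostic computes, a hedged sketch on a toy normal model (hypothetical names, not the case study’s code): posterior shrinkage is 1 minus the ratio of posterior to prior variance, and a fit that ignores the data shows essentially zero shrinkage.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sims, n_draws, n_obs, prior_var = 200, 100, 10, 1.0

def z_and_shrinkage(fit):
    """Posterior z-score and shrinkage for each simulated dataset."""
    zs, shrink = [], []
    for _ in range(n_sims):
        theta = rng.normal()
        y = rng.normal(theta, 1.0, n_obs)
        draws = fit(y)
        zs.append((draws.mean() - theta) / draws.std())
        shrink.append(1.0 - draws.var() / prior_var)
    return np.array(zs), np.array(shrink)

def correct_fit(y):
    # exact conjugate posterior N(sum(y) / (n + 1), 1 / (n + 1))
    v = 1.0 / (n_obs + 1)
    return rng.normal(np.sum(y) * v, np.sqrt(v), n_draws)

def buggy_fit(y):
    # ignores the data: "posterior" is just the prior
    return rng.normal(0.0, 1.0, n_draws)

_, shrink_ok = z_and_shrinkage(correct_fit)
_, shrink_bug = z_and_shrinkage(buggy_fit)
# shrink_ok sits near 1 - 1/(n_obs + 1) ~ 0.9, while shrink_bug hugs 0:
# zero posterior contraction flags that the data was never used
```

Plotting z-score against shrinkage per parameter, as in the linked case study, makes the cluster at zero shrinkage immediately visible.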
