I wanted to ask about a confusion of mine that has come up a few times recently. Mathematical psychologists and cognitive scientists often worry about parameter recovery. Two Twitter threads today (https://twitter.com/Nate__Haines/status/1293600213756715010, https://twitter.com/PeterKvam/status/1293540942700404741) discussed its importance for validating inferences in cognitive model-based analyses.
I think investigating parameter recovery makes a lot of sense for maximum likelihood estimation, where uncertainty estimates are only approximate: it is risky to just go with the fitted parameters, so we want to check how close we might get under ideal conditions.
However, now that many psychologists and cognitive scientists are switching over to Bayesian methods, it seems to me that parameter recovery conflates two things: model identifiability and accurate computation. I am wondering to what extent recovery checks aimed at verifying accurate computation are superseded by diagnostics like simulation-based calibration (SBC).
There is no agreed-upon way to check parameter recovery in the psychology literature, but usually it is done by correlating the true parameters with their estimates and hoping the correlations are high enough.
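To make the procedure concrete, here is a minimal sketch of that correlation-based check. I use a toy conjugate-normal model so the posterior mean is available in closed form; in a real analysis the point estimates would of course come from your actual Stan/JAGS/brms fit. All names here are illustrative, not from any published pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: y_i ~ Normal(mu, 1) with prior mu ~ Normal(0, 1).
# Conjugacy gives the posterior mu | y ~ Normal(n * ybar / (n + 1), 1 / (n + 1)),
# so the posterior mean is computable without MCMC.
def posterior_mean(y):
    n = len(y)
    return n * y.mean() / (n + 1)

n_sims, n_obs = 200, 50
true_mu = rng.normal(0.0, 1.0, size=n_sims)      # draw "true" parameters from the prior
estimates = np.empty(n_sims)
for s in range(n_sims):
    y = rng.normal(true_mu[s], 1.0, size=n_obs)  # simulate a dataset from each true value
    estimates[s] = posterior_mean(y)             # "fit" and record a point estimate

r = np.corrcoef(true_mu, estimates)[0, 1]        # the usual recovery summary
print(f"recovery correlation: {r:.3f}")
```

Note that this summary only tells you the point estimates track the truth on average; it says nothing about whether the posterior's width or shape is computed correctly, which is the distinction the rest of this post is about.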
With maximum likelihood, both lack of identifiability and inaccurate computation will lead to poor parameter recovery, so I think this is a reasonable check to run.
But with Bayesian methods, I think, lack of identifiability will show up as a very wide or multimodal posterior, but the posterior itself will not be inaccurate, provided you sample long enough to explore the whole thing AND no issues are flagged by diagnostics (although these are, of course, imperfect and not guaranteed to catch every problem). My understanding, though, is that SBC plus divergences are currently the best way to tell whether your posterior computation is inaccurate.
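For readers less familiar with SBC, the core idea can be sketched in a few lines: draw a parameter from the prior, simulate data, draw from the posterior, and record the rank of the true parameter among the posterior draws; if the computation is correct, the ranks are uniform. Again I use the toy conjugate model so exact posterior draws are available; in a real workflow the draws come from MCMC and should be thinned to be roughly independent.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same toy model: y_i ~ Normal(mu, 1), prior mu ~ Normal(0, 1).
n_sims, n_obs, n_draws = 1000, 20, 99

ranks = np.empty(n_sims, dtype=int)
for s in range(n_sims):
    mu = rng.normal()                               # 1. draw parameter from the prior
    y = rng.normal(mu, 1.0, size=n_obs)             # 2. simulate data given that parameter
    post_mean = n_obs * y.mean() / (n_obs + 1)      # 3. exact conjugate posterior;
    post_sd = np.sqrt(1.0 / (n_obs + 1))            #    in practice, MCMC draws go here
    draws = rng.normal(post_mean, post_sd, size=n_draws)
    ranks[s] = np.sum(draws < mu)                   # 4. rank of the truth among the draws

# Under correct computation, ranks are uniform on {0, ..., n_draws};
# a skewed or U-shaped rank histogram signals a biased or mis-calibrated posterior.
counts = np.bincount(ranks, minlength=n_draws + 1)
expected = n_sims / (n_draws + 1)
chi2 = np.sum((counts - expected) ** 2 / expected)  # rough uniformity check
print(f"chi-square vs. uniform ranks: {chi2:.1f}")
```

The point of the contrast with the previous sketch: a recovery correlation can look fine even when the posterior is too narrow or too wide, whereas the SBC rank histogram is sensitive to exactly that kind of miscalibration.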
Both Twitter threads mention that parameter recovery is useful as a form of power analysis, telling you how much data you need to make your model identifiable. I am very much on board with that. Parameter recovery, and fake-data simulation more generally, is also a great way to understand your model and find problems with it. But it seems to me that parameter recovery is largely irrelevant for checking computation. This has come up for me in two places recently:
- I recently tried to fit a model from a recent mathematical psychology methods paper in Stan using their simulated data and got lots of divergences and high R-hat values. I also tried their published JAGS code, which likewise produced several very high R-hats, even after tens of thousands of samples. The paper itself included a parameter recovery plot but no other mention of convergence issues. When I emailed the senior author, he responded that “Rhat is just a metric about the algorithm’s autocorrelation, not the model. So if you get high autocorrelation, that just means the chains have an expressed dependency, but the samples, once collapsed over iterations, should still be in the posterior range…I would note that the Rhat<1.1 rule is not the only thing that matters when it comes to assessing accuracy of a posterior estimate.” The author is a well-known Bayesian mathematical psychologist, so I’m worried that psychologists will conclude that a reasonable-looking parameter recovery plot can substitute for diagnostics like R-hat. Of course R-hat and divergences aren’t perfect, but they seem worth paying attention to, since it is easy to observe reasonable posterior-mean performance on a few simulated datasets without accurate posterior computation.
- I recently tried to publish a paper with an analysis that we fit using brms. Despite reporting that we observed no divergences or R-hat issues with a large number of samples, and that the posterior was unimodal and fairly tight, reviewers asked us to do a “parameter recovery.” We complied (showing correlations of the posterior means with the simulated true parameters), but it seemed rather irrelevant to me, given that the model didn’t appear to have any trouble identifying the relevant parts of the parameter space. If the reviewer had reason to think the computation was inaccurate, SBC would seem far more informative than parameter recovery. They might not have been familiar with SBC, though, and I didn’t feel confident enough in this to suggest doing SBC instead. So I’d love to know what best practices are here!
Overall, it seems to me that parameter recovery is not ideal for judging accurate computation, and that, once data have been collected, if the posterior suggests the parameters have been identified and there are no sampling issues, it is largely irrelevant for identifiability as well (assuming SBC flags no errors). That said, I am not entirely confident in these assessments, so it would be great to hear from those who are more knowledgeable! Is parameter recovery a useful diagnostic for checking Bayesian inferences?