Parameter recovery vs. sampling diagnostics (rhat/sbc/divergences)

I wanted to ask about a confusion of mine that has come up a few times recently. Many mathematical psychologists/cognitive scientists often worry about parameter recovery. Two twitter threads today (https://twitter.com/Nate__Haines/status/1293600213756715010, https://twitter.com/PeterKvam/status/1293540942700404741) discussed the importance of this for validating inferences in cognitive model-based analyses.
I think investigating parameter recovery makes a lot of sense for maximum likelihood estimation, where uncertainty estimates are only approximate: it is risky to just go with the fitted parameters, so we want to check how close we might get under ideal conditions.
However, now that many psychologists and cognitive scientists are switching over to Bayesian methods, it seems to me like “parameter recovery” conflates two things: model identifiability and accurate computation. I am wondering to what extent recovery checks aimed at verifying accurate computation are superseded by diagnostics like SBC.

There is no real agreed-upon way to check parameter recovery in the psychology literature, but usually it is done by looking at correlations of the true parameters with the estimates and hoping they are high enough.
With maximum likelihood, both lack of identifiability and inaccurate computation will lead to poor parameter recovery, so I think this is a reasonable thing to do.
But with Bayesian methods, I think, lack of identifiability will be reflected in a very wide or multimodal posterior, but the posterior will not be inaccurate if you simulate long enough to explore the whole thing AND you don’t have issues that are flagged by diagnostics (although these are, of course, imperfect and aren’t guaranteed to always work). My understanding, though, is that SBC + divergences are currently the best way to tell if your posterior computation is inaccurate.

Both twitter threads mention that “parameter recovery” is useful as a way to do a power analysis and tell how much data you need to make your model identifiable. I am very on board with that. Parameter recovery and fake data simulation more generally is also a great way to understand and find problems with your model. But it seems to me like parameter recovery is largely irrelevant for checking computation. This has come up for me in two places recently:

  1. I recently tried to fit a model from a recent mathematical psychology methods paper in Stan using their simulated data and got lots of divergences and high r-hat values. I also tried using their published JAGS code, which also resulted in several very high r-hats, even after tens of thousands of samples. The paper itself included a parameter recovery plot but no other mention of convergence issues. When I emailed the senior author, he responded that “Rhat is just a metric about the algorithm’s autocorrelation, not the model. So if you get high autocorrelation, that just means the chains have an expressed dependency, but the samples, once collapsed over iterations, should still be in the posterior range…I would note that the Rhat<1.1 rule is not the only thing that matters when it comes to assessing accuracy of a posterior estimate.” The author is a well known Bayesian mathematical psychologist so I’m worried that psychologists will think that a reasonable looking parameter recovery plot can substitute for other metrics like r-hat. Of course r-hat/divergences aren’t perfect but it seems like they should be worth paying attention to as it is easy to observe reasonable posterior mean performance on a few simulated datasets without accurate posterior computation.
  2. I recently tried to publish a paper with an analysis that we fit using brms. Despite reporting that we did not observe any divergences/rhat issues with a large number of samples and having a posterior that was unimodal and fairly tight, reviewers asked us to do a “parameter recovery.” We complied (showing correlations of the posterior mean with simulated true parameters) but it seemed sort of irrelevant to me given that the model didn’t appear to have any issues in identifying the relevant parts of the parameter space. If the reviewer had reason to think that the computation was inaccurate, it seems like SBC would be much more informative than parameter recovery. They might not have been familiar with SBC, though, and I didn’t feel confident enough in this to suggest doing SBC instead. Therefore, I’d love to know what best practices are here!

Overall, it seems to me that parameter recovery is not ideal for judging accurate computation and that, after data have been collected, if the posterior suggests the parameters have been identified and there are no sampling issues, then it’s largely irrelevant for identifiability as well (assuming SBC turns up no errors). That being said, I am not entirely confident in these assessments, so it would be great to hear from those who are more knowledgeable! Is parameter recovery a useful diagnostic for checking Bayesian inferences?


good question, and I’m curious to hear what others think. my thoughts:

  1. parameter recovery is nice as an estimation-agnostic model check. other things like r-hat are specific to sampling-based approaches. in terms of coming up with ‘best practices’ for modeling, it’s nice to have things like parameter recovery that can always be used & don’t require in-depth knowledge of the specific estimation approach used.

  2. I’ll defer to others on SBC, as I haven’t tried it, but being able to explore the posterior effectively and being able to recover parameters are separate issues. I have had models with good r-hats but bad recovery. I disagree with your anonymous mathematical psychologist correspondent about the need for good r-hats (also, isn’t r-hat based on comparing chains, not autocorrelation?), and would see model convergence as the first step, but if I’m drawing any conclusions about parameter values I would want to demonstrate good parameter recovery as well.


@davidh Thanks for this post. It brings up some important topics that, as you point out, seem to be misunderstood in many places.

I’m tagging some other people who I think might be interested in this thread because they’ve written about R-hat and/or SBC: @avehtari @andrewgelman @betanalpha @paul.buerkner @anon75146577 @Bob_Carpenter. I’m sure they will be amused by what your correspondent thought about convergence diagnostics, and maybe they can do a better job correcting it than I can. But here are a few comments in the meantime:

I agree with @Vanessa_Brown. The “senior author” who responded to you has some unfortunate misunderstandings about R-hat. In particular:

R-hat is basically a comparison of within-chain to between-chain variance. It is not unaffected by auto-correlation, but, as a diagnostic, the effective sample size is more closely tied to auto-correlation. Also, it’s not clear what they mean by “not the model”. R-hat certainly doesn’t tell you if your model is a good one for the phenomenon you’re studying, but the same algorithm will give different R-hat values for different models, so it’s not independent of the model.
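For concreteness, here’s a minimal sketch of the classic split-R-hat calculation as a within/between-chain variance comparison. This is just an illustrative toy function (a hypothetical `split_rhat`), not Stan’s implementation; in practice you’d use posterior::rhat(), which implements the improved rank-normalized version from the blog post linked below.

```r
# Toy sketch of classic split R-hat for a single parameter.
# `draws` is assumed to be an iterations x chains matrix of post-warmup draws.
split_rhat <- function(draws) {
  n <- floor(nrow(draws) / 2)
  # Split each chain in half so that trends within a chain also show up
  # as disagreement between "chains".
  halves <- cbind(draws[1:n, , drop = FALSE],
                  draws[(n + 1):(2 * n), , drop = FALSE])
  chain_means <- colMeans(halves)
  chain_vars  <- apply(halves, 2, var)
  W <- mean(chain_vars)                # average within-chain variance
  B <- n * var(chain_means)            # between-chain variance (scaled by n)
  var_hat <- (n - 1) / n * W + B / n   # pooled variance estimate
  sqrt(var_hat / W)                    # ~1 when chains agree, >1 otherwise
}
```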

Quoting @andrewgelman, R-hat was intended to address these questions:

  1. Is the distribution of the first part of a chain (after warm up) the same as the distribution of the second half of the chain?
  2. If I start the algorithm at two different places and let the chain warm up, do both chains have the same distribution?

(That’s from this blog post, which is actually about developing an improved version of R-hat)

I don’t think they are right about this. And this is different from R-hat. It’s true that high auto-correlation doesn’t necessarily mean you have bad estimates (although it often does because it can mean you haven’t actually explored enough of the posterior), but very high R-hat almost always indicates problems beyond simply autocorrelation issues. There’s nothing special about the particular cutoff value at which we say R-hat is high, but it’s not correct for your correspondent to say that in general high R-hat values can be ignored because it’s just autocorrelation and collapsing over iterations solves that.

Yeah that’s unfortunate. A reasonable looking parameter recovery plot is not a substitute for convergence diagnostics.

Yeah you’re totally right.

You’re right that SBC would be preferable here (and probably also that they haven’t heard of it). It’s not unrelated to parameter recovery, though. It just puts it in a more coherent framework.


Thanks @Jonah and @Vanessa_Brown! These were very insightful comments and it would be great to hear from any of the people you tagged. To be clear, I didn’t mean to endorse that professor’s understanding of R-hat (and I said r-hat, but I was in fact using the folded/split-half/rank-normalized version from the blog post)! I only included it to show that that attitude seems to be out there – glad I’ll be able to point to these responses if I ever encounter it again.

It’s true that high auto-correlation doesn’t necessarily mean you have bad estimates (although it often does because it means you haven’t actually explored enough of the posterior), but very high R-hat almost always indicates problems beyond simply autocorrelation issues.

Yep exactly, my concern that I emailed the author about was that I was running JAGS for 50,000 samples and still getting high r-hats, suggesting something else was going on. Not to mention the divergences when I ran it in my Stan version.
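For reference, on the Stan side these are the kinds of checks I mean – a rough sketch assuming `fit` is the stanfit object for my Stan version of the model (the 1.01 cutoff is just the usual convention):

```r
library(rstan)
check_hmc_diagnostics(fit)                  # divergences, treedepth, E-BFMI
rhats <- summary(fit)$summary[, "Rhat"]
names(rhats)[!is.na(rhats) & rhats > 1.01]  # parameters with worrying R-hat
```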

@Vanessa_Brown, I think I’m a bit confused by your second point – assuming good sampling, shouldn’t the inability to recover the parameters (i.e. lack of identifiability) be reflected by the posterior (so the posterior is very wide/multimodal)? The multimodal case might be difficult to explore but then usually (in my experience) that is reflected in the diagnostics. @Vanessa_Brown, when you say recover parameters, do you mean the correlation with the posterior mean?

This also leads to a question that was maybe below the surface of the original post, but I’ve been confused about it since the SBC paper came out and would love to hear future commenters’ thoughts: if identifiability issues can be handled by a well-sampled posterior, then the reason to do SBC/parameter recovery is to check computation. Assuming very low r-hats (<1.01 etc.), no divergences, and a decent number of samples, does the SBC (or parameter recovery) suggested by the reviewer add a lot to your confidence in the results? More broadly, should we always be doing SBC when we publish a paper? Stan+diagnostics has always seemed fairly robust to me, but I guess there’s always the possibility that you’ve come up with a model that is uniquely bad for Stan’s sampler?

I’ve been thinking something similar. I wish we could get away from “Simulating data from params, then recovering params” and toward “This model is estimating what it is supposed to, broadly” computational checking.

Either way, you’re simulating; and sometimes, that is prohibitively expensive in terms of time. But still - I’d so much rather just do SBC to show “the algorithm is working as it is supposed to” which implies that it can recover parameters anyway. But I think this concept is foreign to most reviewers in psychology. Really, the important thing for new methods, imo, is to show that the model and the estimator are doing what they are supposed to do; SBC seems like the best way to show this, and it puts aside questions of “What should I fix the parameters to?”

I think the hardest part of a simulation, aside from the time, is ‘picking’ a reasonable set of parameters. Really, this should be orthogonal to the question - If I want to prove my approach works, then SBC should be sufficient — “Across a whole host of possible parameter configurations, and data generated from these configurations, the probabilistic statements from the model posteriors hold up.” Our inferences are conditioned on the chosen ensemble of models anyway… I’m not sure it matters terribly what the exact true values are, nor what happens when a model is misspecified [unless that itself is the question of interest… but again, seems irrelevant in papers defining new methods, and one never knows if a model is misspecified in reality].
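To make the mechanics concrete, here’s a minimal sketch of the SBC loop for a toy conjugate normal-mean model; the prior, data size, and number of replications are arbitrary, and in a real analysis the closed-form posterior draws below would be replaced by an MCMC fit of your actual model (with the ranks checked by a uniformity test rather than by eye).

```r
# Toy SBC: prior mu ~ N(0, 1), data y | mu ~ N(mu, 1).
set.seed(1)
n_sims <- 200; n_obs <- 20; n_draws <- 100
ranks <- numeric(n_sims)
for (s in 1:n_sims) {
  mu_true <- rnorm(1, 0, 1)            # draw a "true" parameter from the prior
  y <- rnorm(n_obs, mu_true, 1)        # simulate data from that configuration
  post_sd <- sqrt(1 / (1 + n_obs))     # conjugate posterior stands in for MCMC
  post_mean <- post_sd^2 * sum(y)
  draws <- rnorm(n_draws, post_mean, post_sd)
  ranks[s] <- sum(draws < mu_true)     # rank of the truth among posterior draws
}
hist(ranks, breaks = 20)               # should look (roughly) uniform
```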

But I digress - I wish more people would push for computational verification, rather than parameter recovery.


Yeah I didn’t think you were endorsing it. Glad you’ll have something to point to going forward!

Ideally, yeah, although I don’t expect that to become the norm. It’s especially important when developing your own complicated models. Probably less important if you’re just doing e.g. a basic logistic regression with a function like rstanarm::stan_glm().

Another thing SBC can help catch is mistakes in your code. Not all kinds of mistakes, but definitely some kinds of mistakes will lead to bad SBC results.

@Vanessa_Brown, I think I’m a bit confused by your second point – assuming good sampling, shouldn’t the inability to recover the parameters (i.e. lack of identifiability) be reflected by the posterior (so the posterior is very wide/multimodal)? The multimodal case might be difficult to explore but then usually (in my experience) that is reflected in the diagnostics. @Vanessa_Brown, when you say recover parameters, do you mean the correlation with the posterior mean?

When I think of recovering parameters, I think of simulating performance at different combinations of values and making sure you get something close to those values - so including, but not limited to, checking for identifiability issues. As an example, we had a model (Associability-modulated loss learning is increased in posttraumatic stress disorder | eLife - note I was just learning stan when I published this so wouldn’t recommend it as modeling best practices, but oh well) that had a learning rate (governing how quickly people updated values) and an associability weight (governing how much the learning rate was reduced when the chosen stimulus had predictable outcomes in the recent past). Those parameters were separately identified, but associability weight could only be recovered when the learning rate was fairly high, since with a low learning rate, there wasn’t a lot that associability weight could do. Also, much of my work involves comparing parameter values between groups, so if I have identifiability issues in an otherwise well-behaved model, that’s still an issue - but that’s essentially the power discussion from the original twitter threads.
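In code, I mean something like the rough sketch below. Here a toy Gaussian mean stands in for a cognitive-model parameter, and the grid of true values, sample size, and use of a point estimate are all placeholder choices; for a real model you’d simulate the task and fit the model in Stan/brms inside the loop.

```r
# Toy parameter-recovery study: simulate at a grid of true values, "fit",
# and compare the recovered estimates to the truth.
set.seed(1)
true_vals <- seq(-2, 2, length.out = 25)   # grid of "true" parameter values
recovered <- sapply(true_vals, function(theta) {
  y <- rnorm(50, mean = theta, sd = 1)     # simulate a dataset at this value
  mean(y)                                  # stand-in for a posterior mean
})
plot(true_vals, recovered, xlab = "true value", ylab = "recovered estimate")
abline(0, 1)                               # perfect-recovery reference line
cor(true_vals, recovered)                  # the usual summary people report
```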

I also agree with @Stephen_Martin, though in my (clinical/applied) subfield, I’d just be happy if everyone did basic stuff like parameter recovery! I continually see papers published, with possible real-world implications, that have glaringly obvious modeling issues. That’s why I tend to push for a set of straightforward checks as a starting point, but they should be seen as just that - a minimum.

Sounds like this conversation is based on some very common misunderstandings about inference, calibration, and computation, from both the frequentist and Bayesian perspectives. I’ll try to deconvolve all of the basics here, but for any serious depth I recommend checking out

Frequentist Inference as Calibration: https://betanalpha.github.io/assets/case_studies/modeling_and_inference.html#2_frequentist_inference

Bayesian Calibration in General: https://betanalpha.github.io/assets/case_studies/modeling_and_inference.html#33_bayesian_calibration

Bayesian Algorithmic Calibration: https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#step_ten:_algorithmic_calibration

Bayesian Inferential Calibration: https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#301_step_eleven:_inferential_calibration

Inference vs Calibration vs Computation

Model-based inference is any method of quantifying which model configurations are consistent, for some definition of “consistent”, with observed data.

Model-based calibration quantifies the range of inferential outcomes that could arise within the scope of a given model. This can be used to ask “if my modeling assumptions are sufficient then will my inferences behave well enough?”

Computation concerns how inferences are implemented in practice, especially when inferences can be implemented only approximately. Testing computational faithfulness of an algorithmic implementation can be considered as a model-based calibration where the computed inferences are compared to the true inferences.

Frequentist Inference

Frequentist methods don’t really define inferences – instead they define a particular way of calibrating inferences (worst case expected loss given a loss function and model). Much of the practice of frequentist inference is engineering estimators that achieve certain calibration goals, at least in some circumstances (unbiasedness, coverage, etc).

The most common frequentist estimator is the maximum likelihood estimator (so much so that many confound frequentist inference with maximum likelihood estimation completely). Under certain technical conditions maximum likelihood estimators will achieve certain properties as the data size grows infinitely large.

Formally those technical conditions and asymptotic validity should be checked in every use of maximum likelihood estimation but usually they’re just taken for granted. Moreover, the checks that are made are often heuristic and have long since drifted away from their original goals!

For example, one of the requirements for maximum likelihood estimators to have the desired frequentist calibration properties is that the model is identified: https://betanalpha.github.io/assets/case_studies/identifiability.html#1_Identifiability. One consequence of identifiability is that with enough data the maximum likelihood estimator should be close to the true model configuration from which the data are generated. In other words, “parameter recovery” tests provide some sensitivity to the assumptions underlying ideal maximum likelihood estimator behavior, although they are by no means a sufficient condition!

Bayesian Inference

Bayesian inference separates inference from decision making and calibration, which is one of the reasons why it’s so useful in practice.

Inferences are constructed from an application of Bayes’ Theorem, resulting in a posterior distribution that quantifies the model configurations consistent with the observed data (and domain expertise encoded in the model) probabilistically. Formal inferences take the form of posterior expectation values – in other words we ask questions in the form of functions and the posterior answers with expectation values of those functions.

Because we’ve separated inference from calibration, in general there is no guarantee that a posterior will be close to the truth! This is true even if there is no misfit and the data are generated from one of the model configurations in the assumed model. Without any additional assumptions there is zero basis for “parameter recovery” tests of Bayesian inferences.

Bayesian calibration considers the range of behaviors of posterior distributions induced from prior predictive data. In other words we repeatedly simulate model configurations from our prior model and data from those model configurations (alternatively we simulate from the joint Bayesian distribution and project out the data) and then investigate the ensemble of resulting posterior distributions. The probabilistic nature of this calibration has two huge advantages over the worst-case frequentist calibration: it’s much easier to (approximately) implement in practice using simulations, and it focuses on the model configurations consistent with our prior model instead of more extreme model configurations.

Just as an arbitrary frequentist estimator may be useless in practice, so too may be a Bayesian posterior. Bayesian calibration can be used to test a given Bayesian model, including the experimental design inherent to the observational model and the domain expertise elicited in the prior model, to see if it results in posterior distributions that are sufficiently precise and accurate for the given inferential goals. Only in this context does parameter recovery become well-defined as a measure of Bayesian accuracy.

Finally there’s the question of computational faithfulness. In practice we can construct posterior expectation values, and hence inferences, only approximately. If the approximation error is too large then our recovered inferences will be too corrupt to be trustworthy. Even worse, when we can’t trust our computation then we can’t implement inferential calibration! For example, comparing corrupted posteriors to the truth does not in general constrain how the true posteriors will relate to the truth! Markov chain Monte Carlo diagnostics like divergences and Rhat can identify corruption for a single posterior fit, whereas Simulation-Based Calibration (really better named Simulation-Based Algorithmic Calibration) uses an entire ensemble of posterior fits induced from prior predictive data to identify problems.

To summarize: Bayesian inference decouples inference, computation, and calibration into separate steps. This allows us to better understand each individual step but it also requires us to be disciplined enough to implement and verify each step in our own analyses. Fortunately these steps all fit into a convenient sequential workflow.

Given a model we can study the faithfulness of our Bayesian computation within the scope of our model with algorithmic calibration. Once we trust our computation then we can study the range of posterior behaviors within the scope of our model with inferential calibration. If our inferences are expected to do well enough within the scope of our model assumptions then we can validate those model assumptions by comparing posterior retrodictions to the observed data; if we can’t resolve any serious discrepancies then we might assume that the model-based inferential calibrations are relevant to the behavior of posterior distributions realized from real data. For a much more detailed discussion see https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html.

I should also note that algorithmic calibration and inferential calibration are further confounded by the fact that inferential problems can often manifest as algorithmic problems in which case the most effective way to resolve algorithmic problems is to improve the model (and implicitly improve the inferential calibration). For more see https://betanalpha.github.io/assets/case_studies/identifiability.html.

Interpreting Bayesian Parameter Recovery

Parameter recovery is fundamentally an issue of inferential calibration. Good parameter recovery requires sufficiently informative observations as encoded in the likelihood function and/or domain expertise as encoded in the prior model.

If the computation is sufficiently faithful then poor parameter recovery indicates a model that isn’t sufficient for the inferential goals.

If the computational faithfulness hasn’t been checked then poor parameter recovery could be due to an insufficient model or to corruption by approximation error. There’s no way to decouple those two without checking the computation independently. In fact, if you don’t check the computational faithfulness then even good parameter recovery doesn’t necessarily mean that the model is good! There are many stories of corrupted computation implicitly regularizing posteriors and compensating for modeling problems – just search the Stan Forums for people complaining about their model fitting fine in older packages but Stan complaining about all kinds of issues!

Ultimately parameter recovery can be very helpful, but only when it is being interpreted correctly by all parties involved. Unfortunately, due to common pedagogical misunderstandings this is rarely the case, so it’s helpful to respond to requests by asking what the requester is interested in testing. Is it computational faithfulness or inferential calibration? Once the right context has been identified then the request can be handled in a principled way.*

*Okay, in an ideal world. I’ve also had to deal with “senior researchers” unaware of all of this (even when it’s carefully explained in the manuscript under review) and obstinately asking for the heuristics that have become conventional for them. So consider all of my advice in the vein of “what you should do” instead of “what will get a paper published/not upset politically powerful people”.


I agree with what Michael said. For some reason these issues are hard to convey to people.

My Ph.D. advisor Don Rubin told us that frequentism is a method of evaluating inferences, not a method of creating inferences, hence when speaking of “frequentist inferences,” you always have to say what frequentist inference you’re talking about. This is different from Bayesian inference which, given the model and data, is mathematically defined. I’ve made this point many times (for example here’s one I found in a quick google search: https://statmodeling.stat.columbia.edu/2018/03/24/economist-wrote-asking-make-sense-fit-bayesian-hierarchical-models-instead-frequentist-random-effects/).

One could make the argument that the openness of frequentist inference is a strength of the frequentist approach. I wrote about this here: https://statmodeling.stat.columbia.edu/2011/06/20/the_sampling_di_1/ in a post called “The pervasive twoishness of statistics.”

People have also argued (correctly, in my view) that there’s a fundamental commonality of Bayesian and frequentist approaches, in that the Bayesian prior plays the same role as the frequentist reference set: a distribution that we average over to make coverage statements. I feel like I put this in a paper at some point, but all I can find right now is this blog post entitled “Bayesians are frequentists”: https://statmodeling.stat.columbia.edu/2018/06/17/bayesians-are-frequentists/

So, yeah, there’s lots to be confused about!


I have been working on SBC for several months and there are concepts I have wished to see clarified. Based on the explanations from you and @andrewgelman, I have summarized and developed some thoughts (inferences to be checked and questions) on which I am looking forward to the Stan community’s feedback.

calibration notation

standards

Self-consistent: standard of how consistent the model is with itself
Env-consistent: standard of how consistent the model inference is with the observed data

methods

IC: inferential calibration

  • IC_self: calibration based on self-consistency standard. e.g. rank uniformity between generated and inferred parameter
  • IC_env: calibration based on env-consistency standard. e.g. posterior retrodictive/predictive check

AC: algorithmic calibration

results

Ideal posterior range (IPR): posterior range where output statistics satisfy the standards

Ideal model configuration (IMC): model configuration, as a combination of prior and likelihood, which has a many-to-one mapping to the ideal posterior range. Different (prior, likelihood) pairs could result in the same posterior range.

IMC(IC_env, IC_self, AC_self): Ideal posterior range that satisfies each calibration standard

Workflow

Inference

  1. Errors from insufficient model and corruption by approximation could be decoupled by comparing AC_self and IC_self rank uniformity
    As the rank uniformity test is the result of an ensemble of posterior distributions from multiple parameter recoveries, a separate parameter recovery comparison is not necessary. Differing rank uniformity results imply corruption by approximation.

  2. IPR(AC_self) = IPR(IC_self) - {posterior spaces where approximate model assumptions (e.g. unimodal, log-concave) does not hold}

  3. Calibration based on self-consistency is for IMC while env-consistency is for IPR.
    For example, two (prior, likelihood) pairs that correspond to the same posterior distribution have the same env-calibration results but different rank uniformity results.

  4. Based on the above, the following rough draft expresses the inclusion relation between ideal prior/likelihood range based on different calibrations and in the end, my opinion on each model’s usage.

Ideal prior range: self calibration < env calibration.

  • In IC_env, the effect of misspecified priors could be offset by the data. But in self-calibrations, the prior has a double effect: once for simulating y, then again for fitting the generated data. From my experiment, priors that reproduced the observed data failed a uniformity test.
  • There could be prior models that are self-consistent but not able to reproduce the observed data, but this could be solved by increasing the amount of data.

Ideal likelihood range: self-calibration ~ env calibration

  • When prior and likelihood are misspecified it would fail self-calibration rank test but might pass env_calibration test.
  • There could be likelihood models that are self-consistent but not able to reproduce the observed data.

Question

  1. justification for the relationship between IPR(IC_self) and IPR(IC_env)
    I observed that env-consistent models that explain observed data well tend to be more self-consistent (e.g. when prior mean is set as 0, many divergences → when updated to emp.mean from prior fit, no divergence). Could there be any reasonable justification for this?

  2. Using IMC(IC_env) as an initial point for self-consistency calibration
    Self-calibration cannot utilize outside information to determine its model configuration, but its results depend on the model configuration, especially on the prior. For example, I have encountered cases where problems with divergences and non-uniform ranks were solved by updating only the prior. Certain reference points, or at least an initial range for prior exploration, could boost the calibration efficiency. Is IMC(IC_env) apt to serve as an initial range? Considering SBC’s heavy computation, a pre-specified IMC from the env-consistency test could be helpful, if the two have some connection (returning to the previous question).

–
link to the diagram; feel free to edit for feedback, everyone! Thanks


His statement seems correct to me in the given context. I think the issue we’re discussing here is that Rhat is just a metric to assess the convergence/mixing of an algorithm, whereas the term “parameter recovery” can be used to describe a quality of the model itself. So these terms can be used in a way where they are completely unrelated. I think this is what he wanted to say. Rhat tells you something about whether the algorithm could fit your model to your data, whereas parameter recovery tells you whether your model gives you correct answers when fit to data. In order to have good inference you need both: a model with good parameter recovery and an algorithm that can fit that model to your dataset.

As JAGS isn’t using HMC, it is typical for some models to require very high iteration numbers. Without knowing the model and how many iterations it was run for in the paper, I can’t know whether they got a good sample from it.
And: JAGS can fit some types of models without problem where Stan throws divergences. So the divergences alone don’t necessarily indicate a problem (for JAGS) either. (Stan requires all model parameters to be continuous with a somewhat smooth posterior, while JAGS doesn’t require either of these things.)

I don’t want to say anyone is wrong here, just want to point out to be careful with what you read into the email answers mentioned in the opening post.

Thanks @betanalpha, @andrewgelman and @hyunji.moon! This was really helpful and I hope to be able to respond when I’ve had more time to process all of this information – I think I’m maybe still a bit confused as to how this applies to the specific situations I was discussing (particularly the second) but I need to think about it more before I can properly frame the question.

@Raoul-Kima, I just wanted to note that I think the problem was the second half of the statement, i.e. “the samples, once collapsed over iterations, should still be in the posterior range”. “In the posterior range” is not a very precise statement, but I think for any way of making it precise, it is not at all guaranteed to be true if your chains have severe autocorrelation (or are otherwise not sampling from the same distribution after warmup, which is closer to what Rhat actually measures, as @jonah said above, paraphrasing @andrewgelman).

It is true that JAGS does not check for divergences, because they are not defined for a Gibbs sampler. The fact that it doesn’t throw the error does not mean that it accurately fit the model without problem, simply that you don’t have the relevant diagnostics to diagnose the problem. There are many others in this thread who are more expert than I am at addressing this, though.

In this particular case, the example in the paper ran JAGS for 2000 iterations. Maybe incredibly, they even said in the paper that you should check for convergence with Rhat but did not report Rhats for their own simulation. When I ran the code exactly as provided in the paper, I was getting high Rhats even after 50,000 iterations which is why I reached out to the authors.


Good point!

Yes, but discrete and unsmooth posteriors can create problems with HMC while they are fine with JAGS. It’s not just that JAGS doesn’t throw warnings for them; they are actually safe to use there. I’m not saying that this is the case in the particular example we’re talking about here. It’s just a general remark that divergences with HMC don’t automatically mean that “lesser” samplers have a problem with the model, too.

Oh yeah, this is bad in the light of the diagnostics you provided.


Yeah this is true in some cases because there are models that are very much geared towards Gibbs samplers. In the vast majority of cases if there are divergences in Stan then the model is problematic for JAGS too, but it’s true that there are some special cases. That said, even in cases where JAGS does discrete sampling without errors, it doesn’t necessarily mean it has explored the posterior sufficiently. In many cases (not all) discrete parameters are hard even for algorithms that support discrete parameters. There are few (if any) diagnostics that will tell you when you’re in that situation though.


First let me comment about vocabulary. The word “calibration” is often used to denote two distinct but related concepts. On one hand it can be used to mean “quantify possible outcomes”, as in “change one’s expectations to match the possible outcomes”. On the other hand it can be used to mean “check if possible outcomes are compatible with preexisting expectations”.

Inferential calibration is an example of the former – we have no reason to expect any particular posterior behavior, so we’re just investigating the inferential consequences of our model. Algorithmic calibration is an example of the latter – we identify special circumstances where self-consistency conditions imply certain consequences and then check to see if our computed inferences are compatible with those outcomes.

I wouldn’t refer to these as calibrations because they condition on one particular event – the observed data – instead of considering all possible events.

I don’t think this is possible without going beyond SBC, for example by using self diagnostics like divergences to identify algorithmic problems and then assume that any residual SBC errors are entirely due to model approximation.

Self-consistency checks like SBC will work for any prior model that’s a proper probability distribution. Different prior models, however, can expand the scope of model configurations that are considered in SBC resulting in more extreme observations and nastier likelihood functions/posterior density functions that break algorithms or emphasize flaws in approximate posterior models.

Consider for example an observational model based on the Student-t family and an algorithm that is accurate only for normal density functions. When using a prior model that concentrates at large degrees of freedom, and hence very normal-like densities, the algorithm may pass all tests. When using a prior model that incorporates smaller degrees of freedom, however, the algorithm will start to fail and those failures will manifest in self-consistency checks like SBC.

This doesn’t sound like a self-consistent procedure. Once the model has been specified the data generating and fitting are fixed, so I’m not sure why you are trying to change the model here.

The problem is that you can’t implement an environment check, per your language if I’m understanding it correctly, if your computation isn’t accurate. That’s why my recommended workflow is to check the internal consistency of a given model first, and only then consider its fit to the data as discrepancies should be manifestations of misfit and only misfit.

Given the follow up sentence “So if you get high autocorrelation, that just means the chains have an expressed dependency, but the samples, once collapsed over iterations, should still be in the posterior range” I am comfortable saying that the author is wrong.

Firstly, there are the common confusions between the colloquial use of the terms “mixing” and “convergence” and the formal treatment of Markov chain Monte Carlo. Technically what matters is (1) whether or not a Markov chain Monte Carlo central limit theorem exists, (2) you’ve run long enough for the central limit theorem to kick in, and (3) once it’s kicked in you’ve run long enough to get sufficiently accurate posterior expectation estimators.

Autocorrelations for Markov chains are well-defined only if (1) and (2) are valid, in other words only once we’ve already reached (3). Although Rhat can be connected to autocorrelations (3), see for example [1812.09384] Revisiting the Gelman-Rubin Diagnostic, its real utility comes as a diagnostic for failures of (1) and (2). In other words, in most cases a large Rhat definitely implies that the samples do not fall into “the posterior range”.

As I wrote above it is important to separate out algorithmic calibration from inferential calibration, but at the same time both of those steps have to be treated correctly.

Just to clarify – parameter recovery is a question of inferential calibration, not algorithmic behavior. It applies when you want to ask “what is the scope of possible inferential outcomes provided that my model is correct?”, which is ultimately largely a question of experimental design and complementary prior modeling. It’s an important question but one that needs to be decoupled from questions of algorithmic behavior.

This cannot be repeated enough times!

“Safe” is not a well-defined word in Markov chain Monte Carlo.

The Gibbs samplers in JAGS admit discrete and non smooth target densities in the sense that they are asymptotically consistent – after an infinite number of iterations the Markov chain Monte Carlo estimators will converge to the true target expectations. But that asymptotic consistency means nothing for the behavior of the estimators after only a finite number of iterations.

The challenge in applying Markov chain Monte Carlo in practice is quantifying the finite-time, i.e. the practical, errors. This is most feasible with a central limit theorem construction, as I discussed above, which is incredibly challenging due to the lack of sufficient conditions; instead we are limited to a few necessary conditions and diagnostics that identify violations of those conditions.

The fact that Stan keeps identifying problems in model fits from Gibbs samplers that people had taken for granted demonstrates just how weak of a condition asymptotic consistency is by itself. For most Gibbs samplers the only available diagnostic is Rhat which is far better than nothing but is often only weakly sensitive to problems in practice.

The implementation of Hamiltonian Monte Carlo in Stan is also asymptotically consistent even when there are divergences. Instead, divergences are sensitive diagnostics of the failure of finite-iteration convergence, which is why they are so powerful in practice.

So yes technically a Gibbs sampler could perform okay for target densities that Stan cannot handle, but the user is still responsible for verifying the finite time behavior of the Gibbs sampler before making any claim about its utility in a given application. When the target density is amenable to both Gibbs sampling and HMC we have empirically not seen any nontrivial cases where the Gibbs sampler achieves good finite iteration behavior while HMC does not.

Basically what @jonah said.


Hi, thanks for posting this. I’ve been thinking about convergence (rhat) and parameter recovery as well. In your first case, the paper has good parameter recovery but bad rhat – what would you suggest they do then?

Rstanlearner:

We like to refer to Rhat as measuring “mixing” rather than “convergence” just because literally what Rhat does is compare the distribution of the mixed chains to the distributions of the original chains.

I don’t see how a paper can have good parameter recovery but bad Rhat. If it has bad Rhat, it means that the chains are not stable or that different chains are converging to different places, in which case some of the chains may be doing good parameter recovery but some are not.


Thanks for your explanation! I shouldn’t say bad Rhat and good parameter recovery – rather, a less-than-perfect convergence rate (below 100%) with good parameter recovery. Any ideas on what that indicates and how to improve the convergence rate?