Checking calibration visually for a categorical multilevel model

Hi. I’m trying to find a sensible way to visually assess the goodness of fit, or calibration, of a categorical multilevel model with 4 response categories and 53 population-level parameters.

Would it work if we just generalized the procedure seen on p. 253 of Gelman, Hill, and Vehtari (2020) to categorical outcomes in a simplistic way, i.e. by using the exact same procedure but repeating it for each response category in turn, and also for both the posterior fit and the LOO fit?

That’s what I do below, using 30 bins per plot because that size yields ~74 observations per bin (the book example has 75). For each bin, the shaded uncertainty region is calculated simply as \pm 1.96\times\sqrt{\frac{\hat{p}(1-\hat{p})}{\text{binsize}}}. The result looks as below, with uncorrected posteriors compared to observed proportions on the left and LOO posteriors compared to them on the right.
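For what it's worth, the per-category binning can be sketched roughly like this. This is only an illustrative sketch, not the code behind the plots above: `p_hat` and `y` are assumed names for one category's predicted probabilities and its 0/1 observed indicator, and you would repeat the call for each of the 4 categories and for both the uncorrected and the LOO-based expected values.

```python
# Illustrative sketch of a binned calibration summary for ONE response
# category. `p_hat` = predicted probabilities for that category,
# `y` = 0/1 indicator that the category was observed (names assumed).
import numpy as np

def binned_calibration(p_hat, y, n_bins=30):
    """Per-bin mean predicted probability, observed proportion, and 95% half-width."""
    order = np.argsort(p_hat)                  # sort observations by prediction
    bins_p = np.array_split(p_hat[order], n_bins)
    bins_y = np.array_split(y[order], n_bins)
    pred = np.array([b.mean() for b in bins_p])   # mean predicted prob per bin
    obs = np.array([b.mean() for b in bins_y])    # observed proportion per bin
    size = np.array([len(b) for b in bins_y])
    half = 1.96 * np.sqrt(pred * (1 - pred) / size)  # +/-1.96*sqrt(p(1-p)/binsize)
    return pred, obs, half
```

Plotting `obs` against `pred` with the `pred +/- half` band then mirrors the book's figure, once per category.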

The plots seem at least somewhat useful on first glance. Does this approach make sense? Or is there a better one?

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories (Analytical Methods for Social Research). Cambridge: Cambridge University Press. doi:10.1017/9781139161879

Sorry to be ignorant here, but what do you mean by the LOO fit here?

Is there a reason you want to do this visually rather than just reporting LOO’s approximate expected log predictive density (ELPD)?

You can do this or you can aggregate them in some way. So it’s just like chi-square tests or posterior predictive checks—you can create your own bins any way that makes sense.

For example, when I was looking at dentists rating X-rays for caries (a pre-cavity), there were 5 dentists rating each of 3K X-rays as positive or negative. For goodness of fit I looked at the number of X-rays where 0 dentists, 1 dentist, 2 dentists, etc. said they had cavities, rather than either looking at marginals per rater or looking at all 32 possible discrete outcomes. I did this because I expected there to be correlations among the raters’ responses that weren’t captured by the simple Dawid and Skene model I was using, and the discrepancies were really clear when plotting this way. The problem is that some marginals may look fine while others look bad.
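The aggregation described above is just a tally of items by how many raters said positive. A minimal sketch, assuming `ratings` is an (items × raters) 0/1 array (the name and layout are illustrative, not from the original analysis):

```python
# Tally X-rays by how many of the raters marked them positive,
# i.e. counts for 0, 1, ..., n_raters positives.
# `ratings`: (n_items, n_raters) array of 0/1 calls (assumed layout).
import numpy as np

def positive_count_histogram(ratings):
    """Number of items with 0, 1, ..., n_raters positive ratings."""
    n_raters = ratings.shape[1]
    positives_per_item = ratings.sum(axis=1)
    return np.bincount(positives_per_item, minlength=n_raters + 1)
```

Comparing this observed histogram with the same tally computed from posterior predictive draws is the check described above.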


I mean forming the probability bins on the basis of expected values calculated from the LOO posterior rather than from the raw, uncorrected posterior. I could be wrong, but it seems to me that just as R² metrics are often calculated for both the uncorrected and the LOO posterior, it makes sense to do the same when evaluating the model fit visually, e.g. in order to detect serious overfitting.

Isn’t it a different question whether the model “fits” i.e. represents a plausible approximation of the data-generating process in the sample at hand, as opposed to how high the model’s predictive power is (as measured by elpd_loo or other metrics derived from the LOO posterior)? That’s certainly been my impression for several years now.

Sounds to me like you have an ordinal response there. I’m not that lucky.

Thank you!