The plots shown in Figure 9 compare the density of the computed LOO-PITs (thick
dark line) versus 100 simulated datasets from a standard uniform distribution (thin light
lines). We can see that, although there is some clear miscalibration in all cases, the
hierarchical models are an improvement over the single-level model.
The shape of the miscalibration in Figure 9 is also meaningful. The frown shapes
exhibited by Models 2 and 3 indicate that the univariate predictive distributions are too broad compared to the data, which suggests that further modeling will be necessary to
accurately reflect the uncertainty.
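To see why a frown shape points to predictive distributions that are too broad, here is a minimal simulated sketch in base R. It is not taken from the paper: the normal data, the sample size, and the helper `mid_share` are all assumptions made up for illustration, and it uses plain PIT values from a known predictive distribution rather than LOO-PIT values from a fitted model.

```r
# Simulated illustration (not the paper's models): PIT values for a
# calibrated versus an overly broad predictive distribution.
set.seed(1)
n <- 10000
y <- rnorm(n, mean = 0, sd = 1)               # "observed" data, true sd = 1

pit_calibrated <- pnorm(y, mean = 0, sd = 1)  # predictive matches the data
pit_too_broad  <- pnorm(y, mean = 0, sd = 2)  # predictive is twice as wide

# Share of PIT values falling in the middle half of the unit interval.
# Uniform PIT values put about half their mass there; a frown shape puts more.
mid_share <- function(pit) mean(pit > 0.25 & pit < 0.75)
mid_calibrated <- mid_share(pit_calibrated)
mid_too_broad  <- mid_share(pit_too_broad)
print(c(calibrated = mid_calibrated, too_broad = mid_too_broad))

# Optional visual check: a flat histogram versus a hump in the middle.
# hist(pit_calibrated, breaks = 20); hist(pit_too_broad, breaks = 20)
```

When the predictive distribution is too wide, the observations rarely land in its tails, so the PIT values pile up around 0.5 and the density curve bends into a frown. A predictive distribution that is too narrow would do the opposite and produce a smile.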
The axes of these figures are not labeled. Why did you choose to do that? Is Y the density and X the quantiles? I just anticipate that people would yell at me if I tried to use a figure without labels.
What would this figure look like if your model is really good?
How would you read the following figure? That is, what conclusions, if any, can you make from it? Are there any conclusions that you have seen people make from it but that they should not?
The y-axis text is off by default because we wanted to emphasize comparing the shapes rather than the numbers. There’s not too much to infer just from the y-axis numbers for a plot like this, but if you want to add them to the plot so you don’t get yelled at, you can just do this:
ppc_loo_pit_overlay(...) + yaxis_text()
If the model is really good, the darker and thicker curve doesn’t stand out from the thin ones (aside from being darker and thicker). It’s basically the same idea as for ppc_dens_overlay: the dark curve should look like a plausible member of the set of light curves. I recommend making some examples for yourself by simulating data, fitting models, and making the plots (i.e., make some plots for cases where you know the correct model). Here’s one for a linear regression model that I just fit using a simulated dataset of 1000 observations:
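For anyone who wants to try that kind of exercise without running MCMC, here is a minimal sketch in base R. It is not the code behind the plot mentioned above, and it swaps in a different technique: instead of the PSIS-LOO approximation that loo and bayesplot use, it relies on the fact that for a normal linear regression with a flat prior the leave-one-out predictive distribution is a Student t, so the LOO-PIT values can be computed exactly from the externally studentized residuals. The coefficients, noise level, and seed are made up for illustration.

```r
# Sketch: exact leave-one-out PIT values for an ordinary linear regression.
# Assumption: flat-prior normal linear model, so no MCMC or PSIS is needed;
# this is a stand-in for the LOO-PIT that loo/bayesplot would approximate.
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)   # data simulated from the model we fit
fit <- lm(y ~ x)

# The externally studentized residual compares y[i] with the prediction from
# a fit that excludes observation i. Under a correct model it follows a
# t distribution with n - p - 1 degrees of freedom.
p <- length(coef(fit))
pit <- pt(rstudent(fit), df = n - p - 1)

# Thin lines: densities of 100 uniform datasets; thick line: the LOO-PIT values.
plot(NULL, xlim = c(0, 1), ylim = c(0, 1.6), xlab = "PIT", ylab = "density")
for (s in 1:100) lines(density(runif(n), from = 0, to = 1), col = "grey70")
lines(density(pit, from = 0, to = 1), lwd = 3)
```

Because the data were simulated from the same model that is fit, the thick line should wander around inside the band of thin lines. Changing the simulation so that the fitted model is wrong (for example, drawing the noise from a heavy-tailed distribution) is a quick way to see what a real problem looks like. Recent versions of bayesplot document a `pit` argument for `ppc_loo_pit_overlay`, so the precomputed values could probably be passed to it directly, but I have not checked that against every version.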
The one in the image you shared looks decent. These plots can be somewhat noisy (you’ll see this if you try my recommendation of doing simulations) so you don’t want to read too much into every little discrepancy, but rather try to identify serious problems. In the ones from the paper there is a clear pattern to the miscalibration (the frown shapes), whereas in this one you posted the dark blue line seems like it could plausibly be one of the thin blue lines. Maybe there’s a bit of an issue around 0.3, but at first glance maybe not so severe that it’s not plausible (see, e.g., the dip near 0.7 in my example above). You can also use the optional samples argument (default is 100) to add more thin lines, which is sometimes useful, but 100 is often plenty.
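To get a feel for how noisy these curves are, here is a rough base R sketch that measures how much the thin lines themselves wobble. The helper `grid_point_spread` and its arguments are hypothetical names made up for this illustration, and it uses R's default kernel density settings rather than whatever bayesplot applies internally.

```r
# Sketch: how much does a density estimate of truly uniform values wobble?
set.seed(42)
grid_point_spread <- function(n, samples = 100) {
  # Kernel density of `samples` uniform datasets of size n, evaluated at 0.5.
  at_half <- replicate(samples, {
    d <- density(runif(n), from = 0, to = 1, n = 101)
    d$y[51]                            # 51st of 101 grid points is x = 0.5
  })
  sd(at_half)                          # spread of the thin lines at x = 0.5
}
spread_100  <- grid_point_spread(100)
spread_1000 <- grid_point_spread(1000)
print(c(n100 = spread_100, n1000 = spread_1000))
```

The spread shrinks as the number of observations grows, so a dip that would be unremarkable with 100 observations can be worth a second look with several thousand. That is the reason to judge the dark line against the thin lines rather than against a perfectly flat line at 1.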