Understanding LOO-PIT graphical diagnostics

This question is primarily about the interpretation of diagnostics.

Here’s my situation: I fit a logistic regression model to predict the outcome of a binary event. The model fits without issue and I’m currently in the process of doing model checking.

Part of my model checking has included using arviz to create loo_pit plots. The first is the default and the second is “the difference between the LOO-PIT Empirical Cumulative Distribution Function (ECDF) and the uniform CDF”.

Here are the plots:


In this first plot, I feel like the fit is mostly okay with an issue around 0.3.

The ECDF plot is below:


Again, I believe I see the issues around 0.3, but here I’m not quite sure what the plot straying out of the credible interval around 0.7-0.8 tells me. The first plot looks fine in that region.

My questions are as follows:

  1. In the first plot, what kind of conclusions can I draw from the visual deviations from the Uniform distribution other than “my model has some issues in certain areas”?
    a) I’ve taken a look at posts such as this as well as reading the relevant section in BDA3 to get a better idea of what I’m looking at, but I still seem to not quite know how to interpret these plots.
  2. How does the second graph differ in the information it’s giving me compared to the first? I feel like I’m seeing some contradictory information with the second graph straying out of the 94% credible interval around 0.7-0.8, but that region looks perfectly in line in the first graph.
  3. If you were to see these diagnostic graphs, what would your next step in finding issues be? I certainly have some posterior predictive checks in mind, but what I’m interested in is: do these LOO-PIT graphs inspire particular posterior predictive checks?

I apologize if the questions aren’t particularly informed, LOO-PIT is new to me. Appreciate any responses!


I don’t know how to interpret these plots as “indicating problems in certain areas”, only how to get general trends like over/under dispersion and bias (this last one missing from the blog post you mentioned, but I do have some examples at my blog in a similar blogpost). Moreover, the ecdf difference plots are more clear to me in general. In this particular case, it looks like the model has some slight bias and under-dispersion at the same time.

I therefore get the same info from both plots, but as I said, I find the ecdf diff easier to interpret.

Also worth noting: the current implementation of the confidence bands follows this appendix from the rank rhat and ess paper, which does not take the dependencies between quantiles into account. It will be updated during the summer to follow [2103.10522] Graphical Test for Discrete Uniformity and its Applications in Goodness of Fit Evaluation and Multiple Sample Comparison instead. That being said, it looks like you have a lot of observations, so the result should not change too much, I think (I still haven’t played much with the differences between envelopes).

I would recommend both posterior predictive checks (I can’t really recommend anything in particular without also seeing the model, though) and looking at the pointwise elpd values and khat diagnostics.


@OriolAbril, thank you for sharing your blog post! It was a helpful read.

My language “indicating problems in certain areas” was pretty crummy, my apologies.

As to your conclusions to draw from the plots – I can see where one would make the conclusion that it appears the model has slight bias, the ECDF difference looks broadly similar to the one in the “Gaussian: Biased model” figure of your blog post (albeit mostly within the credible interval). As to underdispersion, what features of the graph are you using to make that conclusion?

Appreciate the advice on further checks. Some additional LOO diagnostics are:

Computed from 4000 by 4877 log-likelihood matrix

         Estimate       SE
elpd_loo -3313.70    11.44
p_loo       67.76        -

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)     4877  100.0%
 (0.5, 0.7]   (ok)          0    0.0%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

There are certainly areas where I need to improve the model – I think devising some targeted posterior predictive checks based on where it seems to be performing sub-optimally is my best bet for going forward.

For completeness, I’ll include my model code, which is a Bradley-Terry like model for a team game:

data {
    int<lower=1> N; // num games
    int<lower=1> J; // num players
    int<lower=1,upper=J> X[N, 6]; // player indices
    int<lower=0,upper=1> Y[N]; // game outcome, 1 => first team wins
}
parameters {
    vector<lower=0>[J] beta_raw; // player coeffs
    // hyperparameters
    real<lower=0> mu;
    real<lower=0> sigma;
}
transformed parameters {
    vector<lower=0>[2] team_coefficients[N];
    vector[N] game_coefficients;
    vector[J] beta = mu + sigma*beta_raw;
    for (n in 1:N) {
        team_coefficients[n,1] = sum(beta[X[n,1:3]]);
        team_coefficients[n,2] = sum(beta[X[n,4:6]]);
        game_coefficients[n] = team_coefficients[n,1] - team_coefficients[n,2];
    }
}
model {
    sigma ~ exponential(1);
    mu ~ normal(1, 1);
    beta_raw ~ std_normal();
    Y ~ bernoulli_logit(game_coefficients);
}
generated quantities {
    vector[N] log_likelihood;
    int y_hat[N];
    for (n in 1:N) {
        log_likelihood[n] = bernoulli_logit_lpmf(Y[n] | game_coefficients[n]);
        y_hat[n] = bernoulli_logit_rng(game_coefficients[n]);
    }
}

Probability integral transformation (PIT) is not good for binary data. I think ArviZ should give a warning about that. The current version just smooths the problem under the carpet. See instead Section 4.4 Calibration of predictions in Bayesian Logistic Regression with rstanarm
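
To make the point about binary data concrete, here is a minimal simulation sketch. It assumes the plain (non-randomized, non-smoothed) PIT value F(y) = P(y_rep <= y), and it is not what ArviZ computes internally. The predicted probabilities `p` are simulated and perfectly calibrated by construction, yet the PIT values are far from uniform because half of them pile up at exactly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical, perfectly calibrated model: the predicted P(y = 1)
# is the same probability that generates the outcome.
p = rng.uniform(0.0, 1.0, size=n)
y = rng.binomial(1, p)

# PIT for a Bernoulli predictive distribution, F(y) = P(y_rep <= y):
#   y = 0 -> 1 - p,   y = 1 -> 1
pit = np.where(y == 1, 1.0, 1.0 - p)

# Share of PIT values sitting exactly at 1 (a point mass, not a density)
frac_at_one = np.mean(pit == 1.0)

# Largest gap between the PIT ECDF and the uniform CDF on a coarse grid
grid = np.linspace(0.0, 1.0, 11)
ecdf = np.array([np.mean(pit <= t) for t in grid])
max_dev = np.max(np.abs(ecdf - grid))

print(f"fraction of PIT values equal to 1: {frac_at_one:.2f}")
print(f"max |ECDF - uniform| on grid: {max_dev:.2f}")
```

So a deviation from uniformity in this setting says little about calibration, which is why a calibration plot is the more direct check for binary outcomes.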


If there were only bias, the ecdf difference would be different from 0 already at very low quantiles, as happens in the example plots in the blogpost. If there are multiple effects at hand (as will happen in most real scenarios), then what we see is a combination of the effects. “Combining” the ecdf-diff “ᴎ” shape characteristic of underdispersion with the ecdf-diff “∩” shape characteristic of bias (not sure right now if for positive or negative bias, please check) explains better the shape you see, in the same way that the last plot in my blogpost is a “combination” of an “N” and a “U”.
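
The signatures of bias and under-dispersion described above can be checked with a small simulation for a continuous outcome. This is only a sketch under stated assumptions: the data are drawn from N(0, 1), the three predictive distributions are hypothetical Gaussians chosen for illustration, and plain PIT values are used rather than LOO-PIT. The sign convention is ECDF minus uniform CDF.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=5000)  # "observed" data, truly N(0, 1)

# Standard normal CDF built from math.erf to avoid extra dependencies
norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def ecdf_diff(pit, grid):
    """ECDF of the PIT values minus the uniform CDF, evaluated on grid."""
    return np.array([np.mean(pit <= t) for t in grid]) - grid

grid = np.linspace(0.05, 0.95, 19)  # grid[4] = 0.25, grid[14] = 0.75

# Hypothetical predictive distributions: (mean, sd)
scenarios = {
    "calibrated": (0.0, 1.0),
    "biased": (0.5, 1.0),          # predictive mean above the data
    "underdispersed": (0.0, 0.5),  # predictive sd too small
}

diffs = {}
for name, (mu, sd) in scenarios.items():
    pit = norm_cdf((y - mu) / sd)
    diffs[name] = ecdf_diff(pit, grid)
    print(name, np.round(diffs[name][[4, 9, 14]], 3))
```

In this setup, a predictive mean above the data gives a difference that is positive over the whole grid, while a predictive standard deviation that is too small gives a difference that is positive below 0.5 and negative above it. Flipping the sign of the bias flips the first curve.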

I have opened an issue to tackle both documentation and code issues with az.plot_loo_pit. I am not sure how useful loo_pit can be for binary data, even with smoothing but I think that the general question about understanding and interpreting loo pit graphical diagnostics is worth answering even if it may not be applicable to your specific situation.

In addition to the calibration plots that @avehtari linked to (not yet present in ArviZ) there is also the “separation plot”, specific for binary data. See arviz.plot_separation — ArviZ dev documentation and the reference there.


@avehtari thank you! I had made a similar calibration plot that indicated an under-confident model and was trying to figure out how to square the PIT vs that calibration plot. Appreciate the clarification.


All the points in that issue you opened seem like good ideas to me.

Would it be worthwhile to consider adding calibration plots to ArviZ, similar to those referenced by @avehtari? It could be a useful expansion of diagnostics for models of binary data. In my case, I reached for one implemented by sklearn, but that is a bit limited.
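
For reference, a binned calibration curve is easy to sketch by hand with NumPy. The function name `binned_calibration` and the simulated data below are hypothetical, and this is neither the sklearn implementation nor a planned ArviZ API. The under-confident model is mimicked by shrinking the true probabilities toward 0.5.

```python
import numpy as np

def binned_calibration(y, p, n_bins=10):
    """Mean predicted probability and observed frequency per equal-width bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(p, edges[1:-1])  # bin index in 0 .. n_bins - 1
    mean_pred, obs_freq, counts = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # skip empty bins
            mean_pred.append(p[mask].mean())
            obs_freq.append(y[mask].mean())
            counts.append(int(mask.sum()))
    return np.array(mean_pred), np.array(obs_freq), np.array(counts)

rng = np.random.default_rng(2)
p_true = rng.uniform(0.0, 1.0, size=20000)
y = rng.binomial(1, p_true)

# Calibrated predictions: observed frequencies track the predictions
pred_cal, obs_cal, _ = binned_calibration(y, p_true)

# Under-confident predictions: probabilities shrunk toward 0.5
p_under = 0.5 + 0.5 * (p_true - 0.5)
pred_under, obs_under, _ = binned_calibration(y, p_under)

print(np.round(np.abs(obs_cal - pred_cal).max(), 3))
print(np.round(pred_under, 2), np.round(obs_under, 2))
```

For the under-confident predictions, the observed frequency exceeds the prediction in the highest bins and falls below it in the lowest ones, which is the pattern a calibration plot makes visible directly.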


Definitely, it’s already on our roadmap (ArviZ 2021 roadmap · arviz-devs/arviz Wiki · GitHub), and if time permits it will be added this summer as part of one of our GSoC projects.
