Two different answers from ppc_loo_pit_qq

When using rstanarm with the roaches dataset I get two different charts of the goodness of fit of a model. The model - which is taken from Aki Vehtari’s notebook on cross-validation model for model selection is:

stan_glmnb <- stan_glm(y ~ sqrt_roach1 + treatment + senior, offset = log(exposure2),
                      data = roaches, family = neg_binomial_2, 
                      prior = normal(0,2.5), prior_intercept = normal(0,5),
                      chains = 4, cores = 1, seed = 1704009, refresh=0)
# create loo object
loo_glmnb <- loo(stan_glmnb, save_psis = TRUE)

Now following exactly the notebook we have:

loo_pit <- function(y, yrep, lw) {
  exp_log_sum_exp <- function(x) {
    m <- suppressWarnings(max(x))
    exp(m + log(sum(exp(x - m))))
  }
  pit <- vapply(seq_len(ncol(yrep)), function(j) {
    sel_min <- yrep[, j] < y[j]
    pit_min <- exp_log_sum_exp(lw[sel_min,j])
    sel_sup <- yrep[, j] == y[j]
    pit_sup <- pit_min + exp_log_sum_exp(lw[sel_sup,j])
    runif(1, pit_min, pit_sup)
  }, FUN.VALUE = 1)
  pmax(pmin(pit, 1), 0)
}

ppc_loo_pit_qq(pit=loo_pit(y = roaches$y, yrep = posterior_predict(stan_glmnb),
                           lw=weights(loo_glmnb$psis_object))) + 
  geom_abline() +
  ylim(c(0,1))

This results in the following figure that shows a good fit:

The documentation of `ppc_loo_pit_qq states:

For ppc_loo_pit_overlay() and ppc_loo_pit_qq(), optionally a vector of precomputed PIT values that can be specified instead of y, yrep, and lw (these are all ignored if pit is specified). If not specified the PIT values are computed internally before plotting

So I thought that I could avoid the included function by specifying the alternative arguments to the Bayesplot function and let it do the work (to hide away what you do not understand)

ppc_loo_pit_qq(
  y = roaches$y,
  yrep = posterior_predict(stan_glmnb),
  lw =  weights(loo_glmnb$psis_object),
  pit = NULL,
  compare = "uniform"
) + 
  geom_abline() + 
  ylim(c(0,1))

But this plots as:

So presumably the loo_pit function prepares the pit vector differently to what the unassisted Bayesplot function does. I expect I have missed something here but I have no idea what.

I haven’t had a chance to look into this thoroughly but I have a guess:

I think Aki’s notebook is using an updated version of the PIT calculation that’s better specifically for discrete data like yours.

bayesplot uses the loo_pit() function from our rstantools package, which has been updated on GitHub by @TeemuSailynoja to use the improved version, but it hasn’t been released on CRAN yet (that’s mostly my fault – I’ve been working on other packages and projects and I honestly forgot we had unreleased updates for rstantools). For now, if you want to install it from GitHub you can use:

# install.packages("remotes")
remotes::install_github("stan-dev/rstantools")

You’ll probably then need to restart your R session in order for bayesplot to use that version of rstantools that you just installed from GitHub.

If you try that let me know if that was indeed the issue.

1 Like

Yes, that fixes it. Thanks.

2 Likes