Loo: High Pareto k diagnostic values for beta binomial regression

I am trying to use loo to see if I could use a linear instead of a logit link function for a beta-binomial regression.

However, if I simulate data consistent with the generative model in the Stan code, I can recover the regression weights, but consistently get this warning message when using loo

Some Pareto k diagnostic values are too high. See help(‘pareto-k-diagnostic’) for details.

because all Pareto k diagnostic values are between 1 and Inf.

The code to reproduce this is below.

I would be grateful if somebody could have a look and explain what is going on (or point out an error in my code/thinking).

Simulate data in R:

library(boot)
library(VGAM)
library(rstan)
library(loo)
library(HDInterval)
options(mc.cores = 4)

set.seed(1234)

N = 100
K = 4
phi = 50
X = matrix(rnorm(N*K),ncol = K)
b = matrix(rnorm(K),ncol = 1)/2
p = inv.logit(-.8 + X %*% b)
Y = rbetabinom.ab(N,size = 22, shape1 = p*phi, shape2 = (1-p)*phi)
hist(Y,breaks = (0:23)-.5)
standata = list(N = N,
                K = K,
                X = X,
                Y = Y)

Stan model (logit link):

data { 
  int<lower=1> N;  // number of observations 
  int Y[N];        // response variable 
  int<lower=1> K;  // number of predictors
  matrix[N, K] X;  // design matrix
}

parameters { 
  vector[K] b;
  real<lower=0, upper=1> theta;
  real alpha;
}

transformed parameters {
  real<lower = 0> phi = (1-theta)/theta;
  vector[N] p = inv_logit(X * b + alpha);
}

model { 
  b ~ normal(0,2);
  alpha ~ normal(0,2);
  target += beta_binomial_lpmf(Y | 22, p * phi, (1-p) * phi);
} 

generated quantities {
  int y_rep[N];
  real log_lik[N];
  for (n in 1:N) {
    y_rep[n] = beta_binomial_rng(22, p[n] * phi, (1-p[n]) * phi);
    log_lik[n] = beta_binomial_lpmf(Y | 22, p[n] * phi, (1-p[n]) * phi);
  }
}

Sample, compare true and estimated b, try loo

sm_logit = stan_model("bebi_logit.stan")
sf_logit = sampling(sm_logit,
                    data = standata,
                    seed = 123)

## plot b with 90% HDIs of estimated b ##
b_hdis = hdi(extract(sf_logit,"b")$b,
             credMass = .90)
plot(0, type = "n",
     ylim = c(range(b_hdis)),
     xlim = c(1,4),
     xaxt = "n",ylab = "", xlab = "")
axis(1,1:4)
segments(x0 = 1:4,
         y0 = b_hdis[1,],
         y1 = b_hdis[,2])
points(1:4,b, pch = 16, cex = 2, col = "green4")

loo(extract_log_lik(sf_logit))

should be

log_lik[n] = beta_binomial_lpmf(Y[n] | 22, p[n] * phi, (1-p[n]) * phi);

The way I figured where the problem is that I checked the log_lik values. For discrete model with uniform probability for 22 categories lpmf would be \log(1/22)\approx -3.1, but lpmf’s from your code were around -461, so the were clear two orders of magnitude error in log scale! With the fixed code mean lpmf is about -2.3>-3.1, ie the model is beating the uniform distribution. loo works fine.

1 Like

Thanks!