Obtain LOO by refitting for high Pareto k values

Hi all!

When computing LOO for my rstan fit using the loo function, I get a few Pareto k values that are larger than 0.7.

I read somewhere that it is possible to re-run the model, removing the observations corresponding to the large Pareto k values one at a time. However, I have not found how to do this exactly. How do I obtain the elpd values for these bad observations? And how do I then combine the trustworthy elpd values from loo with the manually obtained ones?

Thanks!


I took this approach for a recent paper. I'm not really a fan; it feels a bit like a hack, and often, after you remove the worst points and refit the model, you end up with a new set of problematic points.

I implemented this as a for/while loop in R. Essentially, after fitting the model, I checked whether there were any points with Pareto k > 0.7. If there were, I removed those points and refit the model.

You can extract the Pareto k values (here with a CmdStanR fit `m`, a data frame `d`, and dplyr loaded) using:

 mloo <- m$loo(moment_match = TRUE)
 high_loo <- bind_rows(high_loo, d %>% filter(mloo$diagnostics$pareto_k > 0.7))
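A fuller version of that remove-and-refit loop might look roughly like this. This is a sketch, not the poster's actual code: `mod` (a CmdStanR model object), `d` (the data frame), and the helper `make_standata()` are illustrative names.

```r
# Sketch of the remove-and-refit loop. Assumes a CmdStanR model `mod`, a data
# frame `d`, and a hypothetical helper make_standata() that turns the data
# frame into the list Stan expects.
high_loo <- d[0, ]                       # empty data frame collecting removed points
repeat {
  m <- mod$sample(data = make_standata(d))
  mloo <- m$loo(moment_match = TRUE)
  bad <- mloo$diagnostics$pareto_k > 0.7
  if (!any(bad)) break                   # stop once no point exceeds 0.7
  high_loo <- rbind(high_loo, d[bad, ])  # remember what was dropped
  d <- d[!bad, ]                         # remove the offending points and refit
}
```

Note that each refit can flag new points, so the loop is not guaranteed to terminate quickly; that is part of why this feels like a hack.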


Hi, thanks for the fast reply!

I think I haven’t explained clearly what I mean.

I don’t want to remove outliers. What I rather want to do is the following:

LOO performs leave-one-out cross-validation, with elpd_{loo}=\sum_{i=1}^n \log p(y_i|y_{-i}).
That is, LOO obtains elpd_i = \log p(y_i|y_{-i}) for each observation. However, we cannot trust these values for those i for which the Pareto k value is high.

Hence, my idea was to keep the elpd_i values from LOO that can be trusted, and to obtain the others manually by removing observation i, fitting the model, and estimating elpd_i.
I have found out how to do this now.
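In code, the combination step amounts to overwriting the untrusted pointwise values. A sketch, with illustrative names: `loo_res` is the loo object from the full fit and `elpd_manual` holds the brute-force values for the flagged observations.

```r
# Combine trusted PSIS-LOO values with manually computed ones (sketch).
elpd_i <- loo_res$pointwise[, "elpd_loo"]       # elpd_i for all observations
bad <- which(loo_res$diagnostics$pareto_k > 0.7)
elpd_i[bad] <- elpd_manual[bad]                 # overwrite untrusted values
elpd_loo <- sum(elpd_i)                         # combined elpd_loo estimate
```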

Ah, that sounds useful. How did you do it?


I tried it with the example from "Writing Stan programs for use with the loo package" and it seems to work.

I first use the loo function and then also obtain the elpd_i values manually. In this case, all Pareto k values are small, so we can compare whether the two approaches give the same results.

# Define model
logisticstan <- "
data {
  int<lower=0> N;                   // number of data points
  int<lower=0> P;                   // number of predictors (including intercept)
  matrix[N,P] X;                    // predictors (including 1s for intercept)
  array[N] int<lower=0,upper=1> y;  // binary outcome
}
parameters {
  vector[P] beta;
}
model {
  beta ~ normal(0, 1);
  y ~ bernoulli_logit(X * beta);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    log_lik[n] = bernoulli_logit_lpmf(y[n] | X[n] * beta);
  }
}"

cat(logisticstan, file="logistic.stan")

library("rstan")
library("loo")   # for extract_log_lik(), relative_eff(), loo()

# Prepare data 
url <- "http://stat.columbia.edu/~gelman/arm/examples/arsenic/wells.dat"
wells <- read.table(url)
wells = wells[1:50,]
wells$dist100 <- with(wells, dist / 100)
X <- model.matrix(~ dist100 + arsenic, wells)
standata <- list(y = wells$switch, X = X, N = nrow(X), P = ncol(X))

# Fit model
fit <- stan("logistic.stan", data = standata)

# Obtain loo
loglik <- extract_log_lik(fit, merge_chains = FALSE)
r_eff <- relative_eff(exp(loglik), cores = 1)
loo <- loo(loglik, r_eff = r_eff, cores = 1)

# Try to obtain LOO manually by removing the ith observation
# Bernoulli probability of outcome y for each value of the linear predictor alpha
bernoulli_logit <- function(y, alpha) {
  p <- 1 / (1 + exp(-alpha))
  if (y == 1) p else 1 - p
}
elpd <- rep(NA, 50)
for (i in 1:50) {
  print(i)
  wells_i <- wells[-i, ]
  X_i <- model.matrix(~ dist100 + arsenic, wells_i)
  standata_i <- list(y = wells_i$switch, X = X_i, N = nrow(X_i), P = ncol(X_i))
  fit_minusi <- stan("logistic.stan", data = standata_i)
  mcmc_iters_i <- extract(fit_minusi, permuted = TRUE)
  beta_array <- mcmc_iters_i$beta   # draws x P matrix

  # Predictive density of the held-out observation at each posterior draw
  gen_i <- bernoulli_logit(wells$switch[i], X[i, , drop = FALSE] %*% t(beta_array))

  elpd[i] <- log(mean(gen_i))
}

# Compare the manually obtained elpd values with those from the loo function
cbind(elpd, loo$pointwise)

loo$estimates
sum(elpd)

Cool idea, but that could get expensive! At some point it's probably easier to just do 10-fold cross-validation directly (divide the data into 10 roughly equal parts and leave each one out in turn).
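For reference, a 10-fold version for the wells example above could be sketched like this (my own sketch, not from the thread; `kfold_split_random()` is from the loo package, and `X`, `wells`, and `logistic.stan` are as defined earlier):

```r
# 10-fold CV sketch for the wells logistic regression above.
library(loo)
K <- 10
folds <- kfold_split_random(K = K, N = nrow(X))   # random fold assignment
elpd_kfold <- rep(NA, nrow(X))
for (k in 1:K) {
  test <- folds == k
  standata_k <- list(y = wells$switch[!test], X = X[!test, , drop = FALSE],
                     N = sum(!test), P = ncol(X))
  fit_k <- stan("logistic.stan", data = standata_k)
  beta_k <- extract(fit_k)$beta                   # draws x P matrix

  # Predictive density of each held-out point, averaged over posterior draws
  eta <- X[test, , drop = FALSE] %*% t(beta_k)    # N_test x draws
  lik <- dbinom(wells$switch[test], 1, plogis(eta))
  elpd_kfold[which(test)] <- log(rowMeans(lik))
}
sum(elpd_kfold)
```

This fits the model K = 10 times instead of once per flagged observation, which is usually the cheaper option when many Pareto k values are high.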

rstanarm and brms can do "reloo" automatically, as the model structure is constrained and it is possible to provide automatic convenience functions. If you write the model in the Stan language and use RStan or CmdStanR, you need to do a bit of additional coding.

Yes, this is the best example we currently have. It doesn't show the reloo approach completely, but it shows how to compute the elpd values given "training" and "test" data. At the moment we don't have a convenience function for manipulating loo objects, but you can do it manually, as shown in the following draft code:

loo1 <- loo(fit1)
loo1$pointwise[loo1$diagnostics$pareto_k > 0.7, 1] <- reloo_pointwise

With this updated object you could already do model comparison using loo_compare(loo1, loo2), but for print(loo1) to show correct summaries you would also need to update the loo1$estimates array (right now I don't have time to write all that, but let me know if you are not able to figure it out).
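Updating the summaries could be sketched like this (my own draft, mirroring how loo builds the estimates table: the elpd_loo estimate is the sum of the pointwise column, its SE is sqrt(n * var(pointwise)), and looic is -2 times elpd_loo):

```r
# Recompute the summary rows of loo1$estimates after editing loo1$pointwise (sketch).
pw <- loo1$pointwise[, "elpd_loo"]
n <- length(pw)
loo1$estimates["elpd_loo", "Estimate"] <- sum(pw)
loo1$estimates["elpd_loo", "SE"] <- sqrt(n * var(pw))
loo1$estimates["looic", "Estimate"] <- -2 * sum(pw)
loo1$estimates["looic", "SE"] <- 2 * sqrt(n * var(pw))
```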

K-fold CV is probably easier, but the LOO-CV estimate has slightly lower variance, so depending on the case it may be beneficial to do reloo even for more than K observations.
