When computing LOO for my rstan fit using the loo function, I obtain a few Pareto k values that are larger than 0.7.
I read somewhere that it is possible to re-run the model, removing the observations corresponding to the large Pareto k values one at a time. However, I have not found how to do this exactly. How do I obtain the elpd values for these bad observations? And how do I then combine the trustworthy elpd values from loo with the manually obtained ones?
I took this approach for a recent paper. I'm not really a fan; it feels a bit like a hack, and often, after you remove the worst points and refit the model, you end up with a new set of problematic points.
I implemented this as a for/while loop in R. Essentially, after fitting the model, I checked whether there were any points with Pareto k > 0.7. If there were, I removed those points and refit the model.
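Roughly, the loop looked like the following sketch (fit_and_loo here is a hypothetical wrapper around the Stan fit and the loo call, not a real function, and full_data is a placeholder for the data frame):
# Hypothetical sketch of the iterative refitting loop described above;
# fit_and_loo() stands in for whatever fits the Stan model and returns a loo object
dat <- full_data
repeat {
  loo_res <- fit_and_loo(dat)                       # fit model, run loo()
  bad <- which(loo_res$diagnostics$pareto_k > 0.7)  # problematic observations
  if (length(bad) == 0) break                       # done: all k values are fine
  dat <- dat[-bad, , drop = FALSE]                  # drop them and refit
}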
I think I haven’t explained clearly what I mean.
I don’t want to remove outliers. What I want to do instead is the following:
LOO performs leave-one-out cross-validation, with elpd_{loo} = \sum_{i=1}^n \log p(y_i | y_{-i}).
That is, LOO obtains elpd_i = \log p(y_i | y_{-i}) for each observation. However, we cannot trust these values for those i for which the Pareto k value is high.
Hence, my idea was to keep the elpd_i values from LOO that can be trusted, and to obtain the untrustworthy ones manually by removing observation i, refitting the model, and estimating elpd_i from that fit.
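In other words (my notation), the combined estimate would be

elpd_{loo} = \sum_{i: k_i \le 0.7} elpd_i^{PSIS} + \sum_{i: k_i > 0.7} elpd_i^{refit},

where the first sum uses the pointwise values from the loo function and the second uses the brute-force refits.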
I found out how to do this now.
I first use the loo function and then also obtain the elpd_i values manually. In this case, all Pareto k values are small, so we can check whether the two approaches give the same results.
# Define model
logisticstan <- "
data {
  int<lower=0> N;                    // number of data points
  int<lower=0> P;                    // number of predictors (including intercept)
  matrix[N, P] X;                    // predictors (including 1s for intercept)
  array[N] int<lower=0, upper=1> y;  // binary outcome
}
parameters {
  vector[P] beta;
}
model {
  beta ~ normal(0, 1);
  y ~ bernoulli_logit(X * beta);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    log_lik[n] = bernoulli_logit_lpmf(y[n] | X[n] * beta);
  }
}"
cat(logisticstan, file="logistic.stan")
library("rstan")
# Prepare data (only the first 50 observations are used)
url <- "http://stat.columbia.edu/~gelman/arm/examples/arsenic/wells.dat"
wells <- read.table(url)
wells <- wells[1:50, ]
wells$dist100 <- with(wells, dist / 100)
X <- model.matrix(~ dist100 + arsenic, wells)
standata <- list(y = wells$switch, X = X, N = nrow(X), P = ncol(X))
# Fit model
fit <- stan("logistic.stan", data = standata)
# Obtain loo
loglik <- extract_log_lik(fit, merge_chains = FALSE)
r_eff <- relative_eff(exp(loglik), cores = 1)
loo <- loo(loglik, r_eff = r_eff, cores = 1)
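The loo package can also list which observations would need refitting in the high-k case (none are expected in this small example):
# Indices of observations with Pareto k above 0.7
pareto_k_ids(loo, threshold = 0.7)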
# Try to obtain LOO manually by removing the ith observation
# Bernoulli likelihood p(y | alpha), where alpha is the linear predictor
# on the logit scale (one value per posterior draw)
bernoulli_logit <- function(y, alpha) {
  p <- 1 / (1 + exp(-alpha))  # inverse logit
  if (y == 1) as.vector(p) else as.vector(1 - p)
}
elpd <- rep(NA, 50)
for (i in 1:50) {
  print(i)
  # Refit the model without observation i
  wells_i <- wells[-i, ]
  X_i <- model.matrix(~ dist100 + arsenic, wells_i)
  standata_i <- list(y = wells_i$switch, X = X_i, N = nrow(X_i), P = ncol(X_i))
  fit_minusi <- stan("logistic.stan", data = standata_i)
  # Evaluate the held-out likelihood p(y_i | theta) for each posterior draw
  beta_draws <- extract(fit_minusi, permuted = TRUE)$beta    # draws x P matrix
  alpha_i <- X[i, , drop = FALSE] %*% t(beta_draws)          # linear predictor per draw
  gen_i <- bernoulli_logit(wells$switch[i], alpha_i)
  # elpd_i = log of the posterior-mean held-out likelihood
  # (log(mean(gen_i)) can underflow; a log-sum-exp version would be more stable)
  elpd[i] <- log(mean(gen_i))
}
# Comparing the manually obtained elpd values with those from the loo function
cbind(elpd, loo$pointwise[, "elpd_loo"])
loo$estimates
sum(elpd)
Cool idea, but that could get expensive! At some point it’s probably easier to just do 10-fold cross-validation directly (divide the data into 10 roughly equal parts and leave each one out in turn).
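For instance, a rough sketch using kfold_split_random from the loo package to assign the folds, reusing the model and the bernoulli_logit helper from above:
# 10-fold CV: refit without each fold, evaluate held-out log predictive density
K <- 10
fold <- kfold_split_random(K = K, N = nrow(wells))  # random fold assignment
elpd_kfold <- rep(NA, nrow(wells))
for (k in 1:K) {
  # Refit without fold k
  wells_k <- wells[fold != k, ]
  X_k <- model.matrix(~ dist100 + arsenic, wells_k)
  standata_k <- list(y = wells_k$switch, X = X_k, N = nrow(X_k), P = ncol(X_k))
  fit_k <- stan("logistic.stan", data = standata_k)
  beta_draws <- extract(fit_k, permuted = TRUE)$beta
  # Evaluate each held-out point in fold k
  for (i in which(fold == k)) {
    alpha_i <- X[i, , drop = FALSE] %*% t(beta_draws)
    elpd_kfold[i] <- log(mean(bernoulli_logit(wells$switch[i], alpha_i)))
  }
}
sum(elpd_kfold)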
rstanarm and brms can do “reloo” automatically because the model structure is constrained, which makes such automatic convenience functions possible. If you write the model in Stan language and use RStan or CmdStanR, you need to do a bit of additional coding.
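For example (fit_brms and fit_rstanarm are placeholder names for models fitted with those packages):
# brms: refit once per observation with a high Pareto k
loo(fit_brms, reloo = TRUE)
# rstanarm: same idea, triggered by the k_threshold argument
loo(fit_rstanarm, k_threshold = 0.7)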
Yes, this is the best example we currently have. It doesn’t show the reloo approach completely, but it shows how to compute the elpd’s given “training” and “test” data. At the moment we don’t have a convenience function for manipulating loo objects, but you can do it manually as shown in the following draft code
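(sketch only; it assumes the manually computed values from the refits are stored in a vector elpd_manual and the loo object is loo1)
# Observations whose PSIS estimates cannot be trusted
bad <- which(loo1$diagnostics$pareto_k > 0.7)
# Overwrite their pointwise elpd values with the manually computed ones
loo1$pointwise[bad, "elpd_loo"] <- elpd_manual[bad]
# Mark these as fixed so the diagnostics no longer flag them
loo1$diagnostics$pareto_k[bad] <- 0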
With this updated object, you could already do model comparison using loo_compare(loo1, loo2), but for print(loo1) to show correct summaries you would also need to update the loo1$estimates array (right now I don’t have time to write all that, but let me know if you are not able to figure it out).
K-fold-CV is probably easier, but the LOO-CV estimate has slightly lower variance, so depending on the case it may be beneficial to do reloo even if it requires more than K refits.
Thank you very much for the replies, this worked well for me!
Another question regarding the estimate when there are high Pareto k values: in the thread Interpret pareto k diagnostic - Modeling - The Stan Forums, you stated that “high Pareto k values indicate overoptimism”, meaning that the (untrustworthy) LOO value suggests a better model fit than is actually the case. Does this mean it is enough to obtain the LOO manually for the better model, because the LOO of the worse model can only get worse through the manual calculations?
Is there any reference I can use for this in my paper?
To be more accurate, the estimate is likely to get worse, but there is some probability of the other direction if the estimated values are close to each other to begin with.