When computing LOO for my rstan fit using the loo function, I obtain a few Pareto k values that are larger than 0.7.
I read somewhere that it is possible to re-run the model, removing the observations corresponding to the large Pareto k values one at a time. However, I have not found how to do this exactly. How do I obtain the elpd values for these bad observations? And how do I then combine the trustworthy elpd values from loo with the manually obtained ones?
I took this approach for a recent paper. I'm not really a fan; it feels a bit like a hack, and often, after you remove the worst points and refit the model, you end up with a new set of problematic points.
I implemented this as a for/while loop in R. Essentially, after fitting the model, I checked whether there were any points with Pareto k > 0.7. If there were, I removed those points and refit the model.
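Roughly, the loop looked like the following sketch (fit_and_loo here is a hypothetical wrapper around the Stan fit and the loo call, not a real function, and full_data is a placeholder for the data frame):
# Hypothetical sketch of the iterative refitting loop described above;
# fit_and_loo() stands in for whatever fits the Stan model and returns a loo object
dat <- full_data
repeat {
  loo_res <- fit_and_loo(dat)                       # fit model, run loo()
  bad <- which(loo_res$diagnostics$pareto_k > 0.7)  # problematic observations
  if (length(bad) == 0) break                       # done: all k values are fine
  dat <- dat[-bad, , drop = FALSE]                  # drop them and refit
}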
I think I haven’t explained clearly what I mean.
I don’t want to remove outliers. What I want to do instead is the following:
LOO performs leave-one-out cross-validation, with elpd_{loo} = \sum_{i=1}^n \log p(y_i | y_{-i}).
That is, LOO obtains elpd_i = \log p(y_i | y_{-i}) for each observation. However, we cannot trust these values for those i for which the Pareto k value is high.
Hence, my idea was to keep the elpd_i values from LOO that can be trusted, and to obtain the untrustworthy ones manually by removing observation i, refitting the model, and estimating elpd_i from that fit.
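In other words (my notation), the combined estimate would be

elpd_{loo} = \sum_{i: k_i \le 0.7} elpd_i^{PSIS} + \sum_{i: k_i > 0.7} elpd_i^{refit},

where the first sum uses the pointwise values from the loo function and the second uses the brute-force refits.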
I found out how to do this now.
I first use the loo function and then also obtain the elpd_i values manually. In this case, all Pareto k values are small, so we can check whether the two approaches give the same results.
# Define model
logisticstan <- "
data {
  int<lower=0> N;                    // number of data points
  int<lower=0> P;                    // number of predictors (including intercept)
  matrix[N, P] X;                    // predictors (including 1s for intercept)
  array[N] int<lower=0, upper=1> y;  // binary outcome
}
parameters {
  vector[P] beta;
}
model {
  beta ~ normal(0, 1);
  y ~ bernoulli_logit(X * beta);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    log_lik[n] = bernoulli_logit_lpmf(y[n] | X[n] * beta);
  }
}"
cat(logisticstan, file="logistic.stan")
library("rstan")
# Prepare data (only the first 50 observations are used)
url <- "http://stat.columbia.edu/~gelman/arm/examples/arsenic/wells.dat"
wells <- read.table(url)
wells <- wells[1:50, ]
wells$dist100 <- with(wells, dist / 100)
X <- model.matrix(~ dist100 + arsenic, wells)
standata <- list(y = wells$switch, X = X, N = nrow(X), P = ncol(X))
# Fit model
fit <- stan("logistic.stan", data = standata)
# Obtain loo
loglik <- extract_log_lik(fit, merge_chains = FALSE)
r_eff <- relative_eff(exp(loglik), cores = 1)
loo <- loo(loglik, r_eff = r_eff, cores = 1)
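The loo package can also list which observations would need refitting in the high-k case (none are expected in this small example):
# Indices of observations with Pareto k above 0.7
pareto_k_ids(loo, threshold = 0.7)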
# Try to obtain LOO manually by removing the ith observation
# Bernoulli likelihood p(y | alpha), where alpha is the linear predictor
# on the logit scale (one value per posterior draw)
bernoulli_logit <- function(y, alpha) {
  p <- 1 / (1 + exp(-alpha))  # inverse logit
  if (y == 1) as.vector(p) else as.vector(1 - p)
}
elpd <- rep(NA, 50)
for (i in 1:50) {
  print(i)
  # Refit the model without observation i
  wells_i <- wells[-i, ]
  X_i <- model.matrix(~ dist100 + arsenic, wells_i)
  standata_i <- list(y = wells_i$switch, X = X_i, N = nrow(X_i), P = ncol(X_i))
  fit_minusi <- stan("logistic.stan", data = standata_i)
  # Evaluate the held-out likelihood p(y_i | theta) for each posterior draw
  beta_draws <- extract(fit_minusi, permuted = TRUE)$beta    # draws x P matrix
  alpha_i <- X[i, , drop = FALSE] %*% t(beta_draws)          # linear predictor per draw
  gen_i <- bernoulli_logit(wells$switch[i], alpha_i)
  # elpd_i = log of the posterior-mean held-out likelihood
  # (log(mean(gen_i)) can underflow; a log-sum-exp version would be more stable)
  elpd[i] <- log(mean(gen_i))
}
# Comparing the manually obtained elpd values with those from the loo function
cbind(elpd, loo$pointwise[, "elpd_loo"])
loo$estimates
sum(elpd)
Cool idea, but that could get expensive! At some point it’s probably easier to just do 10-fold cross-validation directly (divide the data into 10 roughly equal parts and leave each one out in turn).
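For instance, a rough sketch using kfold_split_random from the loo package to assign the folds, reusing the model and the bernoulli_logit helper from above:
# 10-fold CV: refit without each fold, evaluate held-out log predictive density
K <- 10
fold <- kfold_split_random(K = K, N = nrow(wells))  # random fold assignment
elpd_kfold <- rep(NA, nrow(wells))
for (k in 1:K) {
  # Refit without fold k
  wells_k <- wells[fold != k, ]
  X_k <- model.matrix(~ dist100 + arsenic, wells_k)
  standata_k <- list(y = wells_k$switch, X = X_k, N = nrow(X_k), P = ncol(X_k))
  fit_k <- stan("logistic.stan", data = standata_k)
  beta_draws <- extract(fit_k, permuted = TRUE)$beta
  # Evaluate each held-out point in fold k
  for (i in which(fold == k)) {
    alpha_i <- X[i, , drop = FALSE] %*% t(beta_draws)
    elpd_kfold[i] <- log(mean(bernoulli_logit(wells$switch[i], alpha_i)))
  }
}
sum(elpd_kfold)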
rstanarm and brms can do “reloo” automatically because the model structure is constrained, which makes such automatic convenience functions possible. If you write the model in Stan language and use RStan or CmdStanR, you need to do a bit of additional coding.
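For example (fit_brms and fit_rstanarm are placeholder names for models fitted with those packages):
# brms: refit once per observation with a high Pareto k
loo(fit_brms, reloo = TRUE)
# rstanarm: same idea, triggered by the k_threshold argument
loo(fit_rstanarm, k_threshold = 0.7)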
Yes, this is the best example we currently have. It doesn’t show the reloo approach completely, but it shows how to compute the elpd’s given “training” and “test” data. At the moment we don’t have a convenience function for manipulating loo objects, but you can do it manually as shown in the following draft code
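(sketch only; it assumes the manually computed values from the refits are stored in a vector elpd_manual and the loo object is loo1)
# Observations whose PSIS estimates cannot be trusted
bad <- which(loo1$diagnostics$pareto_k > 0.7)
# Overwrite their pointwise elpd values with the manually computed ones
loo1$pointwise[bad, "elpd_loo"] <- elpd_manual[bad]
# Mark these as fixed so the diagnostics no longer flag them
loo1$diagnostics$pareto_k[bad] <- 0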
With this updated object, you could already do model comparison using loo_compare(loo1, loo2), but for print(loo1) to show correct summaries you would also need to update the loo1$estimates array (right now I don’t have time to write all that, but let me know if you are not able to figure it out).
K-fold-CV is probably easier, but the LOO-CV estimate has slightly lower variance, so depending on the case it may be beneficial to do reloo even if it requires more than K refits.
Thank you very much for the replies, this worked well for me!
Another question regarding the estimate when there are high Pareto k values: in the thread Interpret pareto k diagnostic - Modeling - The Stan Forums, you stated that “high Pareto k values indicate overoptimism”, meaning that the (untrustworthy) LOO value suggests a better model fit than is actually the case. Does this mean it is enough to obtain the LOO manually for the better model, because the LOO of the worse model can only get worse through the manual calculations?
Is there any reference I can use for this in my paper?
To be more accurate, the estimate is likely to get worse, but there is some probability of the other direction if the estimated values are close to each other to begin with.