# Model selection with k-fold cross validation, considering Monte Carlo error

I have 24 models to fit to data, and I am doing k-fold cross-validation for model selection. I could not use LOO (looic) because the Pareto k values were bad. I will pick the model with the highest expected log pointwise predictive density (elpd) to show the model fit in my paper. I am also going to discuss models that are not significantly worse than the best model: I calculate a z-score = (elpd difference between a model and the best model) / (standard error of the difference), and if the z-score < 2, I will treat that model as equally good as the best model.
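In code, the rule looks like this; a minimal sketch, assuming `kfold_fits` is a named list of 24 proper `"loo"`/`"kfold"` objects, one per model (the list name is hypothetical):

```r
library(loo)

# `kfold_fits`: a named list of 24 k-fold CV results, one per model
# (assumed to be proper "loo"/"kfold" objects, e.g. from loo::kfold()
# or brms::kfold()).
comp <- loo_compare(kfold_fits)    # rows sorted, best model first
print(comp)                        # columns include elpd_diff, se_diff

# z-score rule: flag models within 2 SE of the best one.
# The best model's own row has elpd_diff = 0 and se_diff = 0, so skip it.
z <- abs(comp[, "elpd_diff"]) / comp[, "se_diff"]
equally_good <- rownames(comp)[-1][z[-1] < 2]
```

Note that `se_diff` is computed from the paired pointwise differences against the best model, which is the comparison the z-score rule needs.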

One problem is that, due to Monte Carlo error, the best model can change between samplings. So I decided to run k-fold CV 100 times and pick the best model.

Here are my questions:

1. If I select the best model from 100 repetitions of k-fold CV for 24 models, should I use the model with the highest elpd or the model that becomes the best model most frequently?
2. Once the best model is selected, I will identify models that are equally good as the best model. Should I use all 2400 elpds (= 24 models × 100 k-fold CV runs) to calculate the z-scores against the best model? Or can I just use 100 elpds, the top one from each of the 100 k-fold CV runs, to calculate the z-scores (again against the best model)?

Kanghcon

I just realized that I can pool the elpds from all 100 k-fold CV runs, which decreases the Monte Carlo error roughly tenfold (it scales as 1/sqrt(number of draws), and sqrt(100) = 10). Neglecting the now very small Monte Carlo error, I can take into account only the standard error due to the number of data points when calculating the z-score.
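A sketch of this pooling idea, assuming `log_lik_runs` is a list of 100 (draws x N_obs) matrices of held-out log-likelihoods, one per repeated k-fold run (the names and shapes here are illustrative, not from the post):

```r
library(matrixStats)

# Stack the draws from all 100 runs into one big matrix:
log_lik_pooled <- do.call(rbind, log_lik_runs)   # (100 * draws) x N_obs
S <- nrow(log_lik_pooled)

# Pointwise elpd: log of the posterior-mean predictive density per point,
# computed stably via log-sum-exp.
pointwise <- colLogSumExps(log_lik_pooled) - log(S)
elpd <- sum(pointwise)

# SE from variation across the N data points, which dominates once the
# Monte Carlo error has been shrunk ~sqrt(100) = 10-fold by pooling:
se_elpd <- sqrt(length(pointwise) * var(pointwise))
```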

Hi,
@avehtari is the authority on this topic, but I will try to respond as best I can and hope he corrects me if I got stuff wrong :-)

The biggest fear I have with your proposed scheme is that it only tackles the uncertainty arising from the different splits in k-fold CV, not the uncertainty arising from the data-generating process itself.

Also, if you have the computing power to run the k-fold procedure 100 times, then you might as well have the computing power to compute exact leave-one-out cross-validation (assuming your dataset has not much more than k * 100 data points), or at least to increase k noticeably. I would bet that in most cases one run with large k is better than many runs with small k (at least if predicting one future observation is the prediction task you care about).

In any case, there are methods to compute elpd and its SE from exact k-fold/LOO cross-validation, which should IMHO be superior to the approach you propose, but AFAIK they are not well documented and are a bit opaque. The good part is that the results can then be interpreted in the same way as those from `loo`, sparing you the kind of questions you are having.
The code I have used in my work is from StanCon: https://github.com/stan-dev/stancon_talks/blob/master/2017/Contributed-Talks/07_nicenboim/kfold.Rmd

There, the model computes the likelihood for the held-out data points in `generated quantities`; see their models for how they did it. I had to update the R code a little to work with modern versions of the `loo` package:

```r
# Functions extract_log_lik_K and kfold adapted from
# https://github.com/stan-dev/stancon_talks/blob/master/2017/Contributed-Talks/07_nicenboim/kfold.Rmd
library(loo)  # for extract_log_lik()

# Combine the held-out log-likelihoods from the K per-fold stanfits
# into a single matrix.
extract_log_lik_K <- function(list_of_stanfits, list_of_holdout, ...) {
  K <- length(list_of_stanfits)
  list_of_log_liks <- plyr::llply(1:K, function(k) {
    extract_log_lik(list_of_stanfits[[k]], ...)
  })
  # `log_lik_heldout` will include the log-lik of all the held-out data
  # of all the folds. We define it as a (samples x N_obs) matrix
  # (same shape as each log_lik matrix), initialized to NA.
  log_lik_heldout <- list_of_log_liks[[1]] * NA
  for (k in 1:K) {
    log_lik <- list_of_log_liks[[k]]
    samples <- dim(log_lik)[1]
    N_obs <- dim(log_lik)[2]
    # A matrix with the same dimensions as log_lik_heldout,
    # with 1 where the observation was held out in fold k:
    heldout <- matrix(rep(list_of_holdout[[k]], each = samples),
                      nrow = samples)
    # Sanity check that previously stored log-lik values are not overwritten:
    if (any(!is.na(log_lik_heldout[heldout == 1]))) {
      warning("Heldout log_lik has been overwritten!")
    }
    # Store the held-out log-lik of fold k:
    log_lik_heldout[heldout == 1] <- log_lik[heldout == 1]
  }
  return(log_lik_heldout)
}

# Compute elpd and its SE from the combined held-out log-lik matrix.
kfold <- function(log_lik_heldout) {
  library(matrixStats)
  logColMeansExp <- function(x) {
    # More numerically stable than log(colMeans(exp(x)))
    S <- nrow(x)
    colLogSumExps(x) - log(S)
  }
  # See equation (20) of @VehtariEtAl2016
  pointwise <- matrix(logColMeansExp(log_lik_heldout), ncol = 1)
  colnames(pointwise) <- "elpd"
  # See equation (21) of @VehtariEtAl2016
  elpd_kfold <- sum(pointwise)
  se_elpd_kfold <- sqrt(ncol(log_lik_heldout) * var(pointwise))

  estimates <- matrix(NA_real_, nrow = 1, ncol = 2)
  rownames(estimates) <- "elpd_loo"
  colnames(estimates) <- c("Estimate", "SE")
  estimates["elpd_loo", "Estimate"] <- elpd_kfold
  estimates["elpd_loo", "SE"] <- se_elpd_kfold

  out <- list(
    pointwise = pointwise,
    estimates = estimates
  )
  structure(out, class = "loo")
}
```
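A sketch of how these helpers might be wired up; the data frame `df`, the compiled model object `stan_mod`, and its data interface are assumptions (the Stan model is assumed to take a `holdout` indicator and to compute `log_lik` for every observation in generated quantities, as in the StanCon talk):

```r
library(rstan)
library(loo)

K <- 10                                         # K <- N would give exact LOO
N <- nrow(df)
fold <- loo::kfold_split_random(K = K, N = N)   # fold id (1..K) per data point

fits <- vector("list", K)
holdouts <- vector("list", K)
for (k in 1:K) {
  holdouts[[k]] <- as.integer(fold == k)        # 1 = held out in fold k
  stan_data <- list(N = N, y = df$y, holdout = holdouts[[k]])
  fits[[k]] <- sampling(stan_mod, data = stan_data)
}

log_lik_heldout <- extract_log_lik_K(fits, holdouts)
res <- kfold(log_lik_heldout)
res$estimates   # elpd estimate and its SE
```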

Also, brms supports it: https://rdrr.io/cran/brms/man/kfold.brmsfit.html so you might be able to steal their code.
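A minimal brms sketch (formulas and data are placeholders, not from the post):

```r
library(brms)

fit1 <- brm(y ~ x1, data = df)
fit2 <- brm(y ~ x1 + x2, data = df)

kf1 <- kfold(fit1, K = 10)   # refits the model 10 times
kf2 <- kfold(fit2, K = 10)
loo_compare(kf1, kf2)        # elpd_diff and se_diff, as with loo()
```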

Hope that helps.
