Model selection with k-fold cross validation, considering Monte Carlo error

I have 24 models to fit to data, and I am doing k-fold cross-validation for model selection. I could not use LOO (looic) because the Pareto k diagnostics were bad. I will pick the model with the highest expected log pointwise predictive density (elpd) to present the model fit in my paper. I also plan to discuss models that are not significantly worse than the best model: I calculate a z score = (elpd difference between a model and the best model) / (standard error of that difference), and if the z score is < 2, I will treat that model as being as good as the best model.
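For concreteness, here is a minimal sketch of that comparison in R (the pointwise elpd vectors are hypothetical placeholders; note that the standard error is computed from the pointwise differences, which is also what loo::loo_compare() does):

# Hypothetical pointwise elpd vectors for the current best model and one competitor
elpd_best  <- rnorm(200, mean = -1.0, sd = 0.5)
elpd_other <- rnorm(200, mean = -1.1, sd = 0.5)

diff_i    <- elpd_other - elpd_best               # pointwise elpd differences
elpd_diff <- sum(diff_i)                          # total elpd difference
se_diff   <- sqrt(length(diff_i) * var(diff_i))   # SE of the difference
z         <- elpd_diff / se_diff                  # |z| < 2: not clearly worse than the best model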

One problem is that, due to Monte Carlo error, the best model can change between samplings. So I decided to run the k-fold CV 100 times and pick the best model across those runs.

Here I have questions:

  1. If I select the best model from 100 runs of k-fold CV across the 24 models, should I use the model with the highest elpd, or the model that is the best one most frequently?
  2. Once the best model is selected, I will identify models that are as good as the best model. Should I use all 2400 elpds (24 models * 100 k-fold CV runs) to calculate the z score against the best model? Or can I just use 100 elpds, the top one from each of the 100 k-fold CV runs, to calculate the z score (again against the best model)?

Thank you in advance for the discussion!

Kanghcon

I just realized that I can pool the elpds from all 100 k-fold CV runs, which decreases the Monte Carlo error by a factor of 10 (sqrt(100) = 10). Neglecting that now very small Monte Carlo error, I only need to take the standard error due to the number of data points into account when calculating the z score.
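Concretely, my idea is something like the following sketch (object names are hypothetical; log_lik_runs would be a list of held-out log-likelihood matrices, one per complete k-fold run):

# log_lik_runs: list of 100 matrices (posterior draws x N observations),
# each holding the held-out log-likelihoods from one full k-fold run.
# Stacking the draws pools the Monte Carlo samples across runs,
# so the MC error shrinks by roughly sqrt(100) = 10.
log_lik_all <- do.call(rbind, log_lik_runs)
S <- nrow(log_lik_all)
pointwise_elpd <- matrixStats::colLogSumExps(log_lik_all) - log(S)
elpd_total <- sum(pointwise_elpd)
se_elpd <- sqrt(length(pointwise_elpd) * var(pointwise_elpd))  # SE driven by the number of data points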

Hi,
@avehtari is the authority on this topic, but I will try to respond as best I can and hope he corrects me if I got anything wrong :-)

The biggest concern I have with your proposed scheme is that it only tackles the uncertainty arising from the different splits in k-fold CV, not the uncertainty coming from the data-generating process itself (i.e. from having only a finite dataset).

Also, if you have the computing power to run the k-fold procedure 100 times, then you might as well have the computing power to compute exact leave-one-out cross-validation (assuming your dataset has not much more than k * 100 data points), or at least to increase k noticeably. I would bet that in most cases a single run with a large k is better than many runs with a small k (at least if predicting one future observation is the prediction task you care about).

In any case, there are methods to compute elpd and its SE from exact k-fold/LOO cross-validation, which should IMHO be superior to the approach you propose, but AFAIK they are not well documented and are a bit opaque. The good part is that the results can then be interpreted in the same way as those from loo, sparing you the kind of questions you are having.
The code I have used in my work is from StanCon: https://github.com/stan-dev/stancon_talks/blob/master/2017/Contributed-Talks/07_nicenboim/kfold.Rmd

There, the model computes the likelihood of the held-out data points in generated quantities; see their models for how they did it.
I had to update their R code a little to work with modern versions of the loo package:

#Functions extract_log_lik and kfold taken from
#https://github.com/stan-dev/stancon_talks/blob/master/2017/Contributed-Talks/07_nicenboim/kfold.Rmd

extract_log_lik_K <- function(list_of_stanfits, list_of_holdout, ...){
  K <- length(list_of_stanfits)
  # Extract the (draws x N_obs) log-likelihood matrix from each fold's fit
  list_of_log_liks <- plyr::llply(1:K, function(k){
    loo::extract_log_lik(list_of_stanfits[[k]], ...)
  })
  # `log_lik_heldout` will contain the held-out log-likelihoods of all the folds.
  # We define `log_lik_heldout` as a (samples x N_obs) matrix
  # (similar to each log_lik matrix)
  log_lik_heldout <- list_of_log_liks[[1]] * NA
  for(k in 1:K){
    log_lik <- list_of_log_liks[[k]]
    samples <- dim(log_lik)[1]
    N_obs <- dim(log_lik)[2]
    # This is a matrix with the same size as log_lik_heldout
    # with 1 if the data was held out in the fold k
    heldout <- matrix(rep(list_of_holdout[[k]], each = samples), nrow = samples)
    # Sanity check that the previous log_lik is not being overwritten:
    if(any(!is.na(log_lik_heldout[heldout==1]))){
      warning("Heldout log_lik has been overwritten!!!!")
    }
    # We save here the log_lik of the fold k in the matrix:
    log_lik_heldout[heldout==1] <- log_lik[heldout==1]
  }
  return(log_lik_heldout)
}

kfold <- function(log_lik_heldout)  {
  logColMeansExp <- function(x) {
    # numerically more stable than log(colMeans(exp(x)))
    S <- nrow(x)
    matrixStats::colLogSumExps(x) - log(S)
  }
  # See equation (20) of Vehtari et al. (2016)
  pointwise <-  matrix(logColMeansExp(log_lik_heldout), ncol= 1)
  colnames(pointwise) <- "elpd"
  # See equation (21) of Vehtari et al. (2016)
  elpd_kfold <- sum(pointwise)
  se_elpd_kfold <-  sqrt(ncol(log_lik_heldout) * var(pointwise))

  estimates <- matrix(NA_real_, nrow = 1, ncol = 2)
  rownames(estimates) <- "elpd_loo"
  colnames(estimates) <- c("Estimate","SE")
  estimates["elpd_loo", "Estimate"] <- elpd_kfold
  estimates["elpd_loo", "SE"] <- se_elpd_kfold

  out <- list(
    pointwise = pointwise,
    estimates = estimates
  )
  structure(out, class = "loo")
}
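A rough usage sketch (the data preparation and sampling call are hypothetical, assuming a Stan program that takes a holdout indicator, excludes those rows from the likelihood, and writes log_lik for all N observations in generated quantities, similar in spirit to the linked StanCon example):

# Hypothetical 10-fold setup for one model; d is the full data set
K <- 10
N <- nrow(d)
fold_id <- sample(rep(1:K, length.out = N))       # random fold assignment
list_of_holdout <- lapply(1:K, function(k) as.integer(fold_id == k))
list_of_stanfits <- lapply(1:K, function(k) {
  standata <- list(N = N, y = d$y, x = d$x, holdout = list_of_holdout[[k]])
  rstan::sampling(stan_model_obj, data = standata)
})

log_lik_heldout <- extract_log_lik_K(list_of_stanfits, list_of_holdout)
kf <- kfold(log_lik_heldout)
kf$estimates   # elpd and its SE for this model

The pointwise elpds in kf$pointwise can then be used for pairwise model comparisons as discussed above.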


Also, brms supports it: https://rdrr.io/cran/brms/man/kfold.brmsfit.html, so you might be able to steal their code.
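For example (model formulas and data are hypothetical; kfold() refits the model K times and the resulting objects can be compared with loo_compare()):

library(brms)
fit1 <- brm(y ~ x1, data = mydata)        # hypothetical models
fit2 <- brm(y ~ x1 + x2, data = mydata)
kf1 <- kfold(fit1, K = 10)                # refits fit1 on 10 training subsets
kf2 <- kfold(fit2, K = 10)
loo_compare(kf1, kf2)                     # elpd differences and their SEs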

Hope that helps.
