K-fold cv unreliable results

Dear Stan users,
I’m running some models with different predictors and I’d like to compare them.
The models are hurdle models (one component is bernoulli_logit and the other is beta), and they run very well with a non-centered parameterization: meaningful results, good diagnostics, no warnings. N is ~190 for the bernoulli component and ~160 for the beta component.

The log_lik is computed following Aki's suggestion in the thread "Log likelihood for hurdle model".
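
For concreteness, here is a minimal sketch of that layout (hypothetical names, not the actual model code): every observation contributes a Bernoulli term for the hurdle indicator (e.g. zero vs. non-zero), every non-degenerate observation additionally contributes a beta term, and the two blocks are stacked into one pointwise vector.

# Sketch only: z is the 0/1 hurdle indicator (length N), y_pos the non-degenerate
# responses in (0, 1), eta the logit-scale probability for the hurdle part,
# mu and phi the beta mean and precision (all names are placeholders).
hurdle_log_lik <- function(z, y_pos, eta, mu, phi) {
  ll_bern <- dbinom(z, size = 1, prob = plogis(eta), log = TRUE)
  ll_beta <- dbeta(y_pos, shape1 = mu * phi, shape2 = (1 - mu) * phi, log = TRUE)
  c(ll_bern, ll_beta)  # Bernoulli block first, then the beta block
}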

The Pareto k diagnostic values that I obtained from the loo package are not very good: I have between 10% and 15% of values greater than 0.7, depending on the model. Following "Vehtari, Gelman, & Gabry, 2016, Practical Bayesian model…", I decided to use k-fold cross-validation, which is more robust than WAIC and PSIS-LOO.
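
(For reference, this is the kind of check I mean, assuming log_lik_m8c is the draws-by-observations log-likelihood matrix and chain_id maps each draw to its chain; both names are hypothetical:)

library(loo)
r_eff <- relative_eff(exp(log_lik_m8c), chain_id = chain_id)
loo_m8c <- loo(log_lik_m8c, r_eff = r_eff)
print(loo_m8c)                        # includes the Pareto k diagnostic table
mean(pareto_k_values(loo_m8c) > 0.7)  # share of observations with k > 0.7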

Here is my question: 10-fold CV turned out to be unreliable; that is, I ran it twice and obtained different results (in one run one model was better than the other, and in the other run it was worse). Is there anything else I can do to get more reliable model comparison information? Would k > 10 help? Can I trust model comparison results more when k is closer to N?

Thank you for any suggestion you may have.
Luca

Can you tell how big the difference is and how much uncertainty there is in the difference? If the difference is small, you may observe variation due to Monte Carlo error (and/or variability due to the data set division in 10-fold CV, if it's not the same every time). These variations should usually be smaller than the overall uncertainty computed as in the paper you refer to.
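
In terms of the pointwise values, that overall uncertainty for a difference is computed roughly like this (hypothetical vector names):

diff_pw <- elpd_pw_A - elpd_pw_B   # same observations, same order, for both models
elpd_diff <- sum(diff_pw)
se_elpd_diff <- sqrt(length(diff_pw) * var(diff_pw))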

Theory says that for stable predictors, k closer to N has smaller variance (see "A survey of cross-validation procedures for model selection"). I have observed similar results (also mentioned in that Vehtari, Gelman, & Gabry, 2016 paper), but usually the difference is not big enough to affect the conclusions.

Thank you for your answer, Aki. Here are the elpd and SE for the two models in the two runs:

10-fold cv, first run:
model elpd elpd_se
m8c -19.04 15.68
m8n -34.63 19.30

10-fold cv, second run:
model elpd elpd_se
m8n -25.67 18.16
m8c -28.39 18.36

The difference between m8n and m8c in the second run is 2.7, SE = 21.1.
For the first run, the difference is 15.6, but unfortunately I can't get its SE because I overwrote the pointwise data (my bad: I didn't set the seed for the data division, so I can't replicate that output). I was aware of the relatively small sample and that there could be variability related to the data set division, but I didn't expect such big differences between the two runs.
My guess is that the difference between the two runs is mainly due to differences in the data division, which was random and stratified on the dependent variable (high vs. low).
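
For what it's worth, here is a sketch of a reproducible, stratified fold assignment; the data frame d and the high/low column name are made up, and the loo package also provides kfold_split_stratified() for the same purpose:

set.seed(123)  # fixing the seed makes the data division replicable
K <- 10
d$fold <- NA_integer_
for (lev in unique(d$high_low)) {
  idx <- which(d$high_low == lev)
  d$fold[idx] <- sample(rep(1:K, length.out = length(idx)))
}
table(d$fold, d$high_low)  # each fold should have a similar high/low balance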

The elpd_se’s are quite large…

…but even more surprising is that the SE of the difference is larger than the SEs for the single models. This hints that there might be something wrong with your 10-fold CV computation. It’s not possible to say more without more information.

For the k-fold computations, I adapted the code from
Vasishth, Nicenboim, Chopin, Ryder (2017). Bayesian hierarchical finite mixture models of reading times: A case study. PsyArXiv. https://doi.org/10.17605/OSF.IO/A4HS9

The code from that paper is pasted below. I also double-checked my script and things look OK to me.
I’ll try a k closer to N to see whether I can at least get similar results from different runs.

# given data in the following format:
#> head(headnoun[, c(1, 2, 3, 7, 10, 11)])
#     subj item type      rt   so SO
# 94     1   13 obj-ext  1561  0.5  0
# 221    1    6 subj-ext  959 -0.5  1
# 341    1    5 obj-ext   582  0.5  0
library(dplyr)  # group_by, sample_frac, anti_join, bind_rows

row.names(headnoun) <- 1:nrow(headnoun)
headnoun$row <- row.names(headnoun)
K <- 10
d <- headnoun
G <- list()
# Split the data into K folds, sampling within subjects: in iteration i we take
# 1/(K + 1 - i) of each subject's remaining rows, so every fold ends up with
# roughly 1/K of each subject's data.
for (i in 1:K) {
  G[[i]] <- sample_frac(group_by(d, subj), 1 / (K + 1 - i))
  G[[i]]$k <- i
  d <- anti_join(d, G[[i]], by = c("subj", "item", "type", "rt", "so", "SO"))
}

# Recombine the folds into a single data frame and restore the original row order:
dK <- bind_rows(G)
dK <- dK[order(as.numeric(dK$row)), ]

# One Stan data list per fold; 'heldout' flags the observations left out in fold i:
ldata <- plyr::llply(1:K, function(i) {
  list(N = nrow(dK),
       rt = dK$rt,
       so = dK$so,
       SO = dK$SO,
       subj = as.numeric(as.factor(dK$subj)),
       J = length(unique(dK$subj)),
       item = as.numeric(as.factor(dK$item)),
       K = length(unique(dK$item)),
       heldout = ifelse(dK$k == i, 1, 0))
})

## Model 0:
library(rstan)  # stan(), extract()

pointwisem0 <- list()
for (i in 1:K) {
  hnxval.dat <- ldata[[i]]
  # Fit the model; 'heldout' tells the Stan program which observations to
  # exclude from the likelihood while still computing their log_lik.
  m0xval <- stan(file = "models/m0xval.stan",
                 data = hnxval.dat,
                 iter = 4000, chains = 4,
                 refresh = 0)
  loglik <- extract(m0xval, pars = "log_lik")$log_lik
  # Pointwise elpd of each held-out observation: log of the posterior mean
  # of its likelihood.
  hldout <- which(hnxval.dat$heldout == 1)
  logmeans <- rep(NA, length(hldout))
  for (j in seq_along(hldout)) {
    logmeans[j] <- log(mean(exp(loglik[, hldout[j]])))
  }
  pointwisem0[[i]] <- logmeans
}

pointwisem0_flat <- Reduce(c, pointwisem0)
(elpdm0 <- sum(pointwisem0_flat))
(elpdm0_se <- sqrt(length(pointwisem0_flat) * var(pointwisem0_flat)))
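
## Aside (not from the paper's code): log(mean(exp(x))) can underflow when the
## log-likelihood values are very negative; an equivalent but numerically more
## stable form uses matrixStats::logSumExp.
library(matrixStats)
log_mean_exp <- function(x) logSumExp(x) - log(length(x))
## e.g. inside the loop above: logmeans[j] <- log_mean_exp(loglik[, hldout[j]])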

## Model 2:
[..]

## k-fold cv:

# elpd difference between the two models and its SE, computed from the paired
# pointwise values (same observations, same order, for both models):
(kfoldelpddiff <- elpdm2 - elpdm0)
(kfoldelpdse <- sqrt(length(pointwisem0_flat) *
                       var(pointwisem0_flat - pointwisem2_flat)))

I checked the paper and the code. The parts I could understand are likely correct, but some parts I couldn’t follow.

Could you plot pointwisem0_flat, pointwisem2_flat and the difference?
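
For example, something along these lines, using the vector names from the code above:

op <- par(mfrow = c(3, 1))
plot(pointwisem0_flat, ylab = "pointwise elpd, m0")
plot(pointwisem2_flat, ylab = "pointwise elpd, m2")
plot(pointwisem0_flat - pointwisem2_flat, ylab = "pointwise difference")
par(op)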

Thank you, Aki. Below are the graphs of the pointwise data for the two models mentioned in the previous posts (what is called pointwise*_flat in Vasishth’s code). For this new run (4 chains, iter = 2000 + warmup = 1000, a bit less than the previous runs), the elpd values are:

model elpd elpd_se
m8c -33.25 18.48
m8n -36.61 18.91

and the difference is -3.36, SE = 22.9

The graphs are below (m8c, m8n, and m8c minus m8n, respectively). The first ~195 points belong to the Bernoulli component of the hurdle model, and the remaining points to the beta component.

[Plot: pointwise values for m8c]

[Plot: pointwise values for m8n]

[Plot: pointwise differences, m8c minus m8n]

Are you certain that you have the same ordering of data for both models?
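
For example, a quick check (data frame names are hypothetical) is to carry the row identifier through the fold assignment for both models and compare:

# dK_m8c, dK_m8n: the reordered, fold-assigned data frames used for the two models
identical(dK_m8c$row, dK_m8n$row)  # TRUE if both models see observations in the same order
identical(dK_m8c$k, dK_m8n$k)      # TRUE if both models also use the same fold division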

Thank you so much, Aki … I double-checked my code and found a couple of lines, where I prepared the independent variables for the different models, that affected the ordering of the data.
Below are estimates from three new 10-fold CV runs with the corrected code.
There is still some variability in elpd, which could well be due to differences in the data division, but the SEs are much smaller. In any case, they all point to no relevant difference between the two models (elpd_diff is always smaller than its SE).

Would you consider the differences between the three runs acceptable, or should I still try k > 10 (20? 40?)?

## 10-fold cv, first run

        elpd_kfold  se_elpd_kfold
m8c_cv      -33.60          19.19
m8n_cv      -34.22          19.27
> print(compare(m8n_cv, m8c_cv), digits = 2)
elpd_diff        se
     0.62      3.49


## 10-fold cv, second run

        elpd_kfold  se_elpd_kfold
m8c_cv      -36.42          19.33
m8n_cv      -37.38          19.77
> print(compare(m8n_cv, m8c_cv), digits = 2)
elpd_diff        se
     0.96      3.70


## 10-fold cv, third run

        elpd_kfold  se_elpd_kfold
m8c_cv      -29.75          18.23
m8n_cv      -32.57          18.61
> print(compare(m8n_cv, m8c_cv), digits = 2)
elpd_diff        se
     2.82      3.74

Great!

The results are sensible now, and the variation between runs is also sensible. No need to try a larger k. There is no relevant difference.

Thank you very much again, Aki!
I’m not sure I would have detected that data ordering problem if you hadn’t asked.

Luca

There were two hints, which may be useful for others to know as well:

  1. The SE of the difference was larger than the SEs for the single models; this implies close to zero correlation between the pointwise values, while usually the correlation is high (see the sketch below).
  2. In your pointwise plots, the point close to -8 had a different index for each model, while it’s likely that an observation which is difficult to predict with one model is also difficult to predict with the other.
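
Both can be checked directly from the pointwise vectors, for example:

cor(pointwisem0_flat, pointwisem2_flat)  # usually clearly positive
which.min(pointwisem0_flat)  # the hardest observation for one model is typically
which.min(pointwisem2_flat)  # also the hardest for the other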

Aki
