K-fold validation for hierarchical model in rstan

Hi all,

I am trying to test two fairly complex hierarchical models against each other in rstan. The Pareto k values are high in some cases, so loo_compare cannot be used:

Computed from 4000 by 4336 log-likelihood matrix

------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     4210  97.1%   0         
 (0.5, 0.7]   (ok)         75   1.7%   0         
   (0.7, 1]   (bad)        40   0.9%   0         
   (1, Inf)   (very bad)   11   0.3%   0         
See help('pareto-k-diagnostic') for details.

This led me to k-fold CV. I got started following the very useful tutorial of Aki Vehtari (here). However, the tutorial shows how to do this in rstanarm, and my code would be difficult to translate to that. I suppose I could estimate the model manually for K different folds and somehow paste this back together for the cross-validation, but I crushed and burned during my attempts thus far.

Hence my question: can this be done in rstan, and if yes, are there any useful resources on how to get started in a semi-automated way that you could point me to?

Max

Looks like I was a little quick on the draw here. The key was in this vignette, not sure how I missed it in my initial search. After that, I made a few adaptations for the hierarchical structure. I include an example below:

#create the folds in the data
df$fold <- kfold_split_grouped(K = 10, x = df$id)

#create starting values for the modell parameters
starts <- list(alpha=1 , beta=1 )

# prepare matrix for the log pd:
log_pd_kfol <- matrix(nrow = 4000, ncol = nrow(df))

#create loop for selecting data folds and estimating model
for(k in 1:10){
 data_train <- list(x = df$x[d$fold != k],
                    y = df$y[d$fold != k],
                    z = df$z[d$fold != k],
                    N = nrow(d[df$fold != k,]),
                    id = df[d$fold != k,] %>% group_indices(id),
                    N_id = max(id)
                   )
  data_test <- list(cx = df$x[d$fold == k],
                    y = df$y[d$fold == k],
                    z = df$z[d$fold == k],
                    N = nrow(d[df$fold == k,]),
                    id = df[d$fold == k,] %>% group_indices(id),
                    N_id = max(id)
                   )
  fit <- sampling(  mymodel , data = data_train, init = list(starts,starts,starts,starts), iter=2000, warmup=1000, chains=4, cores=4 , save_warmup=FALSE , init_r = .1 , control = list(max_treedepth = 12, adapt_delta=0.99))
  gen_test <- gqs( mymodel , draws = as.matrix(fit), data= data_test)
  log_pd_kfold[ , df$fold == k] <- extract_log_lik(gen_test)
}

This would seem o be it, but I keep running into errors that all seem to have something to do with the gqs() part of the loop. I will create a separate post on that.

Cheers,
Max

When I was doing kfold cross validation with rstan, I found that I couldn’t have anything in the transformed parameters block or I’d get gqs errors about subsetting, so I wrote a separate version of my model with just the data, parameters, and generated quantities block to use in the kfold loop.

Leaving out transformed parameters and leaving the model{} part empty indeed did the trick. All I had o do was to move some additional parameter definitions from the transformed parameters block to the generated quantities block.

Thanks so much for your help!

Max

1 Like