K-fold validation for hierarchical model in rstan

maxmontana · June 24, 2021, 2:32pm

Hi all,

I am trying to test two fairly complex hierarchical models against each other in rstan. The Pareto k values are high in some cases, so loo_compare cannot be used:

Computed from 4000 by 4336 log-likelihood matrix

------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     4210  97.1%   0         
 (0.5, 0.7]   (ok)         75   1.7%   0         
   (0.7, 1]   (bad)        40   0.9%   0         
   (1, Inf)   (very bad)   11   0.3%   0         
See help('pareto-k-diagnostic') for details.

This led me to k-fold CV. I got started following the very useful tutorial of Aki Vehtari (here). However, the tutorial shows how to do this in rstanarm, and my code would be difficult to translate to that. I suppose I could estimate the model manually for K different folds and somehow paste this back together for the cross-validation, but I crushed and burned during my attempts thus far.

Hence my question: can this be done in rstan, and if yes, are there any useful resources on how to get started in a semi-automated way that you could point me to?

Max

maxmontana · June 24, 2021, 6:38pm

Looks like I was a little quick on the draw here. The key was in this vignette, not sure how I missed it in my initial search. After that, I made a few adaptations for the hierarchical structure. I include an example below:

#create the folds in the data
df$fold <- kfold_split_grouped(K = 10, x = df$id)

#create starting values for the modell parameters
starts <- list(alpha=1 , beta=1 )

# prepare matrix for the log pd:
log_pd_kfol <- matrix(nrow = 4000, ncol = nrow(df))

#create loop for selecting data folds and estimating model
for(k in 1:10){
 data_train <- list(x = df$x[d$fold != k],
                    y = df$y[d$fold != k],
                    z = df$z[d$fold != k],
                    N = nrow(d[df$fold != k,]),
                    id = df[d$fold != k,] %>% group_indices(id),
                    N_id = max(id)
                   )
  data_test <- list(cx = df$x[d$fold == k],
                    y = df$y[d$fold == k],
                    z = df$z[d$fold == k],
                    N = nrow(d[df$fold == k,]),
                    id = df[d$fold == k,] %>% group_indices(id),
                    N_id = max(id)
                   )
  fit <- sampling(  mymodel , data = data_train, init = list(starts,starts,starts,starts), iter=2000, warmup=1000, chains=4, cores=4 , save_warmup=FALSE , init_r = .1 , control = list(max_treedepth = 12, adapt_delta=0.99))
  gen_test <- gqs( mymodel , draws = as.matrix(fit), data= data_test)
  log_pd_kfold[ , df$fold == k] <- extract_log_lik(gen_test)
}

This would seem o be it, but I keep running into errors that all seem to have something to do with the gqs() part of the loop. I will create a separate post on that.

Cheers,
Max

scisus · June 27, 2021, 11:43pm

When I was doing kfold cross validation with rstan, I found that I couldn’t have anything in the transformed parameters block or I’d get gqs errors about subsetting, so I wrote a separate version of my model with just the data, parameters, and generated quantities block to use in the kfold loop.

maxmontana · June 28, 2021, 12:03pm

Leaving out transformed parameters and leaving the model{} part empty indeed did the trick. All I had o do was to move some additional parameter definitions from the transformed parameters block to the generated quantities block.

Thanks so much for your help!

Max

Jesse_Birchfield · March 29, 2023, 4:17am

I am having the same problem. Could you please explain the solution? In what way are you leaving out transformed parameters and leaving the model block empty? How could you fit the model that way?

Topic		Replies	Views
Model Comparison when high Pareto k? General rstan , loo	5	135	June 29, 2025
LOO Model Comparison Alternative Modeling rstan , techniques , loo , cmdstanr	3	89	March 27, 2025
LOOIC diagnostics with Stanfit object Modeling loo	5	650	October 30, 2020
Loo+rstanarm for hierarchical models rstanarm loo	1	965	April 18, 2018
Kfold returning NaN rstanarm loo	13	3155	September 19, 2017

K-fold validation for hierarchical model in rstan

Related topics