Train/test Model Comparison Workflow

Hello,

I’d like to double-check my model comparison workflow. Ideally, we would use the lovely LOO/PSIS methods, but we haven’t had much luck with them on this project: our model is quite complicated, at least by my standards, and we end up with a lot of very slow computation and high Pareto k statistics.

Instead, we are trying to implement a good old-fashioned train/test split.

Does this workflow look correct:

train_test_simple_FoMo <- function(df_train, ds_train, df_test, ds_test, 
                            FoMo_ver = "1.1",
                            features = c("spatial", "item_class"),
                            n_trials_to_sim = 1,
                            iter = 500) {
  
  
  # prepare training data and priors for Stan
  d_train_list <- prep_data_for_stan(df_train, ds_train, features)
  d_train_list <- add_priors_to_d_list(d_train_list, modelver = FoMo_ver)
  d_train_list$n_trials_to_sim <- n_trials_to_sim
  
  # compile the requested model version (e.g. "1.1" -> "FoMo1_1.stan")
  mod <- cmdstan_model(paste0("../../models/simple/FoMo", str_replace(FoMo_ver, "\\.", "_"), ".stan"))
  
  # train model
  m_train <- mod$sample(data = d_train_list, 
                    chains = 4, parallel_chains = 4, threads_per_chain = 4,
                    refresh = 0, 
                    iter_warmup = iter, iter_sampling = iter,
                    sig_figs = 3)
  
  loglik_train <- m_train$draws("log_lik", format = "matrix")
  

  # test model
  d_test_list <- prep_data_for_stan(df_test, ds_test, model_components = features)
  d_test_list <- add_priors_to_d_list(d_test_list, modelver = FoMo_ver)
  d_test_list$n_trials_to_sim <- n_trials_to_sim
    
  
  # evaluate the trained model on the held-out data via generated quantities
  gen_test <- mod$generate_quantities(m_train, data = d_test_list, seed = 123)
  loglik_test <- gen_test$draws("log_lik", format = "matrix")
  
  # ELPD of the held-out test set
  elpd_test <- elpd(loglik_test)
  
  return(list(model = m_train, 
              loglik = list(train = loglik_train, test = loglik_test), 
              elpd_test = elpd_test))
  
}


   
## model 1.0
model10 <- train_test_simple_FoMo(df_train, ds_train, df_test, ds_test, 
                            FoMo_ver = "1.0",
                            features = c("spatial", "item_class"),
                            n_trials_to_sim = 1,
                            iter = 500)

## model 1.1
model11 <- train_test_simple_FoMo(df_train, ds_train, df_test, ds_test, 
                            FoMo_ver = "1.1",
                            features = c("spatial", "item_class"),
                            n_trials_to_sim = 1,
                            iter = 500)

## model 1.2
model12 <- train_test_simple_FoMo(df_train, ds_train, df_test, ds_test, 
                            FoMo_ver = "1.2",
                            features = c("spatial", "item_class"),
                            n_trials_to_sim = 1,
                            iter = 500)

I can then run

loo_compare(model10$elpd_test, model11$elpd_test, model12$elpd_test)

       elpd_diff se_diff
model1  0.0       0.0   
model3  0.0       0.1   
model2 -0.2       0.1   

Which seems fine? However, I’d quite like to compute model weights (assuming this is a sensible thing to do?) as I feel these are easier for non-expert reviewers to understand.

If I do this, I get:

> loo_model_weights(list(model10$loglik$test, model11$loglik$test, model12$loglik$test))
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
Method: stacking
------
       weight
model1 0.035 
model2 0.000 
model3 0.965 

Two questions, which might help resolve some of my faulty thinking :)

  1. Am I right to be running loo_model_weights() and loo_compare()? My understanding was that loo is short for “leave-one-out”, and the loo methods that are generally used do some clever maths to estimate these values, whereas I am trying to test my model more directly?

  2. Can I ignore the Pareto k diagnostic values in this case? What do they actually mean here?

Thanks hugely

Alasdair

Yes, that’s right, though the math(s) to work it out approximately isn’t too bad; it’s just an importance sampling trick.
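
In case it helps to see what the trick looks like, here is a minimal sketch of the raw importance-sampling estimate of LOO (without the Pareto smoothing that PSIS adds), assuming log_lik is a draws-by-observations matrix like the one returned by m_train$draws("log_lik", format = "matrix"):

# Raw importance-sampling estimate of elpd_loo (no Pareto smoothing).
# The importance ratio for dropping observation i is proportional to
# 1 / p(y_i | theta_s), so the self-normalised estimate of p(y_i | y_-i)
# collapses to the harmonic mean of the pointwise likelihoods.
library(matrixStats)  # for logSumExp()

naive_is_loo <- function(log_lik) {
  S <- nrow(log_lik)  # number of posterior draws
  elpd_i <- apply(log_lik, 2, function(ll_i) log(S) - logSumExp(-ll_i))
  sum(elpd_i)
}

PSIS stabilises this by replacing the largest raw ratios with values from a fitted generalised Pareto distribution, and the Pareto k statistic reports how heavy-tailed those ratios are.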

If you’re really just doing K-fold cross-validation, you divide the data into K “folds” (as they are called in cross-validation), train on K-1 of them, and test on the remaining fold, repeating that for every possible test fold. You can do it with just one held-out test set, but then you’re only using 1/K as much data for testing. At this point leave-one-out isn’t relevant, so I’m guessing those functions are not the right thing to call, but I don’t actually know what they do. Usually, to compute ELPD you compute it on the held-out data and then add it up or average it across the folds. If you do ten-fold cross-validation you can compute the standard error from the sample standard deviation of the fold estimates (assuming the underlying values are unbiased, in some sense, since there is an intrinsic bias in cross-validation; I think they are here).
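
To make that concrete, here is a minimal sketch of the fold-level aggregation, assuming a hypothetical fit_fold(k) helper (not defined in this thread) that trains on everything except fold k and returns the draws-by-observations test log-likelihood matrix for fold k:

library(matrixStats)  # for logSumExp()

K <- 10
elpd_per_fold <- numeric(K)
for (k in seq_len(K)) {
  loglik_k <- fit_fold(k)  # S x n_k test log-likelihood matrix for fold k
  # pointwise lpd: log of the posterior-mean likelihood of each held-out point
  lpd_k <- apply(loglik_k, 2, logSumExp) - log(nrow(loglik_k))
  elpd_per_fold[k] <- sum(lpd_k)
}

elpd_kfold <- sum(elpd_per_fold)
# standard error of the total from the spread of the K fold estimates
se_kfold <- sd(elpd_per_fold) * sqrt(K)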

Just to confirm (I’m also working on this project…): the basic workflow is based on the loo vignette “Holdout validation and K-fold cross-validation of Stan programs with the loo package”. We do calculate the ELPD, but there is a sentence at the end: ‘if one wants to compare several models (with loo_compare), one should use the same folds for all the different models.’ I think the difference is that loo_compare() works directly on the ELPDs, whereas loo_model_weights() takes the log-likelihoods and then calls loo() internally (according to the loo_model_weights() help page, “Model averaging/weighting via stacking or pseudo-BMA weighting”), but I’m not sure if the warnings we get with loo_model_weights() mean we should also worry about loo_compare()!

@avehtari is probably the best person to ask about this. I don’t know what these internal functions inside of loo are doing.

That help page also mentions: “Currently the loo_model_weights() function is not implemented to be used with results from K-fold CV, but you can still obtain weights using K-fold CV results by calling the stacking_weights() function directly.” It should mention the same for pseudobma_weights(), as can be seen from the function arguments on that same help page:

stacking_weights(lpd_point, optim_method = "BFGS", optim_control = list())

pseudobma_weights(lpd_point, BB = TRUE, BB_n = 1000, alpha = 1)

(I’ll fix the doc)
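
For completeness, here is a sketch (not run on the real objects) of how the held-out log-likelihood matrices from the workflow above could be turned into the lpd_point matrix those functions expect: compute the pointwise log predictive density on the test set for each model and bind the results into an observations-by-models matrix.

library(loo)
library(matrixStats)  # for logSumExp()

# pointwise log predictive density of each held-out observation:
# log of the posterior-mean likelihood, computed stably on the log scale
lpd_from_loglik <- function(log_lik) {
  apply(log_lik, 2, logSumExp) - log(nrow(log_lik))
}

lpd_point <- cbind(
  model10 = lpd_from_loglik(model10$loglik$test),
  model11 = lpd_from_loglik(model11$loglik$test),
  model12 = lpd_from_loglik(model12$loglik$test)
)

stacking_weights(lpd_point)
# pseudobma_weights(lpd_point) accepts the same matrix

Because the lpd is computed directly on held-out data, there is no importance-sampling step here, and hence no Pareto k diagnostics.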