- Operating System: Windows 10
- rstanarm Version: 2.18.2
I have a series of questions about the optimal use of loo and kfold to compare (or combine) the results of a nested set of stan_lmer models.
I have run a set of three nested hierarchical repeated-measures stan_lmer models to study the overall intercept and slope over time of an analyte in a series of seven related clinical trials. There were 100+ subjects and 1000+ analyte measurements. It is recognized that the individual subjects in these seven trials will each have their own underlying intercepts and slopes. The focus is on the values for intercept and slope in the overall set of trials and on whether the intercepts and slopes differ among the seven trials. A side concern is identifying individuals with extreme values for either intercept or slope. (Sorry to be vague, but the specific data are confidential.)
```r
model1 <- stan_lmer(analyte ~ testyrs + (testyrs | cpid), data = dataset)
model2 <- stan_lmer(analyte ~ testyrs + StudyID + (testyrs | cpid), data = dataset)
model3 <- stan_lmer(analyte ~ testyrs * StudyID + (testyrs | cpid), data = dataset)
```
I ran loo() on each of the models. In each case there were 9-15 observations with Pareto k estimates above the 0.7 threshold. So I ran both loo(model, k_threshold = 0.7) and kfold(model, K = 10) on each of the models, and then ran compare_models() and loo::stacking_weights() on the three models, with the following results:
```r
compare_models(model1loo, model2loo, model3loo)
#> Model comparison: (ordered by highest ELPD)
#>           elpd_diff elpd_loo se_elpd_loo  p_loo se_p_loo   looic se_looic
#> model3loo       0.0  -7545.0        43.0  197.1      9.9 15090.0     85.9
#> model2loo      -1.0  -7546.0        42.7  200.0      9.9 15092.1     85.4
#> model1loo      -3.5  -7548.5        42.7  200.8     10.0 15097.0     85.3

compare_models(model1kfold, model2kfold, model3kfold)
#> Model comparison: (ordered by highest ELPD)
#>             elpd_diff elpd_kfold se_elpd_kfold
#> model3kfold       0.0    -7553.8          42.3
#> model2kfold      -4.0    -7557.8          42.6
#> model1kfold     -12.9    -7566.7          42.6

loolist <- list(model1loo$pointwise[, 1], model2loo$pointwise[, 1],
                model3loo$pointwise[, 1])
loo::stacking_weights(loolist)
#> Method: stacking
#> ------
#>        weight
#> model1 0.000
#> model2 0.704
#> model3 0.296

kfoldlist <- list(model1kfold$pointwise, model2kfold$pointwise,
                  model3kfold$pointwise)
loo::stacking_weights(kfoldlist)
#> Method: stacking
#> ------
#>        weight
#> model1 0.000
#> model2 0.455
#> model3 0.545
```
What is the correct question to ask of these analyses? When I do similar analyses using lme4::lmer, I typically ask whether the model fit is significantly improved by the additional parameters, using AIC or something similar. But I am given to understand that with the Bayesian approach, a better question is which model will give better predictions on future samples, which is what the cross-validation performed by loo() or kfold() estimates. The above results suggest to me that none of the models is substantially superior to any of the others (elpd_diff << se_elpd_loo/kfold). Do I take this correctly to mean that there is no robust difference in intercept or slope among the seven trials? (In which case, the rest of my questions are probably not meaningful in this specific example.)
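(For concreteness, my understanding is that the relevant uncertainty for a pairwise comparison is the standard error of the paired elpd *difference*, not the marginal se_elpd values printed above. This is my own sketch of that computation, assuming the model2loo and model3loo objects from the output above; please correct me if the logic is wrong:)

```r
# SE of the pairwise elpd difference, from paired pointwise contributions
# (assumes model2loo and model3loo are the loo objects shown above)
d <- model3loo$pointwise[, "elpd_loo"] - model2loo$pointwise[, "elpd_loo"]
elpd_diff <- sum(d)                   # should match compare_models() output
se_diff   <- sqrt(length(d)) * sd(d)  # often much smaller than se_elpd_loo
```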
The warnings from loo() without the k_threshold argument seem to recommend rerunning with loo(..., k_threshold = 0.7) or switching to kfold() depending on whether there are fewer or more than 10 "bad" Pareto k cases. But when I tried to mix results from loo with k_threshold and from kfold, my knuckles were rapped with the error message "Error: Can't mix objects computed using 'loo', 'waic', and 'kfold'." Indeed, the elpd differences are about four times larger using kfold than using loo with k_threshold. This leads to the question: which analysis is more accurate/appropriate? So far as I can see, the only basis recommended for choosing between loo/k_threshold and kfold is processing time. Both approaches take so long to complete (at least for my models) that they have to be run overnight, or at least over a leisurely dinner. Is there a theoretical reason to choose one over the other?
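(In other words, as far as I can tell the comparisons only work within a single cross-validation method, which is what forced me to run both sets separately:)

```r
# All objects in a single call must come from the same CV method:
compare_models(model1loo,   model2loo,   model3loo)    # fine: all loo
compare_models(model1kfold, model2kfold, model3kfold)  # fine: all kfold
# compare_models(model1loo, model2kfold)  # -> "Can't mix objects ..." error
```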
Finally, how do I proceed with these analyses? I got from McElreath that the most correct Bayesian way to deal with these results is to do model averaging. (If so, how do I go about doing that? Do his functions work with rstanarm models?) But I sense that the Stan approach would be to use the "most complete model." In that case, though, why bother with the cross-validation, and how do we avoid overfitting to the specific set of data being analyzed? In this case, predictions from models 2 and 3 are very similar (as one might guess from the small differences in elpd). But let us assume that next time around, model 2 is "much better" than models 1 and 3. What, at the end, do I report to the world about how the intercepts and slopes differ among the seven trials, and whether model 3 is "better" than model 2, or even than model 1, for that matter?
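(For concreteness, here is my guess at what stacking-based model averaging would look like using rstanarm/loo objects directly, rather than McElreath's rethinking functions. The mixing step, where each posterior draw is taken from a model chosen with probability equal to its stacking weight, is my own assumption about a reasonable way to combine the predictions; is this the intended usage?)

```r
library(rstanarm)  # provides loo_model_weights() and posterior_predict()

# Stacking weights from the loo objects computed earlier
wts <- loo_model_weights(list(model1loo, model2loo, model3loo),
                         method = "stacking")

# Combine posterior predictive draws: for each draw, use the model
# selected with probability equal to its stacking weight
preds <- lapply(list(model1, model2, model3), posterior_predict)
S     <- nrow(preds[[1]])
pick  <- sample(length(preds), S, replace = TRUE, prob = wts)
mixed <- t(vapply(seq_len(S), function(s) preds[[pick[s]]][s, ],
                  numeric(ncol(preds[[1]]))))
```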
I would dearly love some guidance from the Stan community about how to use these cross-validation functions. Many thanks in advance for your guidance.