Loo with k_threshold parameter vs. kfold for comparing rstanarm models

  • Operating System: Windows 10
  • rstanarm Version: 2.18.2

I have a series of questions about the optimal use of loo and kfold to compare (or combine) the results of a nested set of stan_lmer models.

I have run a set of three nested hierarchical repeated-measures stan_lmer models to study the overall intercept and slope over time of an analyte in a series of seven related clinical trials. There were 100+ subjects and 1000+ analyte measurements. It is recognized that the individual subjects in these seven trials will each have their own underlying intercepts and slopes. The focus is on the values of the intercept and slope in the overall set of trials and on whether the intercepts and slopes differ among the seven trials. A side concern is identifying individuals with extreme values for either intercept or slope. (Sorry to be vague, but the specific data are confidential.)

model1  <- stan_lmer(analyte ~ testyrs + (testyrs | cpid), data = dataset)
model2  <- stan_lmer(analyte ~ testyrs + StudyID + (testyrs | cpid), data = dataset)
model3  <- stan_lmer(analyte ~ testyrs * StudyID + (testyrs | cpid), data = dataset)

I ran loo() on each of the models. In each case, 9 to 15 observations exceeded the Pareto k threshold of 0.7. So I ran both loo(model, k_threshold = 0.7) and kfold(model, 10) on each of the models, and then ran compare_models() and loo_model_weights() on the three models, with the following results:
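For reference, here is a sketch of how the objects compared below might have been created; the exact calls were not shown above, and the object names are simply those that appear in the output:

# Sketch only: PSIS-LOO with refits for the high Pareto k observations,
# plus 10-fold cross-validation, for each of the three models.
model1loo   <- loo(model1, k_threshold = 0.7)
model2loo   <- loo(model2, k_threshold = 0.7)
model3loo   <- loo(model3, k_threshold = 0.7)
model1kfold <- kfold(model1, K = 10)
model2kfold <- kfold(model2, K = 10)
model3kfold <- kfold(model3, K = 10)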

compare_models(model1loo, model2loo, model3loo)
Model comparison: 
(ordered by highest ELPD)

           elpd_diff  elpd_loo  se_elpd_loo  p_loo  se_p_loo    looic  se_looic
model3loo        0.0   -7545.0         43.0  197.1       9.9  15090.0      85.9
model2loo       -1.0   -7546.0         42.7  200.0       9.9  15092.1      85.4
model1loo       -3.5   -7548.5         42.7  200.8      10.0  15097.0      85.3

compare_models(model1kfold, model2kfold, model3kfold)

Model comparison: 
(ordered by highest ELPD)

             elpd_diff  elpd_kfold  se_elpd_kfold
model3kfold        0.0     -7553.8           42.3
model2kfold       -4.0     -7557.8           42.6
model1kfold      -12.9     -7566.7           42.6

loolist <- list(model1loo$pointwise[,1],  model2loo$pointwise[,1], model3loo$pointwise[,1])
loo::stacking_weights(loolist)

Method: stacking
------
       weight
model1 0.000 
model2 0.704 
model3 0.296  
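
An equivalent way to obtain these stacking weights, assuming loo 2.0 or later (where stacking is the default method), is to pass the loo objects directly:

# Equivalent call on the loo objects themselves (loo >= 2.0)
loo_model_weights(list(model1loo, model2loo, model3loo))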

kfoldlist <- list(model1kfold$pointwise, model2kfold$pointwise, model3kfold$pointwise)
loo::stacking_weights(kfoldlist)

Method: stacking
------
       weight
model1 0.000 
model2 0.455 
model3 0.545 

My questions:

  1. What is the correct question to ask of these analyses? When I do similar analyses using lme4::lmer, I typically ask whether the “model fit” is significantly improved by the addition of the new parameters, using AIC or something similar. But I am given to understand that with the Bayesian approach, a better question is which model will give better predictions on future samples. This question is addressed by the cross-validation approach taken by either loo() or kfold(). The above results suggest to me that none of the models is substantially superior to any of the others (elpd_diff << se_elpd_loo/kfold). Do I take this correctly to mean that there is no robust difference in intercept or slope among the seven trials? (In which case, the rest of my questions are probably not meaningful in this specific example.)

  2. The recommendations from loo() run without the k_threshold parameter seem to suggest using loo() with k_threshold or kfold(), depending on whether there are fewer or more than 10 “bad” Pareto cases. But when I tried to mix results from loo with k_threshold and from kfold, my knuckles were rapped with the error message “Error: Can’t mix objects computed using ‘loo’, ‘waic’, and ‘kfold’.” Indeed, the differences in elpd are about four times larger using kfold than using loo with k_threshold. This leads to the question: which analysis is more accurate/appropriate? So far as I can see, the only basis recommended for choosing between loo/k_threshold and kfold is processing time. Both approaches take so long to complete (at least for my models) that they have to be run overnight, or at least over a leisurely dinner. Is there a theoretical reason to choose one over the other?

  3. Finally, how do I proceed with these analyses? I got from McElreath that the most correct Bayesian way to deal with these results is to do model averaging. (And if so, how do I go about doing that? Do his functions work with rstanarm models?) But I sense that the Stan approach would be to use the “most complete model.” But in that case, why bother to do the cross-validation, and how do we avoid over-correction to the specific set of data being analyzed? In this case, predictions from models 2 and 3 are very similar (as one might guess from the small differences between them in elpd). But let us assume that the next time around, model 2 is “much better” than models 1 and 3. What, in the end, do I report to the world about how the intercepts and slopes in my study differ among the seven trials, and about whether model 3 is “better” than model 2, or even than model 1, for that matter?

I would dearly love some guidance from the Stan community about how to use these cross-validation functions. Many thanks in advance for your guidance.

Larry Hunsicker

More specifically, the question is which model is expected to give better predictions for future samples, as judged by log predictive density. In your case, they are all expected to be about the same.

If some of the estimated Pareto k values are high, then explicitly leaving each of those observations out and refitting is theoretically better, because the PSIS assumptions are not satisfied.

If you cared about the predictions per se, then stacking gives you the optimal weights on the predictions of the individual models. If what you really care about are the estimated coefficients, then I would report the results for each of them and say that there was no strong reason to prefer one over the other on predictive grounds. If you want to report only one, I would report the most general one (the third).


Many thanks, Ben. This is very helpful.
Larry Hunsicker

Since your models include (testyrs | cpid), and if some groups have a small number of observations, random k-fold can remove a large part of a group’s observations, and the result can be quite different from loo. It would be better to do a balanced k-fold, with the balancing done with respect to the groups. Unfortunately, our current helper function kfold_split_balanced works only for binary grouping, but you could do the fold assignment yourself so that roughly 1/K of the observations in each group are removed in each fold, as evenly as possible.
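
A rough sketch of doing that fold assignment by hand (hypothetical code, assuming the dataset and grouping variable from the models above; newer rstanarm versions accept the assignments via the folds argument of kfold(), and newer loo versions provide kfold_split_stratified() for the same purpose):

# Sketch: assign folds so that roughly 1/K of each subject's (cpid's)
# observations ends up in each fold.
K <- 10
folds <- integer(nrow(dataset))
for (g in unique(dataset$cpid)) {
  idx <- which(dataset$cpid == g)
  # spread this subject's observations as evenly as possible over the K folds
  folds[idx] <- sample(rep(1:K, length.out = length(idx)))
}
# e.g. kfold(model3, K = K, folds = folds) in rstanarm versions that support folds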

Currently compare_models doesn’t show se_elpd_diff if there are more than two models. You can run compare_models separately

compare_models(model1loo, model3loo)
compare_models(model2loo, model3loo)

to get the pairwise comparison SEs. The differences are so small that they are likely to be small compared to se_elpd_diff, too, but it’s good to check.
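
Equivalently, a pairwise difference and its standard error can be computed by hand from the pointwise values (a sketch, using the loo objects above):

# Sketch: manual pairwise comparison of model3 vs. model2 based on pointwise elpd_loo
d <- model3loo$pointwise[, "elpd_loo"] - model2loo$pointwise[, "elpd_loo"]
elpd_diff    <- sum(d)
se_elpd_diff <- sqrt(length(d)) * sd(d)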

What stacking adds here is that model1 is likely quite similar to model2 but a bit worse, while model2 and model3 are a bit different from each other, and thus a mixture of the two is better than either one alone.

Akaike derived AIC as an estimate of predictive performance when maximum likelihood is used, so it also answers the predictive performance question, but for a different form of inference. Later explanations of AIC over-simplified this to fit plus complexity.

You need to compare using se_elpd_diff, but the conclusion is likely to stay the same.

It’s better not to mix them, unless you are certain about what you are doing. It’s possible to manually get the individual elpd values for the comparison, but that’s probably not necessary here.

Yes.

Stacking. Which btw shows that it’s enough to include 2 and 3 in your average.

That is likely to be very close in performance compared to the stacked model, but easier to analyse.

I don’t understand this.

If in the model averaging the weights of two models are 0, then the one model with weight 1 is sufficient.

Lawrence_Hunsicker:
But in that case, why bother to do the cross-validation, and how do we avoid over-correction to the specific set of data being analyzed.


avehtari
I don’t understand this.

Larry Hunsicker

I guess that I didn’t understand it, either. Let me try again. One checks the cross-validation to be sure that one hasn’t “over adjusted.” If the more complicated model is much worse at cross-validation than a less complicated model, obviously one would reject it. I guess that Ben’s point was that if two models are essentially as good as one another, and if there was interest in the added variable(s) of the more complicated model, there would be no problem in reporting the more complicated one. In my case, I looked at the more complicated model. It did suggest some minor differences among the trials in slope. But those differences were pretty small compared with their MADs. So we get an answer, and, “Surprise, surprise!” it shows that there isn’t much difference.

Lawrence_Hunsicker:
(And if so, how do I go about doing that? Do his functions work with rstanarm models?)

avehtari
Stacking. Which btw shows that it’s enough to include 2 and 3 in your average.

Larry Hunsicker

Yes. Stacking shows the appropriate weights. But what do we weight-average? The parameters? If one model lacks a parameter that is present in another model, does one just assign it a value of 0? I have never actually done weighted averaging other than by using McElreath’s function. Does that function work with stanreg models?

Thanks, Aki, and thanks, again, Ben. You guys are impressive both in what you have done for the rest of us, and in your willingness to help.

Larry Hunsicker

Yes.

In the case of correlated predictors, the marginal posterior distributions can be misleading. The case study “Bayesian version of Does model averaging make sense?” shows an example where all the marginal effects are small compared to the marginal posterior variance, but jointly they are useful.

Not parameters, but the predictive distributions.
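
As a minimal sketch (not from the thread) of what averaging the predictive distributions can look like in rstanarm, using the kfold-based stacking weights reported above for model2 and model3:

# Sketch: draw from the stacked predictive distribution by mixing posterior
# predictive draws from model2 and model3 in proportion to their weights.
w     <- c(0.455, 0.545)                          # weights for model2, model3
yrep2 <- posterior_predict(model2, draws = 1000)  # draws x N matrices
yrep3 <- posterior_predict(model3, draws = 1000)
pick  <- sample(2:3, size = nrow(yrep2), replace = TRUE, prob = w)
yrep_stacked <- yrep2
yrep_stacked[pick == 3, ] <- yrep3[pick == 3, ]   # take model3's draws where selected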