Model selection with loo_compare and loo_model_weights

I am conducting an analysis of various effects on the harvest of different fish species. Each species is analyzed independently using the same set of 8 nested models. I will not be using the results for prediction; I am trying to identify which predictors modify the response variable and in what direction.
It was suggested to me to use both loo_compare and loo_model_weights for model selection. For some of the species, the model selected by loo_compare is not the same as the model with the highest stacking weight from loo_model_weights, and I do not know how best to proceed with reporting results and discussion.

I have read some papers and the various documentation, but it is still unclear to me which method would be best for my study, whether another method would be better, or whether I should report the results from both methods (although that sounds like it could be confusing for the reader).
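For reference, the comparisons are computed along these lines (a minimal sketch; `fits` is a placeholder for the named list of fitted models for one species):

library(loo)
# `fits` stands for a named list of the fitted models for one species (placeholder name)
loo_list <- lapply(fits, loo)
loo_compare(loo_list)                             # ranks models by elpd_loo
loo_model_weights(loo_list, method = "stacking")  # stacking weights over the same models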

Can you show the output you got from loo_compare() and loo_model_weights()? I could then provide a more detailed answer.

Example 1

loo_compare
      elpd_diff se_diff
mod4     0.0       0.0 
mod10   -5.5       4.4 
mod1    -6.9       6.9 
mod2   -11.8       8.8 
mod3   -26.2       9.8 
mod9   -35.6      13.5 
mod11  -35.9       7.9 
mod5  -399.7      31.3 
mod6  -418.2      31.3 
mod13 -418.3      29.7 
mod8  -418.8      30.2 
mod7  -464.7      34.2 
mod12 -475.8      34.2 
mod14 -572.5      38.2 

loo_model_weights
Method: stacking
------
      weight
mod1  0.180 
mod2  0.106 
mod3  0.202 
mod4  0.097 
mod5  0.000 
mod6  0.000 
mod7  0.001 
mod8  0.000 
mod9  0.000 
mod10 0.191 
mod11 0.091 
mod12 0.000 
mod13 0.106 
mod14 0.025 

Example 2

loo_compare
      elpd_diff se_diff
mod3     0.0       0.0 
mod1    -0.4       3.4 
mod11   -1.7       6.3 
mod4    -2.8       5.7 
mod10  -91.1      14.0 
mod2   -94.2      13.6 
mod9   -96.3      13.5 
mod5  -153.9      16.1 
mod8  -175.5      19.1 
mod6  -253.7      19.1 
mod13 -278.0      21.0 
mod7  -346.4      28.2 
mod14 -398.9      24.4 
mod12 -402.2      29.5 

 loo_model_weights
Method: stacking
------
      weight
mod1  0.148 
mod2  0.000 
mod3  0.000 
mod4  0.011 
mod5  0.064 
mod6  0.000 
mod7  0.137 
mod8  0.122 
mod9  0.000 
mod10 0.194 
mod11 0.323 
mod12 0.002 
mod13 0.000 
mod14 0.000 

I should add that these are not exactly nested models: I checked each model for multicollinearity and removed the predictor with the highest VIF > 5, then repeated this process until all VIFs were below 5.

The top models are so close to each other that we can't say one is better than the others. See CV-FAQ 16 and the reference given there for more information.

Stacking is not model selection but model averaging: it can give non-zero weight to a model that is on average much worse, because the mixture of weighted models can still be better. For example, if the true distribution is a t-distribution and you have one narrow and one wide normal, one of the normals can be much worse alone, but the mixture can be closer to the t-distribution than either normal, and thus the weights can be non-zero for both.
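To illustrate with a small self-contained sketch (not from the fish analysis; two fixed normal densities stand in for fitted models):

library(loo)
set.seed(1)
y <- rt(500, df = 3)  # heavy-tailed "true" data
# Pointwise log predictive densities of two fixed normal "models":
lpd <- cbind(narrow = dnorm(y, mean = 0, sd = 1.5, log = TRUE),
             wide   = dnorm(y, mean = 0, sd = 6,   log = TRUE))
colMeans(lpd)          # the wide normal is much worse on average
stacking_weights(lpd)  # yet both typically get non-zero stacking weight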

Your stacking weights are mostly non-zero for the models that have good predictive performance, but, for example, mod13 has weight 0.106 although its predictive performance alone is much worse than that of the best models. This is likely because mod13 has thicker tails in its predictive distribution, and adding it to the mixture improves the combination a little. This hints that it might be possible to modify one of the top models to have an observation model that better matches the true distribution.

The top models are indistinguishable based on predictive performance, and the rest have much worse predictive performance.

I didn’t assume they are nested, and the stacking weights even hint that they are not, as so many models have non-zero weight.

This will make the model comparison very difficult. Instead, I recommend including all the covariates and using projection predictive variable selection, especially if the only difference between the models is which covariates are included. If you tell me more about the models, I may be able to provide further advice.
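For families that projpred supports directly, the workflow is roughly the following (a sketch; `refm_fit` is a placeholder for a reference model containing all the covariates):

library(projpred)
cvvs <- cv_varsel(refm_fit)               # cross-validated predictor ranking
plot(cvvs, stats = "mlpd", deltas = TRUE) # performance relative to the reference model
suggest_size(cvvs)                        # suggested number of predictor terms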


Thank you so much for engaging with me. What else would you like to know about the models?
I will read your article on Projection predictive variable selection.

I am fitting a neg_binomial_2 family model, which does not seem to be supported by projpred. Here are two figures: first the ppc_dens_overlay, and second the mcmc_areas for the full model for one of my species (one with a particularly large number of zeros!).

[figure: ppc_dens_overlay]

[figure: mcmc_areas for the full model]
I think you could do this using the recently added latent-space approach. The vignette "Latent projection predictive feature selection" also has a negative binomial example.

The marginal posteriors hint at collinearity, and projpred would probably work well.


Thank you.
I got as far as identifying the solution terms (soltrms_lat_final in the example).
The example then refers to the main vignette for post-selection inference. That vignette calls the function project(), which does not accept neg_binomial_2 models.

I do not need to make predictions from the final model; I am rather looking for influential variables and the direction in which they modify the response variable. Can I simply use the identified solution terms to rerun the model on the whole data set?

Pinging @fweber144, who made the vignette; maybe he is able to help.

Did you set latent = TRUE and the other arguments relevant for the latent projection (here: latent_ll_oscale = latent_ll_oscale_nebin and latent_ppd_oscale = latent_ppd_oscale_nebin) also in the project() call, not only in the varsel() or cv_varsel() call? This is explained in the Implementation section, but I guess I should have pointed that out more clearly.
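In other words, something like this (a sketch; argument names follow the latent-projection vignette, and depending on your projpred version the solution_terms argument may be called predictor_terms):

prj <- project(
  refm_fit_nebin,
  solution_terms = soltrms_lat_final,
  latent = TRUE,
  latent_ll_oscale = latent_ll_oscale_nebin,
  latent_ppd_oscale = latent_ppd_oscale_nebin
)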

That sounds like you don’t have an actual prediction problem but rather want to interpret the parameter estimates. Just be warned that projpred is made for prediction problems, not for parameter interpretation.


You can avoid repeating the latent-projection arguments by first defining a reference model object of class refmodel via projpred::get_refmodel() (which redirects to projpred:::get_refmodel.stanreg() in your case); it is in this creation of the reference model object that you add the latent-projection arguments. You can then call varsel(), cv_varsel(), or project() on that predefined reference model object (instead of on the reference model fit) without needing to repeat those arguments.

Alternatively, if you have run the code from the latent-projection vignette, you can call project() on the output of varsel() or cv_varsel() (instead of on the reference model fit), again without repeating the latent-projection arguments.
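A sketch of both options (object names as above; this is only meant to illustrate where the latent-projection arguments go):

# Option 1: define the reference model object once, with the latent-projection
# arguments, and reuse it in varsel()/cv_varsel()/project()
refm_obj <- get_refmodel(
  refm_fit_nebin,
  latent = TRUE,
  latent_ll_oscale = latent_ll_oscale_nebin,
  latent_ppd_oscale = latent_ppd_oscale_nebin
)
cvvs <- cv_varsel(refm_obj)
prj  <- project(refm_obj, solution_terms = soltrms_lat_final)

# Option 2: call project() on the cv_varsel() (or varsel()) output, which already
# carries the latent-projection settings
prj2 <- project(cvvs, nterms = suggest_size(cvvs))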

All of this might sound a bit complicated now, but it is in fact a quite natural approach once you get more familiar with the projpred UI. Anyway, I guess I really need to revise that part of the latent-projection vignette. Sorry for that; I’ll put it on the issue tracker.


Again, thank you both for your help.
I was able to get the projections to work using the latent = TRUE argument.
Getting back to my problem: @avehtar, I started down this path at your suggestion that I could use this approach instead of model selection. But once I have selected the best predictor terms for my model, I need to deviate from the vignettes to use those terms for parameter interpretation. Is it simply enough to rerun the model using the identified terms?

I missed this earlier (you tried to tag me, but missed one letter in my username).

  1. For the parameter interpretation you need to take into account causal assumptions and possible collinearities.
  2. If you want to be careful, we recommend doing the interpretation with the reference model (the big model). Although the projection has a nice decision-theoretic interpretation, the actual implementation can differ from the theory, and we don’t yet have a diagnostic that would make us comfortable recommending the projected posterior for interpretation (assuming the causal assumptions allow that). For posterior interpretation (again, assuming the causal assumptions allow it), the posterior of the refitted model with the selected variables is likely to be safer than the projected posterior (the variable selection procedure itself is stable), but we still recommend making any causal interpretations using the reference model (a minimal refitting sketch follows below).
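A minimal refitting sketch (placeholder names; soltrms_lat_final as in the vignette, and full_data standing for the complete data set for one species):

library(rstanarm)
# Refit using only the selected terms; interpret this posterior with the causal
# caveats above in mind (object names are placeholders).
refit_formula <- reformulate(soltrms_lat_final, response = "har")
refit_nebin <- stan_glm(
  refit_formula,
  family = neg_binomial_2(),
  data = full_data,
  seed = 7304
)
summary(refit_nebin)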

@avehtari

Thanks for the tips. This approach seems to work very well for most of the species I am analyzing, but a few species have issues with the reference model. This is particularly the case for the species with fewer observations.

For example, for this species the mlpd plot:

[figure: mlpd plot]

and with ‘deltas = FALSE’:

[figure: mlpd plot with deltas = FALSE]

suggest that the reference model is not good. Are there other ways I can improve the reference model? (I can’t add more observations.) Here is the code for the reference model (following the Latent projection predictive feature selection example):

D <- 12          # number of candidate predictors
p0 <- 6          # prior guess for the number of relevant predictors
mu_prior <- 100  # rough prior guess for the response mean (enters the pseudo-variance term)
# N is the number of observations in `train`; tau0 is the global scale of the
# regularized horseshoe prior, following the latent-projection vignette
tau0 <- p0 / (D - p0) / sqrt(mu_prior) / sqrt(N)

refm_fit_nebin <- stan_glm(
  formula = har ~
    p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9 + p10 + p11 + p12,
  family = neg_binomial_2(),
  data = train,
  prior = hs(global_scale = tau0, slab_df = 100, slab_scale = 1),
  seed = 7304, QR = TRUE,
  adapt_delta = 0.99,
  iter = 15000,
  cores = 4
)

@fweber144

I just saw this note: “For the sake of simplicity, we won’t adjust tau0 to this new family, but in a real-world example, such an adjustment would be necessary. However, since Table 1 of Piironen and Vehtari (2017) doesn’t list the negative binomial distribution, this would first require a manual derivation of the pseudo-variance σ̃².”

Has the pseudo-variance for the negative binomial distribution been derived since you wrote this article?

I interpret the plots as showing that the reference model is useful. The null model is much worse, and you need many variables before reaching the reference model’s performance.

“Has the pseudo-variance for the negative binomial distribution been derived since you wrote this article?”

No, unfortunately not, at least to my knowledge. If you don’t want to derive it on your own (or if it’s not possible; I haven’t tried), you could try using the pseudo-variance of the Poisson distribution (third line in Table 1 of Piironen and Vehtari, 2017). But that would only be an approximation.
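For what it’s worth, with the Poisson pseudo-variance the global scale would be computed roughly like this (my sketch, reusing the objects from the reference-model code above):

# Sketch (my reading of Piironen & Vehtari, 2017): tau0 = p0 / (D - p0) * sigma_tilde / sqrt(N),
# using the Poisson (log link) pseudo-variance sigma_tilde^2 = 1 / mu as an
# approximation for the negative binomial. With mu approximated by mu_prior, this
# gives the same value as the tau0 line in the reference-model code above.
sigma_tilde <- 1 / sqrt(mu_prior)
tau0_poiss_approx <- p0 / (D - p0) * sigma_tilde / sqrt(N)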

Looping back to this comment, I have some clarifying questions.

I assume by ‘the reference model (the big model)’ you mean the model with all of the predictors (for which the vignette uses the training data). To clarify, I randomly sampled from my observations so that half of the observations were used as the training data and half as the test data.

Are you saying that, for posterior interpretation, refitting the model with the selected variables is safer than projecting the posterior? If so, do you use all of the observations for the refit, or only the test data? I am also having a hard time understanding how the refitted posterior with the selected variables relates to making causal interpretations with the reference model, since the reference model has all of the variables rather than only the selected ones.

Yes. Then we need to worry only about the selection-induced bias (which is reduced by projection predictive selection), but not about the error from the approximate projection. There is some discussion in [2306.15581] Robust and efficient projection predictive inference.

I always use all the data, in combination with cross-validation to assess the out-of-sample predictive performance. See also [2306.15581] Robust and efficient projection predictive inference.
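For example (a sketch with placeholder names; refm_fit_all would be the reference model fitted to all the data for one species):

cvvs_all <- cv_varsel(
  refm_fit_all,
  cv_method = "kfold", K = 10,   # K-fold CV assesses out-of-sample performance
  latent = TRUE,
  latent_ll_oscale = latent_ll_oscale_nebin,
  latent_ppd_oscale = latent_ppd_oscale_nebin
)
plot(cvvs_all, stats = "mlpd", deltas = TRUE)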

We primarily recommend doing the causal interpretation using the big model, which has been constructed taking the causal constraints into account. Variable selection can be done to find irrelevant variables, but depending on the causal structure, the variable selection is valid only for some covariates. If you have done the variable selection honoring the causal constraints, then you could use the smaller model for the causal interpretation, but there is no benefit from using the smaller model for it.