Thank you for developing the projpred package. I am trying to apply variable selection with the projpred R package to a Poisson model that I fit in brms. The model includes several ordinary covariates (and possible interactions) and a CAR (conditional autoregressive) component for the spatial dependence between US counties. When I tried to perform the variable selection on the brms fit, I got an error message saying that this is not yet implemented for brms. Do you know if it is implemented in Stan, or how I should handle the CAR component? I saw that you wrote an article, “Projection predictive model selection for Gaussian processes”, and if I am correct, CAR models are a form of Gaussian process. I want to do variable selection only for the ordinary covariates, but I think I should include the CAR component to adjust for spatial confounding.
projpred needs to know some things about the models. In the case of rstanarm and brms, projpred knows enough for a set of model classes including normal/generalized/additive/hierarchical linear models. In theory, a CAR model should be similar to the hierarchical models that are already supported, but even then, adding support for CAR requires some coding, and we have limited resources. It is also possible that implementing the projection for CAR requires taking into account something special about CAR models, so it would need a bit of thinking and experimenting, too. If you are interested only in the variable selection and don’t need to project the CAR part, you could implement the get_refmodel and init_refmodel functions yourself, but as this is not the simplest case, I understand if it is beyond your current skill set. I’m pinging @fweber144 and @AlejandroCatalina in case they have something to add.
As projpred support for CAR models will not be available quickly, I’m also checking whether you really need projpred or whether some other approach would work. How many covariates do you have? How many interactions? How many observations? What is the purpose of the variable selection?
Thank you for the clarification. I will have a look at the get_refmodel and init_refmodel functions.
I have about 60 variables, some of which are highly correlated. Additionally, several variables form groups (e.g., a pesticide group consisting of three single pesticides). I am not sure how best to deal with the interactions. There are multiple plausible interactions: is it a good approach to include all of them in the first step, or what is the generally suggested approach for interactions? I have 159 observations (counties), and many of the counts in my Poisson model are zero, so I was trying to fit a zero-inflated Poisson model. The purpose of the variable selection is to identify all relevant variables and interactions for the outcome. I read in your article that this was not the primary purpose of projpred (as opposed to variable selection for finding a minimal set of variables), but that it would still work well in this case.
With 60 variables plus interactions and only 159 observations, this is a challenging task. Did you use something like a horseshoe or R2D2 prior for the coefficients? For the interactions, it would be good to use a prior that allows a large interaction effect only if the corresponding main effects are large. projpred also performs better when the reference model uses good priors, so it’s good to start from there.
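For illustration, a minimal sketch of setting such shrinkage priors in brms (the data frame `d`, outcome `y`, and covariates `x1`–`x3` are hypothetical placeholders; the hyperparameter values are only starting points and would need tuning for your problem):

```r
library(brms)

# Regularized horseshoe prior on the population-level coefficients.
# par_ratio is the assumed ratio of relevant to irrelevant coefficients,
# e.g. expecting roughly 5 of 60 variables to matter: 5 / 55 ≈ 0.1.
fit_hs <- brm(
  y ~ x1 + x2 + x3,
  data   = d,
  family = poisson(),
  prior  = prior(horseshoe(par_ratio = 0.1), class = b)
)

# Alternatively, the R2D2 shrinkage prior, parameterized via a prior
# guess of the explained variance R^2 (mean_R2) and its precision.
fit_r2d2 <- brm(
  y ~ x1 + x2 + x3,
  data   = d,
  family = poisson(),
  prior  = prior(R2D2(mean_R2 = 0.3, prec_R2 = 3), class = b)
)
```

Both `horseshoe()` and `R2D2()` are prior functions provided by brms for the `b` coefficient class; see `?brms::horseshoe` and `?brms::R2D2` for the full set of hyperparameters.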
I think @avehtari has mentioned all major points, so I don’t have much to add.
I can help with init_refmodel() if necessary. In that case, a reproducible example would be good.
For the projection of CAR components, you are welcome to create a feature request issue on projpred’s issue tracker. However, as mentioned by @avehtari, this is not likely to be implemented in projpred soon.
In my understanding, using projpred for complete variable selection (as opposed to minimal subset variable selection) is not trivial. Pavone et al. (2022) have conducted experiments for this. Which article did you refer to?
Pavone, F., Piironen, J., Bürkner, P.-C., & Vehtari, A. (2022). Using reference models in variable selection. Computational Statistics. DOI: 10.1007/s00180-022-01231-6
I have not implemented any variable selection yet because I wanted to decide on the theoretical approach first. I will definitely try out the suggested priors. Is there a method to take into account that multiple variables belong to a bigger group (e.g., 3 pesticides all belonging to a pesticide group) in the variable selection?
Thank you for offering your help with init_refmodel(). I will try to get started and come back to your offer.
I referred to the article “Projective inference in high-dimensional problems: Prediction and feature selection” by Juho Piironen, Markus Paasiniemi, and Aki Vehtari, where the authors say: “The empirical evidence indicates that the reference model approach could be highly useful also in this problem setting since it tends to help rank the truly relevant features before the irrelevant ones”. I will look more closely at the experiments you mentioned. Would you still recommend projpred in this case, or what other approach would you suggest?
The search_terms and penalty arguments of varsel() and cv_varsel() might be helpful for that.
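As a hedged sketch of the search_terms idea (the variable names `pest1`–`pest3`, `x1`, `x2`, and the object `refmodel` are hypothetical placeholders; see `?varsel` for details): if the pesticide variables appear in the candidate terms only as a combined term, the forward search treats them as a single unit that enters or stays out together.

```r
library(projpred)

# Candidate terms for the forward search: the three pesticide
# variables are offered only as one combined term, so they are
# selected (or not) as a group.
st <- c("pest1 + pest2 + pest3", "x1", "x2", "x1:x2")

# refmodel would be a reference model object, e.g. obtained via
# get_refmodel() from a supported reference model fit.
vs <- varsel(
  refmodel,
  method       = "forward",
  search_terms = st
)
```

The penalty argument, by contrast, works at the level of individual coefficients (e.g., setting a penalty of 0 forces a term into the submodels), so search_terms is probably the more natural tool for your grouping question.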
Ah ok. The subtle point, however, is that they refer to the more general reference model approach, not to projpred specifically. The reference model approach is one of several important aspects of projpred and was later investigated in a more general framework by Pavone et al. (2022), also with respect to complete variable selection. So I guess Pavone et al. (2022) tackled the omitted part of the citation: “but the topic requires more research.” I don’t want to say that complete variable selection is impossible with projpred, but as can be seen from the iterative projpred procedure in Pavone et al. (2022, section “4.1 Iterative projections”), it requires a quite sophisticated approach.
Thank you for pointing that out. So would it be a good approach to compare the results of the reference model approach with projpred in my setting against other methods for complete variable selection, such as the local false discovery rate?
You mean that you want to use methods other than projpred (that are made for complete variable selection) as a “gold standard” to see whether the projpred results can be interpreted as complete variable selection results? But in that case, why would you need (the non-iterative) projpred at all? Apart from that, if you have correlated predictors, projpred will likely give a different (sparser) solution, so in that case, I’m not sure whether a comparison makes sense in the first place.
Also, depending on your model, implementing the local FDR approach might not be that easy.