Implementing projection predictive feature selection for a custom model with `projpred`

I’m trying to familiarize myself with the projection predictive approach and the projpred package, and I'm thinking about implementing projpred methods for a particular custom model class. However, the documentation on how to do this is somewhat sparse, scattered, and perhaps outdated (init_refmodel example with custom model · Issue #125 · stan-dev/projpred · GitHub). Any suggestions on how to work out the minimal structure needed for this, or should I just try to mimic the implementations for brms and/or rstanarm fits? Or are there cases where it would be more reasonable to implement the whole method from scratch for specific model types, following the source literature? The models I am thinking of are essentially constrained linear regression models.

Tagging @fweber144 and @avehtari, as they seem to be actively working on projpred on GitHub and are also present here.

The documentation for the use of the latent projection with custom families is pretty comprehensive and has a nice worked example for the negative binomial.

Thanks! I didn’t think to check that page.

Hi @helske, there are two possibilities for what you could mean by “custom model class” above:

  1. A “custom” family in the sense of a response family that is not supported by the traditional projection or the augmented-data projection.
  2. A custom reference model object, as explained in the vignette (projpred: Projection predictive feature selection • projpred). This is what the GitHub issue you linked (init_refmodel example with custom model · Issue #125 · stan-dev/projpred · GitHub) referred to.

In the first case, I hope the reply by @rtnliqry helps.

In the second case, there is an example on the documentation page “Reference model and more general information — refmodel-init-get • projpred” which might help (the vignette section linked above also refers to that example).

Would it be possible to use a CmdStanR model object (e.g., a Bernoulli model with joint missing-data imputation) as the reference model in projpred?

Projection is based on minimizing the KL divergence from the reference model's predictive distribution to the constrained model's predictive distribution, separately for each reference posterior draw. For many data model distributions, this is equivalent to, or can be approximated by, optimizing the constrained model's parameters given the mean of the reference model's prediction for each reference posterior draw. In the case of joint missing-data imputation, each reference posterior draw also includes a draw from the missing-data distribution. The optimization approach to minimizing the KL divergence does not work well for these latent data parameters. We could approximate by keeping the latent data parameters fixed and optimizing only the other parameters, but this would be the same as using a multiple-imputation approach, which would be a big task to add, as discussed in a GitHub issue.
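To make the per-draw idea concrete, here is a minimal sketch in plain Python. It is not projpred code and does not use projpred's API. It assumes a Gaussian observation model and a submodel with an intercept and a single predictor `x`, in which case minimizing the KL divergence for one draw reduces to a least-squares fit to the reference model's mean prediction, and the projected residual standard deviation absorbs the lack of fit. The function name `project_gaussian_draw`, the toy data, and the made-up reference draws are all hypothetical and only serve the illustration.

```python
import math


def project_gaussian_draw(x, mu_ref, sigma_ref):
    """Project one reference posterior draw onto the submodel mu = a + b * x.

    Assumption: Gaussian observation model. Under that assumption the
    KL-minimizing coefficients are the least-squares fit to the reference
    mean prediction `mu_ref`, and the projected variance is the reference
    variance plus the mean squared difference between the two means.
    """
    n = len(x)
    x_bar = sum(x) / n
    mu_bar = sum(mu_ref) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (mi - mu_bar) for xi, mi in zip(x, mu_ref))
    b = sxy / sxx                      # projected slope
    a = mu_bar - b * x_bar             # projected intercept
    mse = sum((mi - (a + b * xi)) ** 2 for xi, mi in zip(x, mu_ref)) / n
    sigma_proj = math.sqrt(sigma_ref ** 2 + mse)
    return a, b, sigma_proj


# Hypothetical toy data: the reference model uses x and z, the submodel only x.
x = [0.0, 1.0, 2.0, 3.0]
z = [1.0, -1.0, -1.0, 1.0]

# Made-up reference posterior draws: (intercept, coef of x, coef of z, sigma).
ref_draws = [(1.0, 2.0, 3.0, 4.0), (0.5, 1.5, 0.0, 1.0)]

projected = []
for alpha, beta, gamma, sigma in ref_draws:
    # Mean prediction of the reference model for this draw.
    mu_ref = [alpha + beta * xi + gamma * zi for xi, zi in zip(x, z)]
    # Each draw is projected separately.
    projected.append(project_gaussian_draw(x, mu_ref, sigma))

for a, b, s in projected:
    print(f"a={a:.2f}, b={b:.2f}, sigma_proj={s:.2f}")
```

In this sketch every reference draw is fully described by its mean prediction and `sigma`. With joint missing-data imputation, each draw would additionally carry imputed data values, and the projection step shown here has no natural way to optimize over those.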

Concerning your question whether a cmdstanr model object can be used: I think init_refmodel() should allow this, but I haven’t tested it yet.
