Projpred interpretation


I am working on a project where we try to identify relevant predictors of a specific outcome.
We have 23 candidate predictors, among which there are 2-3 that we know by theory have to be associated with the outcome. Sample size is 73.

Before I decided to use projpred, I first ran some simple simulations to make sure that projpred can identify relevant predictors, given this low sample size and given that potentially 90% of the predictors are just random noise. It all worked out fine as projpred identified relevant predictors the majority of simulation runs.

Now, I ran the actual analysis and it shows very unexpected results (see figure 1). This is basically the opposite of what you would expect as this suggests the model with 0 predictors would perform best predictions on average. Since I know that 2-3 variables should be useful predictors I ran the analysis again only with those predictors included. Then you can see that the results are more as expected as RMSE or ELPD show improved model fit with increasing number of predictors even though the SD is very high.

I know that overfitting should cause in theory a model with more predictors to perform worse. But I was hoping that projpred would figure this out and propose a model with those 2-3 predictors as the best model nevertheless. E.g. as you can see in figure 2, if I only include 3 of the predictors, it suggest that the RMSE is lower with the 3 predictors than any of the models shown in figure 1.

Any idea what could have gone wrong @avehtari ?
How can I make sure that projpred finds those predictors that are relevant, even if the effect size is low and there are several irrelevant or noisy variables in the dataset?

1 Like

Looks strange, maybe a bad model misspecification and also a hard to predict target. Can you tell more about the model?


Hi @avehtari ,

thanks for getting back. For some reason I did not receive an email (as usually when somebody replies) so I thought nobody answered. I will update you what I found out:

  1. I simulated once more a dataset with similar N and then set all the \beta in the simulation to 0. In that case I find: The RMSE is very similar across submodels but increases slightly if more variables are included.

So that would not explain everything because RMSE is almost similar across models whereas in my case it increases much more when including more variables.

  1. I think it probably has to do with multiple imputation. I used the function brm_multiple which takes as input a list of imputed datasets. If I use instead the function brm and run it at each single imputed data set at a time, then this issue disappears and the RMSE first decreases to a minimum before it increases slightly, just as in your tutorial.

To test this idea, I went on to include multiple imputations in my simulations and indeed, if the missingness is high enough and I used multiple imputation (using mice in combination with brm_multiple), a similar picture emerges frequently.

So, it seems the combination of little predictive power and multiple imputation is the problem here. Please let me know if you have different thoughts on that. I would go on searching and provide you more info on data and model. But if this could be the issue I thought I would do the following to report the results: I still can use brm_multiple to estimate the vector of \beta. But to identify relevant variables with projpred I need to run projpred multiple times using the brm function and then I will consider those variables as important that are identified by projpred with a certain consistency across the imputed datasets (not sure yet how to define that but l could report the consistency for each variable).

Please let me know if you have any thoughts/questions. Thanks

1 Like

Very likely. In your first message, you didn’t mention use of imputation.

We have not tested projpred with this. It would be useful to create a new issue at Issues · stan-dev/projpred · GitHub, so that a) we can add informational message that this is not yet supported and b) think if it is feasible to support thus.

This is the correct solution. When running with one data, you can also get different search paths depending on the cross-validation fold and it is possible to report the probabilities of including certain variable at certain model size.

1 Like