Hi, I’m using the projpred to find a smaller submodel that predicts similarly to a full regularized horseshoe model with 31 predictors. I used the cv_varsel() function to fit the submodels & compare them using PSIS-LOO. Now, when I use the suggest_size() function, it suggests 12 variables as the optimal number of variables to approximate the predictions of the full model. However, when I plot the trajectory of the selection using the varsel_plot() function, I get the following plot:

Does the model start to overfit after 6 predictors & give worse out-of-sample predictions as a result (and therefore the optimal number of variables in the submodel would actually be 6)? Or is there something wrong-er going on?

Yes, I would interpret this as overfitting. I’ve never found suggest_size to be particularly helpful, as it returns only a number, and often, as in your case, it can pick a more complex model than perhaps necessary. The graph is much clearer and lets you judge better the trade-off betweem model size and predictive performance.

I would not interpret this as overfitting. Overfitting could be part of the problem, but that term oversimplifies the issue and it’s not even well defined term (or if you have a good definition please tell).

My guess is correlating covariates with outliers in covariate values, and the model is misspecified and not modeling them appropriately. varsel_plot shows cross-validation results, and cross-validation can have some bias in case of misspecified models. I recommend to check correlation plot for the covariates and in more detail check what covariates are in the group from 7-12 and if they have suspicious values.

We use suggest_size() mostly when we need to run 1000 simulation experiments and want to automatically select the size. Otherwise we highly recommend to look at the plots as there might be implicit preference for a smaller model and the user can then decide how much predictive performance they are willing to sacrifice. The current varsel_plot() and suggest_size() are suffering a bit from the behavior of cross-validation when comparing models which are very similar. We are in progress of writing a paper describing the issue in more detail, and we have a better idea for suggest_size() which will hopefully be more robust.

I may be able to provide additional comments if you tell also

loo() output for the dull model

did you normalize covariates and target?

how did you select the regularized horseshoe prior parameter?

So, if I understand correctly, you suggest that the weird behavior may be the result of collinearity? I’ve checked bivariate correlations between the problematic predictors & there’s few moderately high ones, with the highest one being 0.63. Could that be it? If so, is there a way to address that, perhaps with more regularization? In terms of suspicious values, I’ve double checked the model summary and mcmc_areas() plot & I can’t see anything unreasonable, as in there aren’t any slopes much larger than I would expect based on background knowledge (is that what you meant)?

As for your questions:

I did scale and center all continuous predictors

I set 1 degree of freedom for the local student-t shrinkage of the horseshoe prior, for everything else I left the default

Here’s the loo() output for the full model:

Computed from 8000 by 593 log-likelihood matrix
Estimate SE
elpd_loo -626.6 19.6
p_loo 24.5 2.0
looic 1253.1 39.1
------
Monte Carlo SE of elpd_loo is 0.1.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Not just the collinearity. Look also how the marginal distribution of the covariate values look like. If there are strange values, then including one of the correlating variables can throw the predictions off, but when you include many correlating values then effect of the strange values in each individual covariate will go down.

Not more regularization, but more modeling. If it’s about strange measurement errors in covariates, you should include measurement error model. There can be also cases, where you just need to include group of correlating covariates for better performance.

No, look at the covariate values. Are there values which are marginally or jointly far away from the other values?

You should set the global scale parameter as instructed in https://projecteuclid.org/euclid.ejs/1513306866
For example, if you assume a priori that there could 5 practically non-zero co-efficients then use

p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)

where p is the total number of covariates and n is the number of observations. With your p and n and p0=5, the default global scale is not far away from what the above formula gives.

p_loo value combined with the information that there are correlating covariates, indicates that most covariates are relevant, so 12 covariates from projpred is not that surprising.

As all Pareto k estimates < 0.5, there are no really bad outliers, but then you have 20 times more observations than covariates which makes the posterior easier. projpred might work equally well with Gaussian prior.

I’m sorry I’m still not very well-versed in the terminology, but I’ve looked at the marginal posterior distributions of all predictors individually, and at the pairwise joint marginal distributions as well. The marginal distributions for single predictors look reasonable to me, and I can’t find any joint pairwise posteriors that would not overlap zero (while individual predictors do). Is that the right way to go about it?

I’ve tried to fit the model with a normal(0, 1) prior (a “Bayesian ridge”, the way I understand it), and it seems to give a little more reasonable looking selection trajectory plot:

Additionally, the suggest_size() output is now 5. However, cv_varsel() now gives the following message:

Warning messages:
1: posterior_linpred(transform = TRUE) is deprecated.Please use pp_expect() instead.
2: Quick-TRANSfer stage steps exceeded maximum (= 400000)
3: Quick-TRANSfer stage steps exceeded maximum (= 400000)

Is that something to worry about?

Also, I have a test set that I can test the submodel predictions on, is that something I should do now or should I wait before I get a more well-behaved feature selection?

No problem, I’ll keep trying to use more explicit sentences.

Don’t look at the posterior. I asked you to look at the data. The data is the one you have before using Stan. I said “covariate values”, you used in the topic title “variable” instead of “covariate”, but that can lead to confusion as the model can also have “variables”. In the above you use “predictor” which is same “covariate” and good word for that, too. The predictor is data and that doesn’t have posterior. The model has coefficients or parameters (or weights) corresponding to the predictors. These coefficients/parameters have posterior. Now look at the data and not the posterior.

This can happen since n>>p and the predictors are correlating.

It’s difficult to see from the plot whether the full model performance is similar with RHS or normal prior?

First warning not. 2. 3. warning I don’t know, but @AlejandroCatalina might know the reason.

I think they are reasonably well behaved. In both cases there is similar bump in the curve, and it could bee good to check those predictor values.

These happen when clustering the data points with kmeans and some of the data points are so close together that the algorithm does not converge in the maximum limit of steps. I don’t think these are a reason to worry in this particular case.

Nonetheless, I’ll keep that in mind to make the kmeans step a bit more robust now that we are working on improving projpred in general!

I’ve double checked the data. I noticed there was one continuous covariate that I missed in my data pre-processing pipeline and didn’t standardize (very stupid mistake). I’ve fixed it and now the selection trajectory looks like this:

The trajectory looks smoother, there’s still the same bump but it’s smaller now. The suggest_size() output is still 5. The unstandardized covariate did not appear until late in the feature selection trajectory, and it doesn’t now either, but I suppose it might have been messing with the parameter space.

As for looking at pairwise correlations between the covariates, there are few pairs of moderately highly correlated covariates, and as far as I can tell from looking at the marginal posteriors, what the model seems to do with each pair is to choose the stronger covariate & squeeze the other one towards 0. To me, that seems like a desirable behavior, at least for the purpose of building the lighter submodel with projection. Is that reasonable or is the collinearity something to worry about?

Yes, based on LOO ELPD and RMSE (on the training data), the horseshoe performed about the same as the gaussian model does.

The bump starts to appear after a somewhat unbalanced binary covariate (160 obs in one level, 433 in the other) is included (covariate #7). The other covariates before, and many after, are continuous. Could that be it?

projpred separates 1) making the best model with all covariates and 2) finding the smaller submodel with projection conditioanl on the best possible model from 1. When doing 1, we don’t need to try to squeeze many coefficient marginals towards 0 to make 2 to work well. When doing 1, we should use the best possible model and when we have correlating variables RHS is not exactly the correct model even if it can produce very good models. The reason why we use and recommend RHS even in the case of correlating covariates is that if there are only a small proportion of relevant covariates RHS is good for that and adding the extra part taking into account the correlations between those relevant covariates is complicated and based on the experiments has most of the time quite small effect on the model performance. If in your case Gaussian and RHS prior produce as good models in 1, then there shouldn’t be much difference in 2.

This was expected. You didn’t have very large number of covariates and you had n>>p, so you are not yet in the regime where you could more easily see differences between RHS and Gaussian prior.

Probably. Adding a binary covariate can changesdistribution of individual predictive density values more and cross-validation can have higher variance for added binary covariates.

Thank you so much for your time @avehtari! I’ve learned a lot, and I really appreciate the effort you go into in making awesome tools like projpred and helping other people to use them. Thank you @AlejandroCatalina as well for getting back to me about the error messages!