This is my first time trying to use regularised regression. I have two data sets: I want to build a predictive model (with cross-validation) on the first and then validate it on the second. The first data set has ~670 rows (complete cases) and 155 predictors. I have removed any predictors with an absolute pairwise correlation > 0.9 with each other (but quite a few remaining pairs are still close to 0.9).
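In case it's useful, the correlation filter looked roughly like this (just a sketch: `df_raw` is a placeholder for my complete-case data, and findCorrelation() is from the caret package):

```r
library(caret)

# absolute pairwise correlations among the candidate predictors
cor_mat <- cor(df_raw)

# columns to drop so that no remaining pair has |correlation| > 0.9
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
df <- if (length(drop_idx) > 0) df_raw[, -drop_idx] else df_raw
```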
I am trying to follow the instructions from projpred: Projection predictive feature selection • projpred to build a regularised predictive model.
I seem to get stuck already at the first step: I've built my model, but it keeps producing divergent transitions.
```r
# note: I have scaled the predictors in df; the outcome (psych) is the last column
x.formula <- as.formula(paste0(psych, ' ~ ', paste(names(df)[-ncol(df)], collapse = ' + ')))

fit <- brm(formula = x.formula,
           data = df,
           family = gaussian(),
           prior = c(prior(normal(0, 1), class = 'Intercept'),
                     # par_ratio = 0.1: assuming ~10% of regressors might be relevant
                     prior(horseshoe(par_ratio = 0.1), class = 'b')),
           control = list(adapt_delta = 0.99, max_treedepth = 15),
           chains = 3, iter = 5000,
           seed = 100)
```
I have ~20 divergent transitions. The Rhats are OK. Is it OK to continue with the next projpred steps, or do I need to get rid of the divergent transitions first? I see from the manuals that tuning adapt_delta or max_treedepth further is unlikely to help. I am also not sure I've set par_ratio correctly. Is there anything else I can try to tune?
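For completeness, this is how I'm counting the divergences and checking Rhat (a short sketch using functions exported by brms/bayesplot):

```r
# count divergent transitions across all post-warmup draws
np <- nuts_params(fit)
sum(subset(np, Parameter == 'divergent__')$Value)

# largest Rhat across all parameters (these all look fine for me)
max(rhat(fit), na.rm = TRUE)
```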
I'd be very grateful for any pointers.
(As a separate question: I think projpred cannot currently be used with imputation, is that correct? I'm losing some rows and some regressors; I'd have ~700 rows and 170 regressors if I could include the cases with missing values, and they are likely not missing at random. I think I will also need to use some other modelling approaches that can handle imputation, to make sure the results are at least similar… A sketch of what I mean is below.)
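For the imputation side, I was thinking of something along these lines with mice and brm_multiple, just as a sanity check alongside projpred. Only a sketch: `df_full` (the version of the data that still contains the missing values) is a placeholder, and as far as I understand the pooled brm_multiple fit could not then be fed into projpred:

```r
library(mice)

# multiply impute the data set that still contains missing values
imp <- mice(df_full, m = 5, seed = 100)

# fit the same model on each imputed data set; brms pools the draws
fit_mi <- brm_multiple(x.formula, data = imp,
                       family = gaussian(),
                       prior = c(prior(normal(0, 1), class = 'Intercept'),
                                 prior(horseshoe(par_ratio = 0.1), class = 'b')),
                       chains = 3, iter = 5000, seed = 100)
```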
Edit: I've come across several further questions I'm unsure about:
- could I simply replace the horseshoe with another sparsifying prior that does not produce divergent transitions?
- if the original reference model is not better than a null model, I'm assuming it does not make sense to run projpred?
- if I want to validate my (cross-validated) model on new data, is the best approach to get predictions with proj_linpred? And then I'm not sure how to compare them to the new y: should I correlate the values? Repeat the prediction 10,000 times with a random shuffle of the y-values? Or compare against an intercept-only model (using elpd)? (A rough sketch of what I mean is after this list.)
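To make that last question concrete, here is the kind of comparison I have in mind. This is only a sketch: `nsel` and `newdf` are placeholders, and I'm assuming proj_linpred() returns the pointwise log predictive density (lpd) when newdata contains the response:

```r
# project the reference model onto the selected submodel (after cv_varsel())
proj <- project(fit, nterms = nsel)

# predictions + pointwise log predictive densities on the held-out data
pl <- proj_linpred(proj, newdata = newdf, integrated = TRUE)
elpd_proj <- sum(pl$lpd)

# intercept-only benchmark, fitted on the original training data
fit0 <- brm(as.formula(paste0(psych, ' ~ 1')), data = df,
            family = gaussian(), seed = 100)
ll0 <- log_lik(fit0, newdata = newdf)
elpd_null <- sum(matrixStats::colLogSumExps(ll0) - log(nrow(ll0)))

elpd_proj - elpd_null  # > 0 would favour the projected submodel
```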