This is my first time trying to use regularised regressions. I have two data sets. I want to build a predictive model (with cross-validation) on the first and then validate it on the second data set. In the first data set, I have ~670 rows (complete cases) and 155 predictors. I have removed any predictors with an absolute pairwise correlation > 0.9, but quite a few pairs are still close to 0.9.
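In case it matters for the answers, here is roughly how I did the correlation filtering — a minimal sketch, assuming `df` holds only the numeric predictors and using `caret::findCorrelation` as one common helper:

```r
library(caret)

# Pairwise correlations among the candidate predictors
cor_mat <- cor(df)

# Indices of predictors to drop so that no remaining pair has |r| > 0.9
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
df_reduced <- df[, -drop_idx, drop = FALSE]
```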
I am trying to follow the instructions from the projpred vignette ("Projection predictive feature selection") to build a regularised predictive model.
I seem to get stuck already at the first step: I've built my model, but it keeps producing divergent transitions.
```r
# note: I have scaled the df
x.formula <- paste0(psych, '~',
                    paste(names(df)[1:(length(names(df)) - 1)], collapse = '+'))

fit <- brm(formula = x.formula,
           data = df,
           family = gaussian(),
           prior = c(prior(normal(0, 1), class = 'Intercept'),
                     # assuming 10% of regressors might be significant?
                     prior(horseshoe(par_ratio = 0.1), class = 'b')),
           control = list(adapt_delta = 0.99, max_treedepth = 15),
           chains = 3, iter = 5000, seed = 100)
```
I get ~20 divergent transitions. The Rhats are ok. Is it ok to continue with the next projpred steps, or do I need to get rid of the divergent transitions first? I see from the manuals that tuning adapt_delta or max_treedepth further is unlikely to help. I am also not sure I've set par_ratio correctly. Is there anything else I can try to tune?
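On the par_ratio question, my understanding from the brms horseshoe documentation is that par_ratio is the expected *ratio* of non-zero to zero coefficients, not the expected fraction of non-zero coefficients — so here is how I've been reasoning about it (the `p0 = 15` guess is mine, not from any reference):

```r
# par_ratio = p0 / (D - p0), where p0 is the guessed number of
# truly relevant predictors and D is the total number of predictors
D  <- 155   # candidate predictors after correlation filtering
p0 <- 15    # my guess: ~10% of regressors are relevant
par_ratio <- p0 / (D - p0)
par_ratio   # ~0.107, so 0.1 seems in the right ballpark
```

Please correct me if that is not how par_ratio is meant to be set.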
I’d be very grateful for any pointers.
(As a separate question: I think projpred cannot currently be used with imputation, is that correct? I'm losing some rows and some regressors; I'd have ~700 rows and 170 regressors if I could include those with missing values, and they are likely not missing at random. I think I will also need to use some other modelling approaches that can handle imputation, to make sure the results are at least similar…)
Edit: I’ve come across several further questions I’m unsure about:
- could I just replace the horseshoe with other priors that would not produce divergent transitions?
- if the original model is not better than a null model, I'm assuming running projpred does not make sense?
- if I want to validate my (cross-validated) model on new data, is the best way to get predictions with proj_linpred? And then I'm not sure how to compare them to the new y — should I correlate the values? Repeat the prediction 10,000 times with random shuffles of the y-values? Or compare against the predictions of an intercept-only model (using elpd)?
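For that last point, this is roughly what I had in mind — a sketch only, and I may well be misusing the API. It assumes `prj` is a projection from `project()`, `newdat` contains the new predictors plus the observed response column `y` (`y` and `newdat` are placeholder names), and `df` is the original data:

```r
library(projpred)
library(brms)

# Evaluate the projected submodel on the new data set; when newdata
# includes the response, proj_linpred also returns pointwise log
# predictive densities (lpd)
pl <- proj_linpred(prj, newdata = newdat, integrated = TRUE)
elpd_new <- sum(pl$lpd)

# Intercept-only reference fit to the original data, evaluated
# on the same new rows
fit0 <- brm(y ~ 1, data = df, family = gaussian())
ll0  <- log_lik(fit0, newdata = newdat)        # draws x observations
elpd0 <- sum(log(colMeans(exp(ll0))))          # pointwise lpd, summed

elpd_new - elpd0   # positive would favour the projected submodel
```

Is a comparison like this a sensible way to validate on the second data set, or is one of the other approaches (correlation, permutation of y) preferable?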