Correlated predictors vs fewer predictors with small dataset?

Hi there,

I wonder if you might be able to help me with a variable selection conundrum I'm having when the dataset is fairly small and a couple of the predictors are highly correlated.

I am fitting a Bayesian mixed-effects model in brms on a dataset of n = 85. I have 7 predictor variables I would ideally like to include. Two of these predictors, human population density and forest cover, are highly correlated (r = -0.8), but the mechanisms by which they might affect the response variable are arguably different. Therefore I would ideally include both in the model, although this makes interpreting the estimates tricky.
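For concreteness, a minimal sketch of the kind of model in question (the response, predictors, and grouping factor here are placeholder names, not the actual variables):

```r
library(brms)

# Sketch only: 'response', 'pop_density', 'forest_cover', the remaining
# predictors x3..x7, and the grouping factor 'site' are placeholders.
fit_full <- brm(
  response ~ pop_density + forest_cover + x3 + x4 + x5 + x6 + x7 +
    (1 | site),
  data   = dat,
  family = gaussian()
)
```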

Including both in the model is what I have done so far. However, given that I only have 85 data points, I am now wondering whether I should remove one of these correlated variables (still stating the correlation between them during interpretation), considering the 'rule of 30 data points per predictor'.

So I am trying to balance the number of predictor variables (given my sample size) against making inferences about a greater number of predictors.

Any thoughts about the most appropriate decision to make here?

Many thanks!

Have you looked into outcome-blinded dimension reduction techniques such as clustering, principal components, etc.?

@Deronda hello, I’d say this depends on what you’re trying to do here. If the purpose of the modelling is to obtain evidence about the nature of the effects of each of human population density and forest cover, then I think there is no solution outside of providing more information. The lack of interpretability of the estimates is a natural result of the lack of unambiguous information about independent effects of these predictors. To me the question here is more about the collinearity, rather than simply the number of variables.

@js592 Hi there. I did look into this, but several sources I read opposed the use of PCA etc. for selecting variables, which put me off.

Hi @AWoodward - thanks for this! I agree, but what about in circumstances where one cannot provide more information to the model?

It depends on what your goals are. Outcome-blinded PCA/other clustering techniques (or even just fitting the originally proposed model) should be fine for prediction purposes; it's just that you can only really learn about the combined effect of the two predictors (Collinearity in Bayesian models | Statistical Modeling, Causal Inference, and Social Science). I think where PCA and other dimension reduction techniques run into trouble is when you run a response-dependent procedure and don't have enough data to suitably guard against overfitting and distribution shift. If you really need to estimate the true effects of population density or forest cover then, as AWoodward said, there really is no solution other than providing more information by collecting more data or using stronger priors.
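As an illustration of the outcome-blinded option, something like this (continuing the placeholder names from the sketch above): standardise the two correlated predictors, replace them with their first principal component, and fit on that. The PCA only sees the predictors, never the response.

```r
library(brms)

# Outcome-blinded step: PCA on the two correlated predictors only.
pc <- prcomp(dat[, c("pop_density", "forest_cover")],
             center = TRUE, scale. = TRUE)
dat$density_forest_pc1 <- pc$x[, 1]  # combined density/forest axis

# Fit on the combined component instead of the two raw predictors.
fit_pc <- brm(
  response ~ density_forest_pc1 + x3 + x4 + x5 + x6 + x7 + (1 | site),
  data   = dat,
  family = gaussian()
)
```

The coefficient on `density_forest_pc1` then describes the shared density/forest axis, which is exactly the "combined effect" caveat above.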


To learn about the independent effects of the predictors in the linear modelling context: collect more data (though it could require much more), use a study design that controls the values of the predictors (which may not be possible), or implement some strong priors.
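For the strong-priors option, a minimal sketch in brms (assuming standardised predictors; the scales 0.5 and 0.25 are just examples, and the variable names are the same placeholders as above):

```r
library(brms)

# Tighter-than-default priors on the standardised slopes; extra-tight
# priors on the two collinear predictors help stabilise their estimates.
priors <- c(
  prior(normal(0, 0.5), class = "b"),
  prior(normal(0, 0.25), class = "b", coef = "pop_density"),
  prior(normal(0, 0.25), class = "b", coef = "forest_cover")
)

fit_prior <- brm(
  response ~ pop_density + forest_cover + x3 + x4 + x5 + x6 + x7 +
    (1 | site),
  data   = dat,
  family = gaussian(),
  prior  = priors
)
```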

I also appreciate this article (Jan Vanhove :: Collinearity isn't a disease that needs curing); it's about frequentist analyses, but the same principles apply.

Thank you @AWoodward and @js592.

@AWoodward - thank you, it was actually the Jan Vanhove article you posted that made me write this post. Whilst I would like to be able to write about both and interpret carefully, I am wondering whether or not I should remove one of the variables anyway, given I have a fairly small dataset?

For predictive purposes, it’s best to include them both.

For causal interpretation, considering what you have described, it is likely that removing one of them will not make the interpretation any easier, as the remaining variable's coefficient will also be doing the work of the dropped variable. If you want to make causal interpretations, you can consider what is possible without needing to think about sample size.

If a causal interpretation of the coefficients is not needed, you can more easily estimate how much predictive information each variable has alone and together.
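For instance, a sketch of that comparison with approximate leave-one-out cross-validation, reusing the placeholder `fit_full` from the sketch earlier in the thread:

```r
library(brms)

# Fit variants that each drop one of the correlated predictors.
fit_no_density <- update(fit_full, formula. = ~ . - pop_density)
fit_no_forest  <- update(fit_full, formula. = ~ . - forest_cover)

# Approximate leave-one-out CV: compare predictive performance to see
# how much information each variable adds on its own and jointly.
loo_full       <- loo(fit_full)
loo_no_density <- loo(fit_no_density)
loo_no_forest  <- loo(fit_no_forest)
loo_compare(loo_full, loo_no_density, loo_no_forest)
```

If dropping either variable barely changes the expected predictive performance, that is consistent with the two predictors carrying largely shared information.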