Scrutinising logistic regression models with strong class imbalance

Dear Stanimals,

Do you have any recommendations, best practices, or case studies for logistic regression with highly unbalanced datasets (in my case 478 negatives versus 12 positives)? In particular, I am looking for test statistics for posterior predictive checks and LOO, as well as things to consider when doing model comparison. How would you approach it in general?

Many thanks!

PS: Additionally, the dataset has collinearity among some binary regressors and a “hard” mutual exclusion between some of those binary regressors and the binary outcome: if one of the collinear binary regressors is 1, then the outcome is definitely 0 (in the dataset and based on some theoretical arguments).

If there is a theoretical argument for this, then you should code it into your model. How many observations do you have left after you have coded this rule?
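One possible reading of "coding the rule" is to set aside the rows that the rule already decides and count what is left for the regression. The sketch below uses made-up toy rows and hypothetical column names (`excl_a`, `excl_b`, `y`); it is not the poster's data, only an illustration of the bookkeeping.

```python
# Toy rows with hypothetical column names; not the poster's data.
rows = [
    {"excl_a": 1, "excl_b": 0, "y": 0},
    {"excl_a": 0, "excl_b": 1, "y": 0},
    {"excl_a": 0, "excl_b": 0, "y": 1},
    {"excl_a": 0, "excl_b": 0, "y": 0},
    {"excl_a": 0, "excl_b": 0, "y": 0},
]


def split_by_rule(rows, exclusion_cols, outcome="y"):
    """Separate rows decided by the hard rule from rows left for the regression."""
    hard, remaining = [], []
    for row in rows:
        if any(row[c] == 1 for c in exclusion_cols):
            # The rule says the outcome must be 0 here; fail loudly if the data disagree.
            if row[outcome] != 0:
                raise ValueError(f"rule violated by row: {row}")
            hard.append(row)
        else:
            remaining.append(row)
    return hard, remaining


hard, remaining = split_by_rule(rows, ["excl_a", "excl_b"])
n_pos = sum(r["y"] for r in remaining)
print(f"decided by rule: {len(hard)}, left for the model: {len(remaining)}, positives left: {n_pos}")
```

The count of remaining rows and remaining positives is what matters for everything downstream, since the rows decided by the rule carry no information about the other coefficients' effect on the positive class.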

How many covariates?

Here the problem is not just the imbalance, but also having so few observations from the other class. This makes it very easy to overfit and very difficult to do any predictive checks.

The rule of thumb on the frequentist side is not to have a ratio of less than 10 observations of the rarest outcome per predictor. I’m sure this can be stretched using Bayesian machinery, but ultimately, with only 12 outcomes of one type I suspect there’s only so much (i.e. not much) statistical inference that can be done with that dataset.
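To make the arithmetic behind that rule of thumb concrete, here is a small sketch. The function names are made up for illustration, and the predictor counts in the loop are arbitrary examples rather than the actual model.

```python
def events_per_variable(n_pos, n_neg, n_predictors):
    """Observations of the rarest outcome divided by the number of predictors."""
    return min(n_pos, n_neg) / n_predictors


def max_predictors(n_pos, n_neg, min_epv=10):
    """Largest predictor count that keeps the ratio at or above min_epv."""
    return min(n_pos, n_neg) // min_epv


# 478 negatives and 12 positives, as in the original question.
for k in (1, 2, 4, 8):  # illustrative predictor counts
    print(k, events_per_variable(12, 478, k))

print("predictors allowed by the rule of thumb:", max_predictors(12, 478))
```

With 12 positives, the threshold of 10 leaves room for a single predictor, which is why the rule is so restrictive here.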

We fit Bayesian and non-Bayesian models with 0.0012 observations of the rarest outcome per predictor in [1810.02406] Projective Inference in High-dimensional Problems: Prediction and Feature Selection.


5-8.

In an alternative case I might have the same outcome (dependent variable) but 80-100 mutation indicator variables for oncogenes. Would projpred and a horseshoe prior make sense here?

After you handle those “hard” cases separately? I would recommend calibration plots as shown in Bayesian Logistic Regression with rstanarm, but having only 12 positives is a problem, and I guess you can only do visual inspection. The same goes for loo: you can use it to examine the influence of each observation, but it’s difficult to make any good summary with just 12 positives unless you have additional strong prior information.
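As a rough illustration of what goes into a calibration plot, the sketch below bins predicted probabilities and compares the mean prediction in each bin with the observed fraction of positives. This is a plain equal-width binning written for illustration, not necessarily the method used in the rstanarm case study, and the probabilities and outcomes are made-up numbers.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (mean predicted, observed rate, n, n_pos) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        n = len(members)
        mean_p = sum(p for p, _ in members) / n
        n_pos = sum(y for _, y in members)
        table.append((mean_p, n_pos / n, n, n_pos))
    return table


# Hypothetical predicted probabilities and observed outcomes.
probs = [0.02, 0.05, 0.1, 0.15, 0.3, 0.35, 0.7, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

table = calibration_table(probs, outcomes)
for mean_p, rate, n, n_pos in table:
    print(f"mean predicted {mean_p:.2f}  observed {rate:.2f}  n={n}  positives={n_pos}")
```

Printing the number of positives per bin makes the limitation visible: with 12 positives spread over several bins, each observed rate rests on a handful of cases, so the plot supports visual inspection rather than a numerical summary.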

I would say yes, but if you have only 12 positives you have to be extra careful in building the full / reference model.

Thanks. I should have been more precise, and I was wrong above. Actually it’s 4 covariates that are mutually exclusive among each other, and all of them are exclusive with the outcome. There are then 6 additional covariates: three binary, one continuous, and two categorical with around 5-10 classes each.

Could you elaborate what needs particular attention here?

Sorry, if I knew I would have already told you, but with this few positives you have to rely mostly on your domain expertise and be prepared to investigate anything you find suspicious. I can comment once you have more specific questions.