Scrutinising logistic regression models with strong class imbalance

ermeel · February 14, 2019, 8:21pm

Dear Stanimals,

Do you have any recommendations, best practices or case studies for logistic regression with highly unbalanced datasets (in my case 478 negatives versus 12 positives)? In particular, I am looking for test statistics for posterior predictive checks and LOO, as well as things to consider when doing model comparison. How would you approach it in general?

Many thanks!

PS.: Additionally, the dataset comes with collinearity in some binary regressors and a “hard” mutual exclusion between some of those binary regressors and the dependent binary output: if one of the collinear binary regressors is 1, then the output is definitely 0 (in the dataset and based on some theoretical arguments).

avehtari · February 14, 2019, 8:43pm

If there is a theoretical argument for this, then you should code it into your model. How many observations do you have after you have coded this rule?

How many covariates?

Here the problem is not just the imbalance, but also having so few observations from the other class. This makes it very easy to overfit, and very difficult to make any predictive checks.
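A minimal sketch of how one might answer the "how many observations are left" question above, assuming the hard rule is handled by treating any row with an exclusive regressor equal to 1 as a certain negative and fitting the regression only on the remaining rows. The function name `split_by_hard_rule` and its arguments are hypothetical, for illustration only.

```python
import numpy as np

def split_by_hard_rule(x_exclusive, y):
    """Separate rows decided by the hard rule from rows left for the regression.

    x_exclusive: (n, k) array of 0/1 regressors that rule out a positive outcome.
    y: (n,) array of 0/1 outcomes.
    Returns (keep_mask, n_remaining, n_positive_remaining).
    """
    x_exclusive = np.asarray(x_exclusive)
    y = np.asarray(y)
    # A row is decided by the rule if any exclusive regressor is 1.
    ruled_out = x_exclusive.any(axis=1)
    # The rule says these rows must all be negatives; fail loudly otherwise.
    if np.any(y[ruled_out] == 1):
        raise ValueError("data contradict the hard rule")
    keep = ~ruled_out
    return keep, int(keep.sum()), int(y[keep].sum())
```

The returned counts show how unbalanced the part of the data that still needs a statistical model is; the ruled-out rows get predicted probability 0 without any fitting.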

blokeman · February 14, 2019, 9:25pm

The rule of thumb on the frequentist side is not to have a ratio of less than 10 observations of the rarest outcome per predictor. I’m sure this can be stretched using Bayesian machinery, but ultimately, with only 12 outcomes of one type I suspect there’s only so much (i.e. not much) statistical inference that can be done with that dataset.
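As a rough illustration of that rule of thumb, here is the arithmetic for the counts given in the first post (478 negatives, 12 positives). The helper names are hypothetical, and the threshold of 10 is just the rule of thumb quoted above, not a hard limit.

```python
def events_per_variable(n_positive, n_negative, n_parameters):
    """Observations of the rarest outcome per model parameter."""
    rarest = min(n_positive, n_negative)
    return rarest / n_parameters

def max_parameters(n_positive, n_negative, min_epv=10):
    """Largest parameter count that still satisfies the rule of thumb."""
    return min(n_positive, n_negative) // min_epv

# Counts from the original question.
print(max_parameters(12, 478))          # parameters allowed at 10 events each
print(events_per_variable(12, 478, 6))  # ratio if the model had 6 parameters
```

With 12 positives the rule allows a single parameter, and a model with 6 parameters would have 2 events per parameter. Categorical covariates count once per non-reference level, so the parameter count can grow quickly.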

avehtari · February 15, 2019, 5:39pm

We fit Bayesian and non-Bayesian models with 0.0012 observations of the rarest outcome per predictor in "Projective Inference in High-dimensional Problems: Prediction and Feature Selection" (arXiv:1810.02406).

ermeel · February 16, 2019, 1:34pm

5-8 covariates.

In an alternative case I might have the same outcome (dependent variable) but 80-100 mutation indicator variables for oncogenes. Would projpred and a horseshoe prior make sense here?

avehtari · February 16, 2019, 3:20pm

After you handle those “hard” cases separately? I would recommend calibration plots as shown in Bayesian Logistic Regression with rstanarm, but having only 12 positives is a problem, and I guess you can only do visual inspection. The same goes for loo: you can use it to examine the influence of each observation, but it’s difficult to make any good summary with just 12 positives unless you have additional strong prior information.

I would say yes, but if you have only 12 positives you have to be extra careful in building the full / reference model.
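A minimal sketch of the numbers behind a calibration plot like the one mentioned above, assuming `p_hat` holds out-of-sample (for example leave-one-out) predictive probabilities and `y` the observed 0/1 outcomes. This is a plain binned version written for illustration; the function name is hypothetical and the rstanarm case study may construct its plot differently.

```python
import numpy as np

def calibration_table(p_hat, y, n_bins=5):
    """Bin predicted probabilities and compare them with observed frequencies.

    Returns one tuple per non-empty bin:
    (bin_low, bin_high, n_obs, mean_predicted, observed_frequency).
    """
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps each probability to a bin 0..n_bins-1.
    idx = np.digitize(p_hat, edges[1:-1])
    rows = []
    for b in range(n_bins):
        in_bin = idx == b
        n = int(in_bin.sum())
        if n == 0:
            continue  # skip empty bins instead of reporting NaN
        rows.append((float(edges[b]), float(edges[b + 1]), n,
                     float(p_hat[in_bin].mean()), float(y[in_bin].mean())))
    return rows
```

With only 12 positives most bins will contain few or no positives, so the observed frequencies are very noisy. That is why the table is better suited to visual inspection than to a summary statistic.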

ermeel · February 17, 2019, 4:22pm

Thanks. I should have been more precise, and I was wrong above. Actually there are 4 covariates that are mutually exclusive with each other, and all of them are exclusive with the outcome. There are then 6 additional covariates: three binary, one continuous and two categorical with around 5-10 classes each.

Could you elaborate what needs particular attention here?

avehtari · February 18, 2019, 8:06am

Sorry, if I knew I would have already told you, but with this few positives you have to rely mostly on your domain expertise and be prepared to investigate anything you find suspicious. I can comment once you have more specific questions.
