Generative model with class imbalances?

ignacio · October 7, 2018, 9:55am

The outcome I want to model is binary (0 or 1) and very imbalanced: 97% of the observations are 0. The machine learning literature recommends subsampling for data like this. What approach would you recommend when building a Bayesian model to predict this type of data? Could you point me to a paper or case study?

betanalpha · October 7, 2018, 1:55pm

It depends on what you want to do. If you’re trying to model the population, but the population is not as imbalanced as your sample, then you can subsample to create a new sample that’s more representative of the total population. This is a crude form of post-stratification. If you’re trying to model the individuals, then you fit the outcome based on the covariates for each individual, in which case the imbalance doesn’t matter.
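
As a rough illustration of the post-stratification idea, here is a minimal Python sketch. The strata names, event rates, and shares are hypothetical numbers chosen for the example, and `poststratify` is an illustrative helper, not a function from any library. It reweights per-stratum estimates by known population shares instead of the shares that happen to appear in the sample.

```python
def poststratify(stratum_means, population_shares):
    """Weight each stratum's estimate by its share of the population."""
    total = sum(population_shares.values())
    return sum(
        stratum_means[s] * population_shares[s] / total
        for s in population_shares
    )

# Hypothetical per-stratum event rates estimated from the sample.
stratum_means = {"a": 0.10, "b": 0.01}
# Hypothetical shares: stratum "a" is over-represented in the sample.
sample_shares = {"a": 0.5, "b": 0.5}
population_shares = {"a": 0.1, "b": 0.9}

# Naive estimate uses the sample's own composition.
naive = sum(stratum_means[s] * sample_shares[s] for s in sample_shares)
# Post-stratified estimate uses the population's composition.
adjusted = poststratify(stratum_means, population_shares)

print(naive, adjusted)
```

Subsampling to match the population composition aims at the same target as the reweighting above, but it discards data to get there, which is one way to read "crude" here.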

This form of subsampling is also used in a naive attempt to improve the “signal to noise” in the data, artificially increasing the proportion of one of the rarer classes so that it makes up more of the data from which to learn. From a modeling perspective, this selection process just has to be treated as part of the observation model (data missing at random conditional on outcome), which is straightforward to incorporate into a Bayesian model.
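
One way to make "part of the observation model" concrete is the following sketch, under assumptions that are not in the post: observations are kept independently, with a known probability `keep_pos` when the outcome is 1 and `keep_neg` when it is 0, and the model for the full data is a logistic regression with linear predictor `eta`. All names and numbers are hypothetical. By Bayes' rule, the probability of a 1 among the kept observations is `keep_pos * p / (keep_pos * p + keep_neg * (1 - p))`, which on the logit scale is just `eta` plus the constant `log(keep_pos / keep_neg)`.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    return math.log(p / (1.0 - p))

def selected_prob(p, keep_pos, keep_neg):
    """P(y = 1 | x, kept) when 1s are kept w.p. keep_pos and 0s w.p. keep_neg."""
    return keep_pos * p / (keep_pos * p + keep_neg * (1.0 - p))

# Hypothetical keep probabilities: keep every 1, keep 5% of the 0s.
keep_pos, keep_neg = 1.0, 0.05
offset = math.log(keep_pos / keep_neg)

# Hypothetical linear predictors on the full-population scale.
for eta in [-5.0, -3.5, -2.0, 0.0]:
    p = inv_logit(eta)                       # P(y = 1 | x) in the full data
    p_kept = selected_prob(p, keep_pos, keep_neg)
    # The selection step only shifts the logit by a known constant.
    print(eta, round(logit(p_kept) - eta, 6), round(offset, 6))
```

Under these assumptions, fitting the subsampled data with this known offset added to the linear predictor targets the same coefficients as the full-data model. If the keep probabilities are unknown or depend on covariates, the selection step needs its own explicit model.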

The machine learning literature is not helpful in cases like these because different papers recommend the method for different reasons. In general, it’s because the techniques being applied to the data presume roughly equally populated classes, but exactly how that presumption manifests depends on the details of the model. Hence, to build a corresponding Bayesian model, you have to figure out what those presumptions are and model them explicitly.

anhsmith · August 1, 2023, 12:31am

Dear @betanalpha, I don’t suppose you could point me to some examples of this, please?
