Generative model with class imbalances?

The outcome I want to model takes the values 0 or 1 and is very imbalanced. That is, 97% of the observations take the value 0. The machine learning literature recommends doing subsampling when you have data like that. What is the approach that you would recommend when building a Bayesian model for predicting this type of data? Could you point me to a paper or case study?

It depends on what you want to do. If you’re trying to model the population, but your sample is more imbalanced than the population, then you can subsample to create a new sample that’s more representative of the population. This is a crude form of post-stratification. If you’re trying to model individuals, then you fit the outcome conditional on each individual’s covariates, in which case the imbalance doesn’t matter.

This form of subsampling is also used in a naive attempt to improve the “signal to noise” in the data, artificially increasing the share of one of the rarer classes to provide more instances of that class from which to learn. From a modeling perspective, this selection process just has to be treated as part of the observation model (data missing at random conditional on the outcome), which is straightforward to incorporate into a Bayesian model.
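To make the "selection as part of the observation model" idea concrete, here is a minimal sketch under assumptions that are mine, not the original poster's: a logistic regression with one covariate, and known retention probabilities `r1` and `r0` for the 1s and 0s. By Bayes' rule, conditioning on an observation having been kept shifts the log-odds by `log(r1 / r0)`, so the selection enters the likelihood as a fixed offset. For brevity the sketch uses a maximum-likelihood fit in NumPy rather than full Bayesian inference; the same offset term would go into the likelihood of a Bayesian model. All names (`fit_logistic`, `alpha_true`, and so on) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, offset=0.0, n_iter=25):
    """Newton's method for logistic regression with a fixed scalar offset
    added to the linear predictor. Returns [intercept, slope]."""
    X = np.column_stack([np.ones_like(x), x])
    ybar = y.mean()
    # Start at the intercept-only solution so Newton steps stay well behaved.
    beta = np.array([np.log(ybar / (1.0 - ybar)) - offset, 0.0])
    for _ in range(n_iter):
        p = sigmoid(X @ beta + offset)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated population with roughly 97% zeros (assumed parameters).
rng = np.random.default_rng(0)
n = 200_000
alpha_true, beta_true = -4.0, 1.0
x = rng.normal(size=n)
y = rng.binomial(1, sigmoid(alpha_true + beta_true * x))

# Outcome-dependent subsampling: keep every 1, keep each 0 with probability r0.
r1, r0 = 1.0, 0.05
keep = rng.random(n) < np.where(y == 1, r1, r0)
xs, ys = x[keep], y[keep]

# Ignoring the selection biases the intercept; modeling it recovers it.
naive = fit_logistic(xs, ys)
corrected = fit_logistic(xs, ys, offset=np.log(r1 / r0))

print("naive     [intercept, slope]:", naive)
print("corrected [intercept, slope]:", corrected)
```

In this setup only the intercept is affected by the subsampling; the slope estimate is the same with or without the offset. That property is specific to the logistic link and to selection that depends on the outcome alone, so other models would need their own version of the selection term.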

The machine learning literature is not helpful in cases like these because different papers recommend the method for different reasons. In general, it’s because the techniques being used on the data presume relatively equally populated classes, but exactly how that presumption manifests depends on the details of the model. Hence, to build a corresponding Bayesian model, you have to figure out what those presumptions are and model them explicitly.