The outcome I want to model takes the values 0 or 1 and is very imbalanced. That is, 97% of the observations take the value 0. The machine learning literature recommends doing subsampling when you have data like that. **What is the approach that you would recommend when building a Bayesian model for predicting this type of data?** Could you point me to a paper or case study?

It depends on what you want to do. If you’re trying to model the *population* but the population is not as imbalanced as your sample, then you can try to subsample to create a new sample that’s more representative of the total population. This is a crude form of post-stratification. If you’re trying to model the *individuals*, then you fit the outcome based on the covariates for each individual, in which case the imbalance doesn’t matter.

This form of subsampling is also used in a naive attempt to improve the “signal to noise” in the data, artificially increasing the share of the rarer class so that there are proportionally more instances of that class from which to learn. From a modeling perspective this selection process just has to be treated as part of the observation model (data missing at random conditional on outcome), which is straightforward to incorporate into a Bayesian model.
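As a rough illustration of treating the selection as part of the observation model, here is a minimal sketch for a logistic regression. It assumes that rows are kept with a known probability that depends only on the outcome (`keep_pos` for 1s, `keep_neg` for 0s). Under that assumption the odds of a 1 among the kept rows are the population odds multiplied by `keep_pos / keep_neg`, so adding the fixed offset `log(keep_pos / keep_neg)` to the linear predictor makes the fitted coefficients refer to the population again. All names, the simulated data, and the Newton-based maximum-likelihood fit are illustrative assumptions, not something taken from a particular paper. In a Bayesian model the same offset would go into the linear predictor of the Bernoulli likelihood, with priors on the coefficients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selection_offset(keep_pos, keep_neg):
    """Log-odds shift caused by keeping y=1 rows with probability keep_pos
    and y=0 rows with probability keep_neg (assumed known)."""
    return np.log(keep_pos / keep_neg)

def fit_logistic(x, y, offset=0.0, n_iter=50):
    """Maximum-likelihood logistic regression with a fixed offset (Newton's method).
    Returns [intercept, slope, ...]."""
    X = np.column_stack([np.ones(len(y)), x])
    beta = np.zeros(X.shape[1])
    ybar = y.mean()
    # Start at the intercept-only solution so the Newton steps stay well behaved.
    beta[0] = np.log(ybar / (1.0 - ybar)) - offset
    for _ in range(n_iter):
        p = sigmoid(X @ beta + offset)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def simulate_and_fit(seed=0, n=200_000, alpha=-4.0, slope=1.0, keep_neg=0.1):
    """Simulate a rare binary outcome, keep all 1s and a fraction of the 0s,
    then fit with and without the selection offset."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = rng.binomial(1, sigmoid(alpha + slope * x))
    keep = (y == 1) | (rng.random(n) < keep_neg)  # outcome-dependent selection
    xs, ys = x[keep], y[keep]
    naive = fit_logistic(xs, ys)  # ignores the selection step
    corrected = fit_logistic(xs, ys, offset=selection_offset(1.0, keep_neg))
    return naive, corrected

if __name__ == "__main__":
    naive, corrected = simulate_and_fit()
    print("naive     (intercept, slope):", naive)
    print("corrected (intercept, slope):", corrected)
```

In this sketch the slope is the same under both fits, and only the intercept is shifted by the offset. That is specific to selection that depends on the outcome alone; if selection also depended on covariates, the observation model would need to reflect that too.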

The machine learning literature is not helpful in cases like these because various papers are recommending the method for *different* reasons. In general it’s because the techniques being used on the data presume relatively equally populated classes, but exactly how that presumption manifests depends on the details of the model. Hence to build a corresponding Bayesian model you have to figure out what those presumptions are and model them explicitly.