First off, I am new to Bayesian modeling. I could be completely missing an obvious solution.
I have a dataset containing verified anomalies that I would like to model using hierarchical logistic regression. The fitted model would then let me score new records with a probability of being an anomaly.
Since anomalies are very rare (by definition), the dataset is highly imbalanced. I feel a Bayesian approach using Stan could help because it allows the use of a prior, which acts as a regularization method, but also lets me incorporate beliefs about why certain records are anomalous (some predictors are more indicative than others, even if there is very little data to support that).
However, I'm having some problems. I'm using brms to build the model. Initially, I tried using HMC to estimate the model, but with around 30K records (15 fixed effects, 4 random effects, and 6 levels in the grouping factor), I quickly ran out of memory.
The model is set up like this:
prior <- c(set_prior('normal(0, 3)', class = 'b'),
           set_prior('normal(2, 0.75)', class = 'b', coef = 'x11'))

anomaly_fit <- brm(class ~ (x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
                            x11 + x12 + x13 + x14 + x15) + (x2 + x4 + x5 + x12 | user_group),
                   data = training, family = bernoulli(link = 'logit'),
                   cores = 4, silent = FALSE,
                   control = list(adapt_delta = .95),
                   iter = 10000, prior = prior, algorithm = 'fullrank')  # use ADVI (full-rank)
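For context, this is roughly how I plan to score new records once a model is fitted (just a sketch; new_data stands for a data frame with the same predictor and user_group columns as training):

scores <- fitted(anomaly_fit, newdata = new_data,
                 allow_new_levels = TRUE,    # in case a new user_group appears
                 scale = 'response')         # posterior summaries of P(anomaly) per row
new_data$anomaly_prob <- scores[, 'Estimate']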
I only see two options to have this scale up:
- Sample the data by user_group
a. However, even a sample of a few thousand records per group is very memory-intensive.
b. How do I know what sample size to use? At some point, with such highly imbalanced classes, it seems the data will quickly overpower the priors. (The per-group subsampling I have in mind is sketched right after this list.)
- Use ADVI
a. How would I know how good the approximation is? Is it trustworthy? (The comparison I was considering is sketched at the end of the post.)
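Regarding the first option, the per-group subsampling I have in mind looks roughly like this (a dplyr sketch; n_per_group is an arbitrary placeholder and class == 1 marks the verified anomalies):

library(dplyr)

n_per_group <- 2000   # placeholder: non-anomalies to keep per user_group

anomalies <- filter(training, class == 1)          # keep every verified anomaly
non_anomalies <- training %>%
  filter(class == 0) %>%
  group_by(user_group) %>%
  slice_sample(n = n_per_group) %>%                # capped at the group size if smaller
  ungroup()

training_sub <- bind_rows(anomalies, non_anomalies)

I realize that downsampling only the non-anomalies also changes the baseline rate the intercept reflects, which seems tied to the sample-size question above.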
Has anyone compared these two alternatives? Or does anyone have some suggestions?
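For reference, the only check I have come up with so far for option 2 is to fit the same model with both ADVI and the default NUTS sampler on a subsample small enough to sample, and then compare the posterior summaries. Roughly (training_small is just a placeholder for such a subsample):

f <- class ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
  x11 + x12 + x13 + x14 + x15 + (x2 + x4 + x5 + x12 | user_group)

fit_nuts <- brm(f, data = training_small, family = bernoulli(link = 'logit'),
                prior = prior, cores = 4, control = list(adapt_delta = .95))

fit_advi <- brm(f, data = training_small, family = bernoulli(link = 'logit'),
                prior = prior, algorithm = 'fullrank', iter = 10000)

round(fixef(fit_nuts), 2)   # population-level posterior summaries under NUTS
round(fixef(fit_advi), 2)   # same summaries under full-rank ADVI

I am not sure whether agreement on a subsample says much about the approximation on the full data, which is partly why I am asking.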