Hierarchical logistic regression on anomalies

rsteckel · March 26, 2018, 2:22pm

First off, I am new to Bayesian modeling. I could be completely missing an obvious solution.

I have a dataset containing verified anomalies that I would like to model using hierarchical logistic regression. Using this model, I would then be able to get a probabilistic score for new data as an anomaly probability.

Since anomalies are rare by definition, the dataset is highly imbalanced. I feel a Bayesian approach using Stan could help because it allows the use of a prior, which acts as a regularization method and also lets me incorporate beliefs about why certain records are anomalous (some predictors are more indicative than others, even if there is very little data to support it).

However, I'm having some problems. I'm using brms to build the model. Initially, I tried using HMC to estimate the model, but with around 30K records (15 fixed effects, 4 random effects, and 6 levels in the grouping factor), I quickly ran out of memory.

The model is set up like this:

library(brms)

# Weakly informative prior on all coefficients, informative prior on x11
prior <- c(set_prior('normal(0, 3)', class = 'b'),
           set_prior('normal(2, .75)', class = 'b', coef = 'x11'))

anomaly_fit <- brm(class ~ (x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
                            x11 + x12 + x13 + x14 + x15) + (x2 + x4 + x5 + x12 | user_group),
                   data = training, family = bernoulli(link = 'logit'), cores = 4, silent = FALSE,
                   control = list(adapt_delta = .95),
                   iter = 10000, prior = prior, algorithm = 'fullrank') # Use ADVI
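
To make the informative prior on x11 concrete, it can help to translate it from the log-odds scale to odds ratios. The following is a minimal sketch in base R, assuming x11 is on a scale where a one-unit change is meaningful; the variable names are only illustrative.

```r
# Prior from the model above: coefficient of x11 ~ normal(2, 0.75) on the log-odds scale
prior_mean <- 2
prior_sd <- 0.75

# 2.5%, 50% and 97.5% prior quantiles on the log-odds scale
log_odds_q <- qnorm(c(0.025, 0.5, 0.975), mean = prior_mean, sd = prior_sd)

# Exponentiate to get odds ratios per one-unit increase in x11
odds_ratio_q <- exp(log_odds_q)
names(odds_ratio_q) <- c("2.5%", "50%", "97.5%")
print(round(odds_ratio_q, 2))
```

The prior median corresponds to an odds ratio of exp(2), about 7.4, so this prior says fairly strongly that larger x11 makes an anomaly more likely.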

I only see two options to have this scale up:

1. Sample the data by user_group
   a. However, even a sample of a few thousand records per group is very memory intensive
   b. How do I know what sample size to use? At some point, with such highly imbalanced classes, it seems the data will quickly overpower the priors.
2. Use ADVI
   a. How would I know how good the approximation is? Is it trustworthy?

Has anyone compared these two alternatives? Or does anyone have some suggestions?
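
For the first option, one possible way to sample by user_group is to keep every anomaly and draw a fixed number of non-anomalies per group. This is only a sketch in base R: the simulated `training` data frame, the helper `subsample_by_group`, and the `n_controls` value are all hypothetical stand-ins, not part of the original model.

```r
set.seed(1)

# Hypothetical stand-in for `training`: 30K rows, 6 groups, roughly 1% anomalies
n <- 30000
training <- data.frame(
  user_group = factor(sample(paste0("g", 1:6), n, replace = TRUE)),
  class = rbinom(n, 1, 0.01)
)

# Keep every anomaly; sample at most `n_controls` non-anomalies per user_group
subsample_by_group <- function(data, n_controls = 1000) {
  pieces <- lapply(split(data, data$user_group), function(g) {
    cases <- g[g$class == 1, , drop = FALSE]
    controls <- g[g$class == 0, , drop = FALSE]
    keep <- sample.int(nrow(controls), min(n_controls, nrow(controls)))
    rbind(cases, controls[keep, , drop = FALSE])
  })
  do.call(rbind, pieces)
}

training_sub <- subsample_by_group(training, n_controls = 1000)

# Anomaly and non-anomaly counts per group after subsampling
print(table(training_sub$user_group, training_sub$class))
```

Note that downsampling only the non-anomalies changes the base rate, so the fitted intercepts would reflect the subsample rather than the full data, and predicted probabilities would need to be adjusted before being read as anomaly probabilities.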

bgoodri · March 26, 2018, 2:26pm

That should not have happened, so something else must have gone wrong.
