I’m trying to fit a logistic regression with Stan using rstanarm and I’m running into some performance issues. My last run took about a day to train, and I plan to test sensitivity to priors and other assumptions, so I’d like to figure out how to speed this puppy up.
I’m new to using Bayesian data analysis in the wild, so I’m hoping for some advice/pointers/wisdom from folks who are more familiar with these tools and can perhaps see where I’m misstepping.
I’ve spent a bit of time searching through the forum and elsewhere online. It seems like this post is closest to the issue I’m experiencing but it feels like my case is simpler.
I’ve written what feels like a fair amount, so I’ve organized the detail into sections below.
I’m trying to fit a logistic regression with rstanarm’s stan_glm using informative priors on certain variables. I started training the model yesterday evening, and it completed 800/1000 iterations before crashing this afternoon. For comparison, fitting the same model with glm takes under a minute.
The model is for use in a predictive setting, so I’d like to test impact of using different prior parameterizations, amongst other sensitivity and performance analyses. Given that I’d like to run several tests/versions of the model, I’d like to avoid taking a full day to train one version.
I currently do not have access to a cluster – please assume that I only will be able to run this on a MacBook Pro.
Model and Data
A few quick points about the data and my current model setup are below.
Dataset consists of ~1MM rows, so fairly large but nothing extreme by today’s standards
Including the intercept, I have 25 variables in my regression
On the first 20, I’m using the default rstanarm priors, which I understand are weakly informative rather than fully uninformative.
On the last 5, I’m using priors on both location and scale, informed by a separate regression. These are point estimates for the location and scale – nothing fancy in terms of hyperpriors or anything.
I’m currently using normal priors throughout; I haven’t experimented with other prior distributions.
The model is an econometric model – the last 5 variables are economic variables. The other variables are behavioral.
On the last point, the dataset is effectively panel data, and the economic variables only vary across a few years. It’d probably be more correct to use some sort of fixed-effects (FE) or random-effects (RE) model. I’m not super familiar with those, but I am incorporating other time-related variables.
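For concreteness, here’s a minimal sketch of the kind of call I’m making. The variable names, the data frame `df`, and all prior values except the one I quote later (normal with location 0.6 and scale 0.045) are placeholders, not my actual setup:

```r
library(rstanarm)

# 24 predictors plus intercept: roughly the rstanarm default prior
# (normal with scale 2.5) on the first 19 slopes, informative normal
# priors on the last 5 economic variables. All values illustrative.
prior_loc   <- c(rep(0, 19),   0.6,   0.3, -0.2,  0.1,  0.5)
prior_scale <- c(rep(2.5, 19), 0.045, 0.05, 0.05, 0.05, 0.05)

fit <- stan_glm(
  y ~ .,                                   # logit on all 24 predictors
  data   = df,                             # ~1MM-row data frame
  family = binomial(link = "logit"),
  prior  = normal(location = prior_loc, scale = prior_scale),
  chains = 4, iter = 1000
)
```

(rstanarm’s `normal()` accepts vectors for `location` and `scale`, one entry per coefficient, which is how I’m mixing default-like and informative priors in one call.)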
Solutions I’ve Tried or Explored
- The QR Decomposition
Based on recommendations I’ve seen on this forum and elsewhere, this seems like a magic bullet of sorts for speeding up fitting. Indeed, when I tried it, my regression ran in ~4 hours instead of running overnight and failing the next afternoon.
However, the results of that regression didn’t make much sense: the variables for which I had a strong prior came out with unintuitive estimates. For instance, one variable with an informative normal(0.6, 0.045^2) prior ended up with a posterior estimate of -410 and a MAD_SD of 3.
Moreover, the Stan user guide mentions:
> Consequently, this QR reparameterization is recommended for linear and generalized linear models in Stan whenever K>1 and you do not have an informative prior on the location of \beta.
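In rstanarm this is just a flag on the same call as above (placeholder names as before). If I understand the docs correctly, with `QR = TRUE` the priors effectively apply to the coefficients of the orthogonalized design matrix rather than the original predictors, which may be why my informative prior, stated on the original scale, produced nonsense:

```r
# Same sketch as before, with the QR reparameterization turned on.
# Caution (my reading of the docs): the prior now pertains to the
# transformed coefficients, so an informative location prior stated
# on the original predictor scale may no longer mean what I intend.
fit_qr <- stan_glm(
  y ~ ., data = df, family = binomial(link = "logit"),
  prior  = normal(location = prior_loc, scale = prior_scale),
  QR     = TRUE,
  chains = 4, iter = 1000
)
```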
- Data Sampling
Initially, I tested the regression on a small simple random sample, without the QR decomposition, but with the prior scales manually rescaled by the corresponding in-sample SDs. This worked in terms of getting an answer relatively quickly, but the resulting coefficients (with or without informative priors) weren’t close to the full-sample glm coefficients.
I’ve looked into taking balanced and/or stratified samples, but those algorithms also take a fair amount of time on my dataset. It also seemed a little out there in terms of common practice – especially with a dataset of ~1MM observations.
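What I did is roughly the following (subsample size, variable names, and prior values are placeholders; whether to multiply or divide by the SDs depends on whether the predictors themselves are standardized, which is part of what I’m unsure about):

```r
# Simple random subsample of the full data, with the informative prior
# scales rescaled by the in-sample SDs of the economic variables.
set.seed(1)
sub <- df[sample(nrow(df), 5e4), ]         # 50k rows, arbitrary choice

econ_vars <- c("econ1", "econ2", "econ3", "econ4", "econ5")
sds <- vapply(sub[econ_vars], sd, numeric(1))

fit_sub <- stan_glm(
  y ~ ., data = sub, family = binomial(link = "logit"),
  prior = normal(
    location = c(rep(0, 19), 0.6, 0.3, -0.2, 0.1, 0.5),
    scale    = c(rep(2.5, 19), c(0.045, 0.05, 0.05, 0.05, 0.05) * sds)
  ),
  chains = 4, iter = 1000
)
```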
- Fitting regression in multiple stages
It sounds like standardizing variables tends to help with performance, so I’m wondering if I could break up my regression training into two stages:
First, fit the variables for which I’m using non-informative priors, centering/scaling those variables.
Second, perform an “update” step by adding in the variables for which I want to use an informative prior.
I haven’t seen anything like this online – it seems awkward/hacky.
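Independent of the staged idea, my understanding of how standardization interacts with my priors is this (column names hypothetical): if a predictor x is standardized to (x - mean(x)) / sd(x), then a coefficient b on the raw scale corresponds to b_std = b * sd(x) on the standardized scale, so a prior normal(mu, tau) on b should translate to normal(mu * sd(x), tau * sd(x)) on b_std.

```r
# Center/scale all predictors once up front, then transform the
# informative priors to match. Names and values are illustrative.
pred_cols <- setdiff(names(df), "y")
df_std <- data.frame(y = df$y, scale(df[pred_cols]))

sd_x <- sd(df$econ1)                       # one of the economic variables
# raw-scale prior normal(0.6, 0.045) on b  ->  standardized-scale prior:
#   normal(0.6 * sd_x, 0.045 * sd_x) on b_std = b * sd_x
```

If that transformation logic is wrong, that alone might explain some of my subsample results, so corrections welcome.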
I’m using rstanarm for convenience but am not opposed to writing a custom Stan program, if that would help with performance.
This is one of several models I will be developing that will feed into a transition matrix-style risk model. A similar example is here.
Obviously my post is about using a logistic regression rather than the multinomial regression discussed in the link. I plan to test stacking separate logit models in a one-vs-all approach against a multinomial approach to see which performs better. Ideally, I want to use different variable sets for each alternative.
If there’s a way to make this more efficient without using the logit approach and instead using multinomial or something similar, I’m happy to explore that avenue…