Improving Performance on Logistic Regression with Informative Priors

Hello –

I’m trying to fit a logistic regression with Stan using rstanarm and I’m running into some performance issues. My last run took about a day to train, and I plan to test sensitivity to priors and other assumptions, so I’m looking for ways to speed this puppy up.

I’m new to using Bayesian data analysis in the wild - hoping for some advice/pointers/wisdom from folks who are more familiar with these tools and can perhaps see where I’m misstepping.

I’ve spent a bit of time searching through the forum and elsewhere online. It seems like this post is closest to the issue I’m experiencing but it feels like my case is simpler.

I’ve written what feels like a fair amount, so I’ve organized the detail into sections below.

Thank you!

Problem Overview
I’m trying to fit a logistic regression with rstanarm’s stan_glm using informative priors on certain variables. I started training the model yesterday evening and it completed 800/1000 steps before crashing this afternoon. For comparison, fitting the model with glm takes under one minute.

The model is for use in a predictive setting, so I’d like to test the impact of using different prior parameterizations, amongst other sensitivity and performance analyses. Given that I’d like to run several tests/versions of the model, I’d like to avoid taking a full day to train each version.

I currently do not have access to a cluster – please assume that I only will be able to run this on a MacBook Pro.

Model and Data
A few quick points about the data and my current model setup are below.

  • Dataset consists of ~1MM rows, so fairly large but nothing extreme by today’s standards

  • Including the intercept, I have 25 variables in my regression

    • On the first 20, I’m using uninformative priors (I’ve used the default rstanarm priors thus far).

    • On the last 5, I’m using priors on both location and scale, informed by a separate regression. These are point estimates for the location and scale – nothing fancy in terms of hyperpriors or anything.

    • I’m currently using normal priors; I haven’t experimented with any other prior distributions yet.

  • The model is an econometric model – the last 5 variables are economic variables. The other variables are behavioral.

On the last point, the dataset is effectively panel data where the economic variables only vary across a few years. I suppose it’d be more correct to use some sort of FE or RE style model. I’m not super familiar with those, but I am incorporating other time-related variables. A rough sketch of my current setup is below.
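To make this concrete, below is roughly what my current call looks like. The data frame, column names, and numeric values are placeholders standing in for my real data and for the location/scale estimates from the separate regression - a sketch, not my exact code.

library(rstanarm)

# Placeholder names and numbers. The order of the prior vectors must match
# the order of the columns in the model matrix.
econ_loc   <- c(0.60, 0.25, -0.10, 0.05, 0.40)  # locations from the separate regression
econ_scale <- c(0.045, 0.05, 0.03, 0.02, 0.06)  # scales from the separate regression

prior_loc   <- c(rep(0, 19),   econ_loc)        # 24 slopes: 19 behavioral + 5 economic
prior_scale <- c(rep(2.5, 19), econ_scale)      # wide scales on the behavioral terms

fit <- stan_glm(
  y ~ .,                                        # y plus the 24 predictors in df
  data   = df,
  family = binomial(link = "logit"),
  prior  = normal(location = prior_loc, scale = prior_scale, autoscale = TRUE),
  prior_intercept = normal(0, 2.5),
  chains = 4, cores = 4, iter = 1000
)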

Solutions I’ve Tried or Explored

  1. The QR Decomposition

Based on recommendations I’ve seen on this forum and elsewhere, this seems like a magic bullet of sorts for speeding up fitting. Indeed, I tried this and my regression ran in ~4 hours instead of running overnight and failing the afternoon after.

However, the results of that regression didn’t make much sense in that the variables for which I had a strong prior had unintuitive results. For instance, one variable with an informative prior has a prior distribution of normal(0.6, 0.045^2) – the posterior estimate was -410 with MAD_SD of 3.

Moreover, the Stan user guide mentions:

Consequently, this QR reparameterization is recommended for linear and generalized linear models in Stan whenever K>1 and you do not have an informative prior on the location of \beta.
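For reference, the only change for the QR run was flipping QR on - a sketch reusing df, prior_loc, and prior_scale from my earlier snippet:

# With QR = TRUE the priors are applied to the coefficients of Q from the
# thin QR decomposition X = Q R rather than to beta itself, which is
# presumably why my informative priors stopped behaving as intended.
fit_qr <- stan_glm(
  y ~ ., data = df, family = binomial(link = "logit"),
  prior  = normal(location = prior_loc, scale = prior_scale, autoscale = TRUE),
  QR     = TRUE,
  chains = 4, cores = 4, iter = 1000
)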

  2. Data Sampling

Initially, I tested the regression on a small simple random sample, without the QR decomposition but with the prior scales manually scaled by the corresponding in-sample SDs. This worked in terms of getting an answer relatively quickly, but the resulting coefficients (with or without informative priors) weren’t close to the full-sample coefficients from the glm results.

I’ve looked into taking balanced and/or stratified samples but those algorithms take a fair amount of time on my dataset as well. Also seemed a little out there in terms of common practice – especially with a dataset of ~1MM observations.
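In case it’s useful, the sanity check I ran on the subsample was essentially this (placeholder seed and sample size):

# Does a simple random subsample even reproduce the full-sample glm fit?
set.seed(1)
df_sub <- df[sample(nrow(df), 5e4), ]

coef_full <- coef(glm(y ~ ., data = df,     family = binomial()))
coef_sub  <- coef(glm(y ~ ., data = df_sub, family = binomial()))
round(cbind(full = coef_full, subsample = coef_sub), 3)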

  3. Fitting the regression in multiple stages

It sounds like standardizing variables tends to help with performance, so I’m wondering if I could break up my regression training into two stages:

  • First, fit the variables for which I’m using non-informative priors, centering/scaling those variables.

  • Second, perform an “update” step by adding in the variables for which I want to use an informative prior.

Haven’t seen anything like this online - seems awkward/hacky.

Additional Information/Questions

I’m using rstanarm for convenience but am not opposed to writing a custom Stan program, if that would help with performance.

This is one of several models I will be developing that will feed into a transition matrix-style risk model. A similar example is here.

Obviously my post is about using a logistic regression instead of a multinomial regression, as discussed in the link. I plan to test stacking separate logit models in a one-vs-all approach and using a multinomial approach to see which performs better. Ideally, I want to use different variable sets for each alternative.

If there’s a way to make this more efficient without using the logit approach and instead using multinomial or something similar, I’m happy to explore that avenue…

2 Likes

Uninformative priors can often cause trouble, especially if there’s collinearity in your data. Have you tried setting weakly informative priors on all your parameters? The coefficients in logistic regression are parametrized in log odds, so a normal(0, 1) prior actually covers a range of possibly very strong coefficients. For example, the parameter values -2 and 2 aren’t implausible under that prior (roughly 5% of the distribution lies beyond ±2), but in log odds that would mean a change of 1 unit in the predictor could move the probability of the outcome from 50% to 88% or 12% (i.e. exp(0) / (exp(0) + 1) = 50%, exp(2) / (exp(2) + 1) = 88%, exp(-2) / (exp(-2) + 1) = 12%). So anything wider than normal(0, 1) is probably overkill, especially in social science, where you’re unlikely to get giant effects from any single variable.

Try setting a more informative prior and see what happens. Also, check whether there’s collinearity between your model parameters using mcmc_pairs() or ShinyStan.
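If it helps, here is a tiny sketch of the arithmetic above (plogis() is the inverse logit in base R) and of what I mean by a more informative prior, plus the pairs check. The data frame and coefficient names are placeholders for your own:

plogis(0)    # 0.50  -- baseline probability for a coefficient of 0
plogis(2)    # ~0.88 -- probability implied by a +2 shift in log odds
plogis(-2)   # ~0.12 -- probability implied by a -2 shift in log odds

# Weakly informative priors on all (scaled) coefficients, then a quick
# collinearity check on a couple of parameters.
library(rstanarm)
fit_weak <- stan_glm(
  y ~ ., data = df, family = binomial(link = "logit"),
  prior = normal(0, 1, autoscale = FALSE)
)
bayesplot::mcmc_pairs(as.matrix(fit_weak), pars = c("e1", "e2"))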

I completely agree with the normal(0,1) recommendation. If you do this, the assumption is that the predictors are scaled, which I would recommend anyway because it makes it easier for Stan’s warmup to find the typical set.

One caveat from the stan_glm help page:

Note: Unless QR=TRUE, if prior is from the Student t family or Laplace family, and if the autoscale argument to the function used to specify the prior (e.g. normal) is left at its default and recommended value of TRUE, then the default or user-specified prior scale(s) may be adjusted internally based on the scales of the predictors. See the priors help page and the Prior Distributions vignette for details on the rescaling and the prior_summary function for a summary of the priors used for a particular model.

Depending on whether QR=TRUE and autoscale=TRUE, stan_glm may be doing the scaling internally. This might explain why some models work and others don’t: the priors are implicitly different.
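You can see exactly what happened in a given fit with prior_summary(), which prints both the specified and the adjusted scales - a minimal sketch, assuming a fitted stanreg object called fit:

# Compare the prior scales you specified with the scales rstanarm actually
# used after any internal autoscaling.
prior_summary(fit)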

1 Like

Thank you @abartonicek and @stijn. I tried your suggestions and it ran in ~3 hours - certainly a huge improvement over 1 day! Results look much more intuitive as well.

For anyone encountering a similar problem, below is the list of what I changed:

  • Centered and scaled data for the regressors where I was previously using the uninformative priors
  • Shrank the uninformative priors to N(0,1)
  • Switched-off the QR decomposition
  • Used autoscale with the priors and adjusted my desired scale for the priors on the last 5 variables by multiplying by their standard deviations. To @stijn’s point, stan_glm performs some adjustments when autoscaling; since the variance of those variables is low, I think the adjustments were causing those priors to be incredibly diffuse. A rough sketch of the final call is below.
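For reference, the configuration that worked looked roughly like this, reusing the placeholder df, econ_loc, and econ_scale from my first snippet (and again assuming the behavioral columns come before the economic ones in the model matrix):

econ_cols  <- paste0("e", 1:5)                    # placeholder economic column names
behav_cols <- setdiff(names(df), c("y", econ_cols))

df[behav_cols] <- scale(df[behav_cols])           # center and scale the behavioral regressors
econ_sds <- sapply(df[econ_cols], sd)

fit_final <- stan_glm(
  y ~ ., data = df, family = binomial(link = "logit"),
  prior = normal(
    location  = c(rep(0, 19), econ_loc),
    # pre-multiply by sd(x) so autoscale's division by sd(x) cancels out
    scale     = c(rep(1, 19), econ_scale * econ_sds),
    autoscale = TRUE
  ),
  QR     = FALSE,                                 # QR decomposition switched off
  chains = 4, cores = 4, iter = 1000
)
prior_summary(fit_final)                          # confirm the adjusted scales look right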

In terms of setting more informative priors, I’ve been a little confused as to what counts as “informative” since it does depend on the problem. But I think what you both are saying is that we can set up more informative bounds based on our knowledge of the target distribution and on what effect sizes would be reasonable. Effectively, we can use this to reduce the search space, which in turn helps with performance when training a large model.

I did more poking around on the QR decomposition - I found the description in the prior_summary help page informative. Essentially, when the QR decomposition is used, the priors are interpreted in terms of the matrix Q instead of the matrix X. That is, the priors are applied after the QR decomposition.

Is there a reason why stan_glm doesn’t perform a transformation so that our priors are interpreted in terms of X when we use the QR decomposition? Using the QR decomposition seems to improve performance a great deal - curious as to why it’s not a default. I played around with the QR transformation to see if I could specify priors in Q-space up front but wasn’t successful. Might be a weekend project but also realize there’s likely a good reason for why QR is switched off by default. In fact, the prior_summary page ends with:

If you are interested in the prior on \beta implied by the prior on \theta, we strongly recommend visualizing it as described above in the Description section, which is simpler than working it out analytically.

The visualization referenced in that quote, from my reading, is to use posterior_vs_prior, but that function is used after fitting. It’d be helpful to be able to visualize the prior on \beta (in terms of X) before fitting the model - not sure if something like that already exists.
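One workaround that seems to do roughly what I want: refit with prior_PD = TRUE, which samples from the prior only (no conditioning on the outcome), then plot the resulting draws of \beta, which as far as I can tell are reported on the scale of X even when QR = TRUE. A sketch, again with placeholder names and reusing prior_loc/prior_scale from my earlier snippet:

fit_prior <- stan_glm(
  y ~ ., data = df, family = binomial(link = "logit"),
  prior    = normal(location = prior_loc, scale = prior_scale, autoscale = TRUE),
  QR       = TRUE,
  prior_PD = TRUE,          # draw from the prior, not the posterior
  chains   = 2, iter = 1000
)
bayesplot::mcmc_hist(as.matrix(fit_prior), pars = "e1")   # implied prior on one coefficient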

In any case, shrinking the priors helped substantially, so marking this as solved.

1 Like

Yep, I think that’s a very good way to summarize it. Except it’s not just about bounding the parameter space to speed up training, it’s also about making humanly reasonable inferences. A statistical model (without proper priors) has no notion of sensible bounds on the parameter space, ones that even a 5-year-old might come up with. That’s something that becomes very obvious with collinearity. For example, if you were predicting a person’s lung capacity from their age and the number of cigarettes they smoke per day, and age and smoking were very tightly correlated, your model might come up with some very weird estimates - in the extreme case, it might consider it reasonable to estimate that every unit of age adds infinity to lung capacity and every cigarette subtracts infinity plus 10. However, if you add a prior that provides regularization (i.e. any at least weakly informative prior that puts less probability density on parameter values farther from zero), the model will (correctly) estimate the coefficients for both age and cigarettes as moderate and negative.

1 Like