Separable logistic regression performance optimizations

I’m experiencing significant performance slowdowns in Stan when fitting logistic regression models to separable or near-separable data.

Performance is important, so we have implemented a few optimizations. We use QR decomposition to reduce the dependence between parameters, and we also exploit sufficient statistics to reduce the time we spend computing the gradient. These optimizations have a noticeable impact on performance, but we’re still struggling in a few corner cases.

Here is our current model.

data {
  int<lower=0> num_predictors;  // Number of predictors
  int<lower=0> num_original_predictors;  // Number of predictors *before* qr decomposition
  int<lower=1> num_rows;  // Number of rows
  matrix[num_rows, num_predictors] predictors;
  array[num_rows] int<lower=0> num_successes;
  array[num_rows] int<lower=1> count;  // Number of repetitions
  matrix[num_original_predictors, num_predictors] r_star_inverse;
}
parameters {
  vector[num_predictors] theta_beta_tilde;
}
transformed parameters {
  vector[num_original_predictors] beta = r_star_inverse * theta_beta_tilde;
}
model {
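  // heavy-tailed prior; the prior is what keeps the posterior proper under separability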
  theta_beta_tilde ~ student_t(9, 0, 1);

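  // binomial likelihood aggregated over repeated rows via count (sufficient statistics)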
  target += binomial_logit_lupmf(num_successes | count, predictors * theta_beta_tilde);
}
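
For reference, the predictors and r_star_inverse inputs above can also be built inside Stan rather than precomputed. Here is a minimal sketch using Stan’s built-in thin-QR functions, assuming the raw design matrix (called x here; the name is ours) has full column rank so that num_predictors and num_original_predictors coincide:

data {
  int<lower=1> num_rows;
  int<lower=0> num_original_predictors;
  matrix[num_rows, num_original_predictors] x;  // raw design matrix, before any rotation
  array[num_rows] int<lower=0> num_successes;
  array[num_rows] int<lower=1> count;  // Number of repetitions
}
transformed data {
  // thin QR decomposition, scaled so the rotated predictors have roughly unit scale
  matrix[num_rows, num_original_predictors] q_star = qr_thin_Q(x) * sqrt(num_rows - 1);
  matrix[num_original_predictors, num_original_predictors] r_star_inverse
      = inverse(qr_thin_R(x) / sqrt(num_rows - 1));
}
parameters {
  vector[num_original_predictors] theta_beta_tilde;
}
transformed parameters {
  vector[num_original_predictors] beta = r_star_inverse * theta_beta_tilde;
}
model {
  theta_beta_tilde ~ student_t(9, 0, 1);
  target += binomial_logit_lupmf(num_successes | count, q_star * theta_beta_tilde);
}

Because the decomposition lives in transformed data, it runs once before sampling and never enters the gradient computation.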

However, our model can run 10x-100x slower (3 hours vs. 5 minutes) when there is separability or near-separability. Under separability, the maximum-likelihood estimates for one or more of the coefficients diverge to infinity; for example, if some predictor is positive exactly when the outcome is a success, the likelihood keeps increasing as that coefficient grows without bound. Only the inclusion of a prior ensures that the problem is well-defined.

However, even when our model is well-defined, the shape of the posterior seems to give Stan some difficulties.

(Images of the posterior omitted.)

It seems that we have a funnel-shaped posterior, which is well known to cause Stan problems.

I would love to hear approaches for speeding up model fitting when there is separability.

Here are some ideas that I’ve had:

Transforming the parameters

My guess as to why model fitting is so slow is that the curvature of the posterior along theta_beta_tilde[1] is highly variable. Stan is spending equal time sampling from a small region with high likelihood and a much larger region with lower likelihood. However, I’m not sure what a good transform would be. Further, because the curvature comes from the data and not the model, we’d have to compute a transform numerically.
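
One way to get a numerically computed transform is a Laplace-style affine reparameterization: find the posterior mode and Hessian outside Stan, then let the sampler work in a standardized space. A minimal sketch, where theta_hat (the mode) and hessian_chol (a Cholesky factor of the negative inverse Hessian at the mode) are hypothetical inputs you would compute beforehand and pass in as data:

data {
  int<lower=0> num_predictors;
  int<lower=0> num_original_predictors;
  int<lower=1> num_rows;
  matrix[num_rows, num_predictors] predictors;
  array[num_rows] int<lower=0> num_successes;
  array[num_rows] int<lower=1> count;
  matrix[num_original_predictors, num_predictors] r_star_inverse;
  vector[num_predictors] theta_hat;  // posterior mode (hypothetical input)
  cholesky_factor_cov[num_predictors] hessian_chol;  // Cholesky factor of -H^{-1} at the mode (hypothetical input)
}
parameters {
  vector[num_predictors] z;  // roughly unit scale if the Laplace approximation is decent
}
transformed parameters {
  // affine map back to the original scale; the Jacobian is constant, so no target adjustment is needed
  vector[num_predictors] theta_beta_tilde = theta_hat + hessian_chol * z;
  vector[num_original_predictors] beta = r_star_inverse * theta_beta_tilde;
}
model {
  theta_beta_tilde ~ student_t(9, 0, 1);
  target += binomial_logit_lupmf(num_successes | count, predictors * theta_beta_tilde);
}

Since the map is affine, it mostly standardizes scale and correlation near the mode, much like dense metric adaptation; it won’t remove the skewness itself, but it might make warmup and step-size adaptation less of a bottleneck.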

Special-casing Separability

Maybe there’s another model that we could use when we detect separability? If so, what would it be?

Adjusting Priors

Are there priors that would produce better-shaped posteriors? I’ve played with various degrees of freedom for the Student-t prior (including a normal), and while some run faster than others, I haven’t yet found one that was “fast”.

Adjusting Hamiltonian Monte Carlo fitting

Are there sampler settings I can pass to Stan that would help it fit this type of data better?

How you can help

I’m eager to hear what you all think of these ideas, or what other ideas might help.

I’d be interested to hear what others say, but also just want to ask: What is your inferential goal in cases of complete separability?

You’re willing to entertain adjusting your priors, but under separability the posterior can be quite sensitive to the prior, so for the posterior to be meaningful you might have to make a really considered choice of prior.

If you want to compute some particular posterior (based on some particular prior) then we should be thinking about whether a good reparameterization is available. But if you don’t really care about accurately computing any particular posterior (based on a particular prior), it might be worth asking in the first place whether you really need accurate MCMC inference for cases of separability.


That’s a fantastic question. I run this model on a wide range of datasets, and I don’t know a priori which ones have separability. Since there are only a few of them, I mainly just want something that works well enough.

More to the point, we have many datasets where the predictors, X, are the same, but the outcomes, y, differ. Rarely, the y’s are perfectly predicted by the X’s, and that’s when we get separability.

Here’s a more concrete example. Suppose I add a predictor to X such that the outcome y_1 has complete separability with respect to X. Then I can use information about the posterior \theta_1 to make qualitative inferences about the posterior \theta_2 for the outcome y_2. This is especially true when the outcome y_2 is close to the outcome y_1.

So, I really do care about accuracy, but I’m willing to trade off a bit of accuracy if it gets me more performance.

“…then we should be thinking about whether a good reparameterization is available.”

Yes, I agree that this is the best path.

I’ve written code to find the mode of the posterior. This point has zero gradient and a negative-definite Hessian. I was wondering about trying to find a reparameterization that makes the third derivative (in the direction of theta_beta_tilde[1] in the images) zero. Or, to put it another way, one that makes the skewness of theta_beta_tilde[1] zero.