I’m experiencing significant performance slowdowns in Stan when fitting logistic regression models on separable or near-separable data.
Performance is important, so we have implemented a few optimizations. We use a QR decomposition to reduce the posterior correlation between parameters, and we exploit sufficient statistics to reduce the time spent computing the gradient. These optimizations have a noticeable impact on performance, but we’re still struggling in a few corner cases.
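For reference, predictors and r_star_inverse are precomputed outside Stan from a QR decomposition of the raw design matrix. A rough sketch of one standard construction (the QR reparameterization recipe from the Stan User's Guide, written here as a Stan fragment with a hypothetical raw design matrix X, and assuming no columns are dropped so num_predictors == num_original_predictors):

data {
  int<lower=1> num_rows;
  int<lower=0> num_predictors;
  matrix[num_rows, num_predictors] X; // raw design matrix (hypothetical name)
}
transformed data {
  // Thin QR decomposition with the User's Guide scaling by sqrt(num_rows - 1)
  matrix[num_rows, num_predictors] q_star = qr_thin_Q(X) * sqrt(num_rows - 1);
  matrix[num_predictors, num_predictors] r_star = qr_thin_R(X) / sqrt(num_rows - 1);
  // In the model below, predictors plays the role of q_star and
  // r_star_inverse the role of inverse(r_star).
  matrix[num_predictors, num_predictors] r_star_inverse = inverse(r_star);
}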
Here is our current model.
data {
  int<lower=0> num_predictors;          // Number of predictors (after QR decomposition)
  int<lower=0> num_original_predictors; // Number of predictors *before* QR decomposition
  int<lower=1> num_rows;                // Number of rows
  matrix[num_rows, num_predictors] predictors;    // QR-rotated design matrix
  array[num_rows] int<lower=0> num_successes;     // Successes per row
  array[num_rows] int<lower=1> count;             // Number of repetitions per row
  matrix[num_original_predictors, num_predictors] r_star_inverse; // Maps QR-scale coefficients back
}
parameters {
  vector[num_predictors] theta_beta_tilde; // Coefficients on the QR-rotated scale
}
transformed parameters {
  vector[num_original_predictors] beta = r_star_inverse * theta_beta_tilde; // Original-scale coefficients
}
model {
  theta_beta_tilde ~ student_t(9, 0, 1);
  // Binomial likelihood over aggregated rows (the sufficient-statistics optimization)
  target += binomial_logit_lupmf(num_successes | count, predictors * theta_beta_tilde);
}
However, our model can run 10-100x slower (3 hours vs. 5 minutes) when the data are separable or nearly separable. Under separability, the maximum likelihood estimates of one or more coefficients diverge to infinity, so only the inclusion of a prior keeps the problem well-defined.
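To make that concrete with a toy example: with a single predictor x, no intercept, and complete separation (y_i = 1 exactly when x_i > 0), the log-likelihood

$$\ell(\beta) = \sum_{i:\,y_i = 1} \log\bigl(\operatorname{logit}^{-1}(\beta x_i)\bigr) + \sum_{i:\,y_i = 0} \log\bigl(1 - \operatorname{logit}^{-1}(\beta x_i)\bigr)$$

increases monotonically toward 0 as beta grows, so the likelihood alone has no maximum and only the prior pins beta down.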
Even when the prior makes the model well-defined, though, the shape of the posterior seems to give Stan difficulties: it appears to be funnel-shaped, which is well known to cause Stan problems.
I would love to hear approaches for speeding up model fitting when there is separability.
Here are some ideas that I’ve had:
Transforming the parameters
My guess as to why model fitting is so slow is that the curvature of the posterior along theta_beta_tilde[1] is highly variable: Stan spends as much time sampling a small region of high posterior density as it does a much larger region of lower density. However, I’m not sure what a good transform would be. Further, because the curvature comes from the data rather than the model, we would have to compute a transform numerically.
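As a very rough sketch of what such a transform could look like (illustration only, not something I have validated): if we could get an approximate posterior covariance for theta_beta_tilde, say from a Laplace approximation at the penalized mode, its Cholesky factor could be passed in as data and used as a fixed linear reparameterization. Because the map is linear, the Jacobian is constant, so the prior can still be written on theta_beta_tilde. I suspect this largely duplicates what a dense metric already does, so it may not fix the funnel itself.

data {
  int<lower=0> num_predictors;
  int<lower=0> num_original_predictors;
  int<lower=1> num_rows;
  matrix[num_rows, num_predictors] predictors;
  array[num_rows] int<lower=0> num_successes;
  array[num_rows] int<lower=1> count;
  matrix[num_original_predictors, num_predictors] r_star_inverse;
  // Hypothetical extra input: Cholesky factor of an approximate posterior
  // covariance for theta_beta_tilde (e.g. from a Laplace approximation at the mode)
  cholesky_factor_cov[num_predictors] theta_cov_chol;
}
parameters {
  vector[num_predictors] theta_raw; // sampled on an approximately whitened scale
}
transformed parameters {
  vector[num_predictors] theta_beta_tilde = theta_cov_chol * theta_raw;
  vector[num_original_predictors] beta = r_star_inverse * theta_beta_tilde;
}
model {
  // theta_raw -> theta_beta_tilde is a fixed linear map with constant Jacobian,
  // so the prior can still be stated on theta_beta_tilde
  theta_beta_tilde ~ student_t(9, 0, 1);
  target += binomial_logit_lupmf(num_successes | count, predictors * theta_beta_tilde);
}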
Special-casing Separability
Maybe there’s another model that we could use when we detect separability? If so, what would it be?
Adjusting Priors
Are there priors that would produce a better-shaped posterior? I’ve played with various degrees of freedom for the Student t prior (including a normal prior), and while some run faster than others, I haven’t yet found one that is “fast”.
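For concreteness, the variants I’ve been comparing differ only in this one line of the model block (the specific degrees of freedom below are just illustrative):

model {
  // Only this prior line changes between variants
  theta_beta_tilde ~ student_t(3, 0, 1);     // heavier tails
  // theta_beta_tilde ~ student_t(30, 0, 1); // close to normal
  // theta_beta_tilde ~ normal(0, 1);        // normal limit
  target += binomial_logit_lupmf(num_successes | count, predictors * theta_beta_tilde);
}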
Adjusting Hamiltonian Monte Carlo fitting
Are there settings or arguments I can provide to Stan that would help it fit this type of data better?
How you can help
I’m eager to hear what you all think of these ideas, or what other ideas might help.