# Seeking Insights on Addressing Quasi-Complete Separation in Logistic Regression Models in Stan

I’m reaching out for ideas to address sampling inefficiency caused by quasi-complete separation, which I encounter in about 1% of my datasets. The majority of the datasets work fine, but this small subset poses a challenge.

Model Overview:

• Standard logistic regression, using QR decomposition to orthogonalize the predictors.
• Data compressed into unique rows with repetition counts (significant given the large datasets).
• `binomial_logit` on the compressed counts for efficiency.
``````stan
data {
  int<lower=0> num_predictors;           // number of predictors
  int<lower=0> num_original_predictors;  // number of predictors *before* QR decomposition
  int<lower=1> num_rows;                 // number of rows
  matrix[num_rows, num_predictors] predictors;
  array[num_rows] int<lower=0, upper=1> y;
  array[num_rows] int<lower=1> count;    // number of repetitions of each row
  matrix[num_original_predictors, num_predictors] r_star_inverse;
  vector[num_original_predictors] beta_scale;
}
transformed data {
  array[num_rows] int num_successes;
  for (i in 1:num_rows) {
    num_successes[i] = y[i] * count[i];
  }
}
parameters {
  real alpha;
  vector[num_predictors] beta_tilde;
}
transformed parameters {
  vector[num_original_predictors] beta = r_star_inverse * beta_tilde;
}
model {
  beta ~ normal(0, beta_scale);
  alpha ~ normal(0, 1);

  target += binomial_logit_lpmf(num_successes | count, predictors * beta_tilde + alpha);
}
``````
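
For readers unfamiliar with the QR reparameterization, here is a minimal Python sketch of the standard thin-QR construction (the Stan User's Guide version; the example `X` is hypothetical, and the exact scaling in my pipeline differs slightly because of the row-count compression):

```python
import numpy as np

def thin_qr_inputs(X):
    """Thin-QR reparameterization inputs, as in the Stan User's Guide.

    Returns Q* (passed to the model as `predictors`) and inverse(R*)
    (passed as `r_star_inverse`), where X = Q* @ R* and the columns
    of Q* are orthogonal.
    """
    n = X.shape[0]
    Q, R = np.linalg.qr(X)           # thin QR: Q is n x p, R is p x p
    Q_star = Q * np.sqrt(n - 1.0)    # rescale so Q* has unit column scale
    R_star = R / np.sqrt(n - 1.0)
    return Q_star, np.linalg.inv(R_star)

# Hypothetical 3-row, 1-predictor design for illustration.
X = np.array([[-0.2], [-0.2], [0.8]])
Q_star, R_star_inv = thin_qr_inputs(X)
# beta on the original scale is recovered as r_star_inverse @ beta_tilde,
# exactly as in the model's transformed parameters block.
```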

Challenge Faced:

Datasets exhibiting quasi-complete separation show low effective sample sizes. The following dataset and fit illustrate the problem:

``````json
{
  "beta_scale": [ 1.0 ],
  "count": [ 100000, 100000, 50000 ],
  "num_original_predictors": 1,
  "num_predictors": 1,
  "num_rows": 3,
  "predictors": [
    [ -0.5 ],
    [ -0.5 ],
    [ 2.0 ]
  ],
  "r_star_inverse": [
    [ 2.4999999999999996 ]
  ],
  "y": [ 0, 1, 0 ]
}
``````
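
To make the quasi-separation concrete: pooling rows by covariate value shows that every observation at the largest covariate value is a failure. A quick check (with the dataset above hard-coded):

```python
from collections import defaultdict

x = [-0.5, -0.5, 2.0]
y = [0, 1, 0]
count = [100000, 100000, 50000]

# Pool successes and trials by covariate value.
successes = defaultdict(int)
trials = defaultdict(int)
for xi, yi, ci in zip(x, y, count):
    successes[xi] += yi * ci
    trials[xi] += ci

rates = {xi: successes[xi] / trials[xi] for xi in trials}
print(rates)  # {-0.5: 0.5, 2.0: 0.0} -- all 50000 trials at x = 2.0 fail
```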
``````
                   Mean    MCSE  StdDev        5%       50%       95%  N_Eff  N_Eff/s  R_hat
name
lp__          -140000.0  0.0330   1.000 -140000.0 -140000.0 -140000.0  950.0   2400.0    1.0
alpha              -1.7  0.0026   0.062      -1.8      -1.7      -1.6  580.0   1500.0    1.0
beta_tilde[1]      -3.5  0.0051   0.120      -3.7      -3.5      -3.3  580.0   1454.0    1.0
beta[1]            -8.7  0.0130   0.310      -9.2      -8.7      -8.2  580.0   1454.0    1.0

Correlations
alpha   beta_tilde[1]  beta[1]
alpha          1.000000       0.997565  0.997565
beta_tilde[1]  0.997565       1.000000  1.000000
beta[1]        0.997565       1.000000  1.000000
``````

The strong correlation between `alpha` and `beta_tilde` makes sampling inefficient. As the number of predictors increases, maximum-treedepth and R-hat warnings become more frequent.

Comparative Example:

For contrast, datasets with perfect separation show better performance:

``````json
{
  "beta_scale": [ 1.0 ],
  "count": [ 150000, 100000 ],
  "num_original_predictors": 1,
  "num_predictors": 1,
  "num_rows": 2,
  "predictors": [
    [ -0.8164965809277259 ],
    [ 1.224744871391589 ]
  ],
  "r_star_inverse": [
    [ 2.0412414523193148 ]
  ],
  "y": [ 0, 1 ]
}
``````

This produces the following summary (note the higher ESS) and a much smaller correlation between `alpha` and `beta_tilde`:

``````
                Mean    MCSE  StdDev     5%    50%    95%   N_Eff  N_Eff/s  R_hat
name
lp__          -190.0  0.0250    1.00 -200.0 -190.0 -190.0  1600.0  26000.0    1.0
alpha           -1.9  0.0033    0.17   -2.2   -1.9   -1.6  2600.0  42000.0    1.0
beta_tilde[1]    8.7  0.0029    0.16    8.4    8.7    8.9  2950.0  48360.0    1.0
beta[1]         18.0  0.0059    0.32   17.0   18.0   18.0  2950.0  48359.0    1.0

Correlations
alpha  beta_tilde[1]   beta[1]
alpha         1.000000      -0.118937 -0.118936
beta_tilde[1] -0.118937      1.000000  1.000000
beta[1]       -0.118936      1.000000  1.000000
``````

Seeking Suggestions:

• How might I address quasi-separation without removing `alpha` or `beta_tilde`?
• Are there effective transformations to reduce correlation between `alpha` and `beta_tilde`?
• Could a joint prior on these parameters be beneficial?
• Any other constraints or methods that might help?

I’m open to all kinds of suggestions, whether they’re based on conventional approaches or innovative ones. If there’s any additional information that would be helpful, please let me know.

In this example, the model “knows” with very high precision what value the linear predictor should take when the covariate is -0.5, and also “knows” that the slope should be large in magnitude (here, negative), but is very uncertain about just how large. You can see intuitively why this induces a massive correlation between the slope and the intercept: take a steep line passing through `(-0.5, 0)` and vary the slope by rotating the line around `(-0.5, 0)`; each rotation changes the intercept at 0 in near-lockstep with the slope. The solution is to reparameterize in terms of the value of the linear predictor when the covariate is -0.5, rather than the intercept when the covariate is 0. Equivalently, shift the covariate to `[0, 0, 2.5]` instead of `[-0.5, -0.5, 2]`.
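
A quick way to see this effect without running MCMC is a Laplace (normal) approximation: the posterior correlation between intercept and slope is governed by the negative Hessian of the binomial-logit log posterior, X'WX plus the prior precision, with W = diag(n_i p_i (1 - p_i)). The sketch below is only an approximation (the parameter values are the posterior means reported above, and a unit-normal prior precision is assumed on both parameters), but it reproduces the near-unit correlation and shows it collapsing once the covariate is shifted so the high-information point sits at zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_corr(x, n, alpha, beta, prior_prec=1.0):
    """Intercept/slope correlation from a Laplace approximation.

    Negative Hessian of the binomial-logit log posterior:
    X'WX + prior precision, with W = diag(n_i * p_i * (1 - p_i)).
    """
    X = np.column_stack([np.ones_like(x), x])
    p = sigmoid(alpha + beta * x)
    W = n * p * (1.0 - p)
    H = X.T @ (W[:, None] * X) + prior_prec * np.eye(2)
    cov = np.linalg.inv(H)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

n = np.array([100000.0, 100000.0, 50000.0])

# Original covariates: information concentrates at x = -0.5, far from 0.
corr_orig = laplace_corr(np.array([-0.5, -0.5, 2.0]), n, alpha=-1.7, beta=-3.5)

# Shifted covariates: the well-identified linear predictor at old x = -0.5
# becomes the intercept itself (slope unchanged, intercept ~ 0).
corr_shift = laplace_corr(np.array([0.0, 0.0, 2.5]), n, alpha=0.05, beta=-3.5)

print(corr_orig)   # ~0.998, matching the reported MCMC correlation
print(corr_shift)  # ~-0.013
```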