# Seeking Insights on Addressing Quasi-Complete Separation in Logistic Regression Models in Stan

I’m reaching out for ideas to address sampling inefficiency caused by quasi-complete separation, which I encounter in about 1% of my datasets. The majority of the datasets work fine, but this small subset poses a challenge.

Model Overview:

• Standard logistic regression, using QR decomposition to orthogonalize the predictors.
• Data compressed into unique rows with repetition counts (significant given the large datasets).
• `binomial_logit` on the compressed counts for efficiency.
``````stan
data {
  int<lower=0> num_predictors;           // number of predictors
  int<lower=0> num_original_predictors;  // number of predictors *before* QR decomposition
  int<lower=1> num_rows;                 // number of rows
  matrix[num_rows, num_predictors] predictors;
  array[num_rows] int<lower=0, upper=1> y;
  array[num_rows] int<lower=1> count;    // number of repetitions of each row
  matrix[num_original_predictors, num_predictors] r_star_inverse;
  vector[num_original_predictors] beta_scale;
}
transformed data {
  array[num_rows] int num_successes;
  for (i in 1:num_rows) {
    num_successes[i] = y[i] * count[i];
  }
}
parameters {
  real alpha;
  vector[num_predictors] beta_tilde;
}
transformed parameters {
  vector[num_original_predictors] beta = r_star_inverse * beta_tilde;
}
model {
  beta ~ normal(0, beta_scale);
  alpha ~ normal(0, 1);

  target += binomial_logit_lpmf(num_successes | count, predictors * beta_tilde + alpha);
}
``````
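
For readers unfamiliar with the QR reparameterization, here is a minimal Python sketch of the standard thin-QR construction (the Stan User's Guide version; the example `X` is hypothetical, and the exact scaling in my pipeline differs slightly because of the row-count compression):

```python
import numpy as np

def thin_qr_inputs(X):
    """Thin-QR reparameterization inputs, as in the Stan User's Guide.

    Returns Q* (passed to the model as `predictors`) and inverse(R*)
    (passed as `r_star_inverse`), where X = Q* @ R* and the columns
    of Q* are orthogonal.
    """
    n = X.shape[0]
    Q, R = np.linalg.qr(X)           # thin QR: Q is n x p, R is p x p
    Q_star = Q * np.sqrt(n - 1.0)    # rescale so Q* has unit column scale
    R_star = R / np.sqrt(n - 1.0)
    return Q_star, np.linalg.inv(R_star)

# Hypothetical 3-row, 1-predictor design for illustration.
X = np.array([[-0.2], [-0.2], [0.8]])
Q_star, R_star_inv = thin_qr_inputs(X)
# beta on the original scale is recovered as r_star_inverse @ beta_tilde,
# exactly as in the model's transformed parameters block.
```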

Challenge Faced:

Datasets exhibiting quasi-complete separation show low effective sample sizes. The following dataset and fit illustrate the problem:

``````json
{
  "beta_scale": [ 1.0 ],
  "count": [ 100000, 100000, 50000 ],
  "num_original_predictors": 1,
  "num_predictors": 1,
  "num_rows": 3,
  "predictors": [
    [ -0.5 ],
    [ -0.5 ],
    [ 2.0 ]
  ],
  "r_star_inverse": [
    [ 2.4999999999999996 ]
  ],
  "y": [ 0, 1, 0 ]
}
``````
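
To make the quasi-separation concrete: pooling rows by covariate value shows that every observation at the largest covariate value is a failure. A quick check (with the dataset above hard-coded):

```python
from collections import defaultdict

x = [-0.5, -0.5, 2.0]
y = [0, 1, 0]
count = [100000, 100000, 50000]

# Pool successes and trials by covariate value.
successes = defaultdict(int)
trials = defaultdict(int)
for xi, yi, ci in zip(x, y, count):
    successes[xi] += yi * ci
    trials[xi] += ci

rates = {xi: successes[xi] / trials[xi] for xi in trials}
print(rates)  # {-0.5: 0.5, 2.0: 0.0} -- all 50000 trials at x = 2.0 fail
```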
``````
                   Mean    MCSE  StdDev        5%       50%       95%  N_Eff  N_Eff/s  R_hat
name
lp__          -140000.0  0.0330   1.000 -140000.0 -140000.0 -140000.0  950.0   2400.0    1.0
alpha              -1.7  0.0026   0.062      -1.8      -1.7      -1.6  580.0   1500.0    1.0
beta_tilde[1]      -3.5  0.0051   0.120      -3.7      -3.5      -3.3  580.0   1454.0    1.0
beta[1]            -8.7  0.0130   0.310      -9.2      -8.7      -8.2  580.0   1454.0    1.0

Correlations
alpha   beta_tilde[1]  beta[1]
alpha          1.000000       0.997565  0.997565
beta_tilde[1]  0.997565       1.000000  1.000000
beta[1]        0.997565       1.000000  1.000000
``````

The strong correlation between `alpha` and `beta_tilde` makes sampling inefficient. As the number of predictors increases, maximum-treedepth and R-hat warnings become more frequent.

Comparative Example:

For contrast, datasets with perfect separation show better performance:

``````json
{
  "beta_scale": [ 1.0 ],
  "count": [ 150000, 100000 ],
  "num_original_predictors": 1,
  "num_predictors": 1,
  "num_rows": 2,
  "predictors": [
    [ -0.8164965809277259 ],
    [ 1.224744871391589 ]
  ],
  "r_star_inverse": [
    [ 2.0412414523193148 ]
  ],
  "y": [ 0, 1 ]
}
``````

This produces the following summary (note the higher ESS) and a much smaller correlation between `alpha` and `beta_tilde`:

``````
                Mean    MCSE  StdDev     5%    50%    95%   N_Eff  N_Eff/s  R_hat
name
lp__          -190.0  0.0250    1.00 -200.0 -190.0 -190.0  1600.0  26000.0    1.0
alpha           -1.9  0.0033    0.17   -2.2   -1.9   -1.6  2600.0  42000.0    1.0
beta_tilde[1]    8.7  0.0029    0.16    8.4    8.7    8.9  2950.0  48360.0    1.0
beta[1]         18.0  0.0059    0.32   17.0   18.0   18.0  2950.0  48359.0    1.0

Correlations
alpha  beta_tilde[1]   beta[1]
alpha         1.000000      -0.118937 -0.118936
beta_tilde[1] -0.118937      1.000000  1.000000
beta[1]       -0.118936      1.000000  1.000000
``````

Seeking Suggestions:

• How might I address quasi-separation without removing `alpha` or `beta_tilde`?
• Are there effective transformations to reduce correlation between `alpha` and `beta_tilde`?
• Could a joint prior on these parameters be beneficial?
• Any other constraints or methods that might help?

I’m open to all kinds of suggestions, whether they’re based on conventional approaches or innovative ones. If there’s any additional information that would be helpful, please let me know.

In this example, the model “knows” with very high precision what value the linear predictor should take when the covariate is -0.5, and also “knows” that the slope should be large in magnitude (here, negative), but is very uncertain about just how large. You can see intuitively why this induces a massive correlation between the slope and the intercept: take a steep line passing through `(-0.5, 0)` and vary the slope by rotating the line around `(-0.5, 0)`; each rotation changes the intercept at 0 in near-lockstep with the slope. The solution is to reparameterize in terms of the value of the linear predictor when the covariate is -0.5, rather than the intercept when the covariate is 0. Equivalently, shift the covariate to `[0, 0, 2.5]` instead of `[-0.5, -0.5, 2]`.
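
A quick way to see this effect without running MCMC is a Laplace (normal) approximation: the posterior correlation between intercept and slope is governed by the negative Hessian of the binomial-logit log posterior, X'WX plus the prior precision, with W = diag(n_i p_i (1 - p_i)). The sketch below is only an approximation (the parameter values are the posterior means reported above, and a unit-normal prior precision is assumed on both parameters), but it reproduces the near-unit correlation and shows it collapsing once the covariate is shifted so the high-information point sits at zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_corr(x, n, alpha, beta, prior_prec=1.0):
    """Intercept/slope correlation from a Laplace approximation.

    Negative Hessian of the binomial-logit log posterior:
    X'WX + prior precision, with W = diag(n_i * p_i * (1 - p_i)).
    """
    X = np.column_stack([np.ones_like(x), x])
    p = sigmoid(alpha + beta * x)
    W = n * p * (1.0 - p)
    H = X.T @ (W[:, None] * X) + prior_prec * np.eye(2)
    cov = np.linalg.inv(H)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

n = np.array([100000.0, 100000.0, 50000.0])

# Original covariates: information concentrates at x = -0.5, far from 0.
corr_orig = laplace_corr(np.array([-0.5, -0.5, 2.0]), n, alpha=-1.7, beta=-3.5)

# Shifted covariates: the well-identified linear predictor at old x = -0.5
# becomes the intercept itself (slope unchanged, intercept ~ 0).
corr_shift = laplace_corr(np.array([0.0, 0.0, 2.5]), n, alpha=0.05, beta=-3.5)

print(corr_orig)   # ~0.998, matching the reported MCMC correlation
print(corr_shift)  # ~-0.013
```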