I’m reaching out for ideas to improve sampling performance in the roughly 1% of my datasets that exhibit quasi-complete separation. The majority of datasets work fine, but this small subset poses a challenge.
Model Overview:
- Standard logistic regression, using a QR decomposition to orthogonalize the predictors.
- Data compressed via row counts (important because the datasets are large); a sketch of this preprocessing follows the model code below.
- Efficiency via binomial_logit (one binomial term per compressed row rather than many Bernoulli terms).
data {
  int<lower=0> num_predictors;            // Number of predictors (columns of Q*)
  int<lower=0> num_original_predictors;   // Number of predictors *before* QR decomposition
  int<lower=1> num_rows;                  // Number of (compressed) rows
  matrix[num_rows, num_predictors] predictors;   // QR-transformed design (Q*)
  array[num_rows] int<lower=0, upper=1> y;       // Outcome shared by all repetitions of a row
  array[num_rows] int<lower=1> count;            // Number of repetitions of each row
  matrix[num_original_predictors, num_predictors] r_star_inverse;  // Maps beta_tilde back to beta
  vector[num_original_predictors] beta_scale;    // Prior scale for beta on the original scale
}
transformed data {
  // Turn the shared 0/1 outcome plus repetition count into binomial successes.
  array[num_rows] int num_successes;
  for (i in 1:num_rows) {
    num_successes[i] = y[i] * count[i];
  }
}
parameters {
  real alpha;                          // Intercept
  vector[num_predictors] beta_tilde;   // Coefficients on the orthogonalized (Q*) scale
}
transformed parameters {
  vector[num_original_predictors] beta = r_star_inverse * beta_tilde;  // Original-scale coefficients
}
model {
  vector[num_original_predictors] beta_mean = rep_vector(0, num_original_predictors);
  beta ~ normal(beta_mean, beta_scale);   // Prior stated on the original scale (linear transform, constant Jacobian)
  alpha ~ normal(0, 1);
  target += binomial_logit_lpmf(num_successes | count, predictors * beta_tilde + alpha);
}
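For reference, predictors, r_star_inverse, and count are all produced outside Stan. Below is a simplified NumPy sketch of that preprocessing (the helper name and the 0/1 raw predictor are just for illustration, and the rows are assumed to be already deduplicated with repetition counts): it centers the predictor columns using the counts as weights and applies a thin QR scaled by sqrt(N), where N is the total count. Up to floating-point rounding, this reproduces the predictors and r_star_inverse values in the example datasets further down.

import numpy as np

def qr_inputs(x_raw, count):
    """QR preprocessing sketch: x_raw holds the unique (compressed) predictor
    rows, count the number of repetitions of each row."""
    x_raw = np.asarray(x_raw, dtype=float)
    count = np.asarray(count, dtype=float)
    N = count.sum()                                       # total expanded observations
    col_means = (count[:, None] * x_raw).sum(axis=0) / N  # count-weighted column means
    x_c = x_raw - col_means                               # center the predictors
    # Thin QR of the count-expanded, centered design, computed via the weighted
    # Gram matrix so the expansion never has to be materialized (R'R = X'X).
    gram = (x_c * count[:, None]).T @ x_c
    r_star = np.linalg.cholesky(gram).T / np.sqrt(N)      # R* = R / sqrt(N)
    r_star_inverse = np.linalg.inv(r_star)
    predictors = x_c @ r_star_inverse                     # Q* rows for the unique rows
    return predictors, r_star_inverse

# First example dataset below: raw predictor 0/0/1 with the given counts.
q_star, r_inv = qr_inputs([[0.0], [0.0], [1.0]], [100000, 100000, 50000])
# q_star ~ [[-0.5], [-0.5], [2.0]], r_inv ~ [[2.5]]; beta_scale is set separately.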
Challenge Faced:
With datasets showing quasi-complete separation, we observe low effective sample sizes. This is exemplified in the following dataset and results:
{
  "beta_scale": [ 1.0 ],
  "count": [ 100000, 100000, 50000 ],
  "num_original_predictors": 1,
  "num_predictors": 1,
  "num_rows": 3,
  "predictors": [
    [ -0.5 ],
    [ -0.5 ],
    [ 2.0 ]
  ],
  "r_star_inverse": [
    [ 2.4999999999999996 ]
  ],
  "y": [ 0, 1, 0 ]
}
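The summaries and correlation tables in this post come from a CmdStanPy workflow roughly like the one below (the file names are placeholders, not my actual paths):

from cmdstanpy import CmdStanModel

# Placeholder file names: the Stan program is the model shown above and the
# JSON file is the dataset just listed.
model = CmdStanModel(stan_file="logistic_qr.stan")
fit = model.sample(data="quasi_separated.json", seed=1)

print(fit.summary())  # Mean, MCSE, StdDev, 5%, 50%, 95%, N_Eff, N_Eff/s, R_hat

params = ["alpha", "beta_tilde[1]", "beta[1]"]
print(fit.draws_pd()[params].corr())  # posterior correlations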
name            Mean       MCSE    StdDev    5%         50%        95%        N_Eff  N_Eff/s  R_hat
lp__            -140000.0  0.0330  1.000     -140000.0  -140000.0  -140000.0  950.0  2400.0   1.0
alpha           -1.7       0.0026  0.062     -1.8       -1.7       -1.6       580.0  1500.0   1.0
beta_tilde[1]   -3.5       0.0051  0.120     -3.7       -3.5       -3.3       580.0  1454.0   1.0
beta[1]         -8.7       0.0130  0.310     -9.2       -8.7       -8.2       580.0  1454.0   1.0
Correlations
alpha beta_tilde[1] beta[1]
alpha 1.000000 0.997565 0.997565
beta_tilde[1] 0.997565 1.000000 1.000000
beta[1] 0.997565 1.000000 1.000000
The strong correlation between alpha and beta_tilde is what makes sampling inefficient. As the number of predictors increases, maximum-treedepth and R-hat warnings also become more frequent for these datasets.
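To spell out where that correlation comes from in this dataset (a back-of-the-envelope reading of the numbers above, not a formal argument): the two predictor = -0.5 cells together have a 50% success rate, so they pin down

alpha - 0.5 * beta_tilde ≈ logit(0.5) = 0, i.e. alpha ≈ 0.5 * beta_tilde,

while the all-failure predictor = 2.0 cell only pushes alpha + 2 * beta_tilde towards -infinity, with little besides the prior holding it back. The posterior therefore stretches along the ridge alpha ≈ 0.5 * beta_tilde, consistent with the 0.998 correlation and the posterior means (-1.7 ≈ 0.5 × -3.5).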
Comparative Example:
For contrast, datasets with complete (rather than quasi-complete) separation perform better:
{
  "beta_scale": [ 1.0 ],
  "count": [ 150000, 100000 ],
  "num_original_predictors": 1,
  "num_predictors": 1,
  "num_rows": 2,
  "predictors": [
    [ -0.8164965809277259 ],
    [ 1.224744871391589 ]
  ],
  "r_star_inverse": [
    [ 2.0412414523193148 ]
  ],
  "y": [ 0, 1 ]
}
which produces the following summary (note the higher ESS) and much smaller correlations between alpha and beta_tilde:
name            Mean    MCSE    StdDev  5%      50%     95%     N_Eff   N_Eff/s  R_hat
lp__            -190.0  0.0250  1.00    -200.0  -190.0  -190.0  1600.0  26000.0  1.0
alpha           -1.9    0.0033  0.17    -2.2    -1.9    -1.6    2600.0  42000.0  1.0
beta_tilde[1]   8.7     0.0029  0.16    8.4     8.7     8.9     2950.0  48360.0  1.0
beta[1]         18.0    0.0059  0.32    17.0    18.0    18.0    2950.0  48359.0  1.0
Correlations
alpha beta_tilde[1] beta[1]
alpha 1.000000 -0.118937 -0.118936
beta_tilde[1] 0.118937 1.000000 1.000000
beta[1] 0.118936 1.000000 1.000000
Seeking Suggestions:
- How might I address quasi-separation without removing alpha or beta_tilde?
- Are there effective transformations to reduce the correlation between alpha and beta_tilde?
- Could a joint prior on these parameters be beneficial?
- Any other constraints or methods that might help?
I’m open to all kinds of suggestions, whether they’re based on conventional approaches or innovative ones. If there’s any additional information that would be helpful, please let me know.
Thanks in advance for any insights or advice!