Logistic model selection in rstan

I want to do logistic regression in rstan. I have 35 predictor variables and about 18,000 observations, and I want to use rstan to select the best model!


First, I think the projpred package can help me find the best model. But when I use rstanarm, I get this error:

Error: cannot allocate vector of size 2.5 Gb

My code is:

library(rstanarm)

t_prior <- student_t(df = 7, location = 0, scale = 2.5)
post1 <- stan_glm(y ~ ., data = data,
                  family = binomial(link = "logit"),
                  prior = t_prior, prior_intercept = t_prior, QR = TRUE,
                  seed = 14124869)

data.csv (5.2 MB)
The left column is the dummy variable y.

Any suggestions? Thanks~

Is your idea of a model just a subset of the 35 predictors? That’s a space of 2^35 possible models, so it’s a bit too large to use model-comparison techniques on. You can use something like the horseshoe to estimate a single model with shrinkage and then truncate the small values. Or you can just leave the variables in, as they won’t do much, and 35 variables isn’t too many to compute with (unless it’s a very high-performance setting, in which case you’re not using Stan anyway).

But that requires a different prior than the one you specify here. I don’t know if the horseshoe priors are an option for the stan_glm function, but then I’m not even sure which package that comes from (rstanarm?).
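If it is rstanarm, I believe its hs() prior can be passed to stan_glm(), and the shrink-then-truncate idea would look roughly like this (only a sketch; dat and the global_scale value are placeholders):

library(rstanarm)

hs_prior <- hs(df = 1, global_df = 1, global_scale = 0.01)  # global_scale is a guess here
fit <- stan_glm(y ~ ., data = dat, family = binomial(link = "logit"),
                prior = hs_prior, seed = 1)

# One crude truncation rule: keep only predictors whose 90% posterior
# interval excludes zero (the intercept can be dropped from this list).
ci <- posterior_interval(fit, prob = 0.9)
keep <- rownames(ci)[ci[, 1] > 0 | ci[, 2] < 0]
keep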

I’m also unclear on why you say y is a dummy variable. Isn’t that the observed data?

Thank you very much for your reply! I found the mistake in my data: some variables were still stored as character.
I’m sorry for writing "dummy variable"; y is definitely the observed data.
The variables usually have strong correlations, so the model has strong collinearity.
And I used projection predictive variable selection; the result is:


Family: binomial 
Link function: logit 

Formula: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
    x12 + x13 + x14 + x15 + x16 + x17 + x18 + cg1 + cg2 + cg3 + 
    cg4 + cg5 + cg6 + cg7 + cg8 + nimtaavg + tlmta + exretavg + 
    sigma + cashmta
Observations: 21342
Projection method: traditional
CV method: LOO search not included 
Search method: l1, maximum number of terms 15
Number of clusters used for selection: 1
Number of clusters used for prediction: 10

Selection Summary:
 size solution_terms elpd.loo   se  diff diff.se
    0           <NA>   -328.4 41.5 -84.3    16.9
    1            x14   -274.6 34.5 -30.5    10.4
    2       exretavg   -261.0 33.0 -16.9     7.5
    3             x4   -246.1 31.1  -2.1     5.9
    4            x18   -244.1 30.8  -0.1     5.8
    5            x16   -243.0 30.7   1.1     5.6
    6             x9   -240.2 30.4   3.9     4.4
    7          sigma   -238.9 30.5   5.2     4.4
    8            x15   -235.9 30.2   8.2     4.1
    9            cg6   -233.4 29.8  10.6     4.2
   10             x7   -232.4 29.5  11.7     4.2
   11            cg5   -232.2 29.5  11.9     4.3
   12             x8   -231.9 29.5  12.1     4.3
   13            cg8   -230.4 29.4  13.7     4.2
   14          tlmta   -227.3 29.0  16.8     3.8
   15             x3   -227.4 29.0  16.7     3.7

And I plotted the elpd and RMSE:

The suggested size is 3!
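(The plot and the suggested size come from calls roughly like these, where cvvs is the cv_varsel() result; just a sketch.)

# Predictive performance along the search path, relative to the reference model
plot(cvvs, stats = c("elpd", "rmse"), deltas = TRUE)

# Heuristic suggestion for the submodel size
suggest_size(cvvs)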
So I think the result is not right, but I can’t find the problem.
I haven’t tried the horseshoe prior yet; once I have results I will tell you!
Thanks very much~

Hi~ I used the horseshoe prior and it definitely takes more time to run!
Here are my prior and results:

D <- ncol(x)
n <- nrow(x)
p0 <- 7  # prior guess for the number of relevant variables
sigma <- 1 / sqrt(mean(y) * (1 - mean(y)))  # pseudo sigma
tau0 <- p0 / (D - p0) * sigma / sqrt(n)
prior_coeff <- hs(df = 1, global_df = 1, global_scale = tau0)  # tau ~ half-Cauchy(0, tau0^2)

post2 <- stan_glm(y ~ ., data = MM,
                  family = binomial(link = "logit"),
                  prior = prior_coeff, prior_intercept = t_prior,
                  seed = 14124869)
Model Info:
 function:     stan_glm
 family:       binomial [logit]
 formula:      y ~ .
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 21342
 predictors:   32

Estimates:
              mean   sd    10%   50%   90%
(Intercept)  -6.2    1.6  -8.3  -6.1  -4.3
x1           -0.1    0.3  -0.4   0.0   0.1
x2           -0.1    0.3  -0.5   0.0   0.1
x3           -0.1    0.5  -0.6   0.0   0.2
x4           -5.1    7.0 -15.9  -1.0   0.1
x5            0.0    0.0   0.0   0.0   0.0
x6            0.0    0.0   0.0   0.0   0.0
x7            0.0    0.0   0.0   0.0   0.1
x8           -0.4    0.9  -1.7  -0.1   0.2
x9           -0.6    0.4  -1.2  -0.6   0.0
x10           0.1    0.9  -0.5   0.0   0.7
x11           0.2    0.2   0.0   0.2   0.5
x12          -0.8    0.4  -1.3  -0.8  -0.2
x13           0.0    0.3  -0.2   0.0   0.4
x14          -3.2    4.0  -9.6  -0.9   0.1
x15           0.1    0.1   0.0   0.1   0.2
x16           1.0    1.5  -0.1   0.2   3.4
x17           0.1    0.1   0.0   0.1   0.3
x18          -0.6    1.4  -2.2   0.0   0.2
cg1          -0.1    0.5  -0.7   0.0   0.2
cg2          -0.2    0.5  -0.7   0.0   0.2
cg3           0.0    0.0   0.0   0.0   0.0
cg4          -0.7    1.8  -2.3   0.0   0.2
cg5           0.2    0.7  -0.2   0.0   1.0
cg6          -0.6    0.5  -1.2  -0.5   0.0
cg7           0.0    0.3  -0.3   0.0   0.3
cg8          -0.7    1.6  -2.6  -0.1   0.2
nimtaavg     -0.1    1.2  -0.7   0.0   0.5
tlmta         0.8    0.9   0.0   0.5   2.2
exretavg     -7.3    2.7 -10.6  -7.3  -3.9
sigma         0.4    0.7  -0.1   0.1   1.4
cashmta      -1.9    2.8  -6.3  -0.4   0.1

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 0.0    0.0  0.0   0.0   0.0  

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.0  1.0  1512 
x1            0.0  1.0  3034 
x2            0.0  1.0  3264 
x3            0.0  1.0  3246 
x4            0.3  1.0   589 
x5            0.0  1.0  4344 
x6            0.0  1.0  3058 
x7            0.0  1.0  2717 
x8            0.0  1.0  1430 
x9            0.0  1.0  1275 
x10           0.0  1.0  2878 
x11           0.0  1.0  1380 
x12           0.0  1.0  1098 
x13           0.0  1.0  3641 
x14           0.1  1.0   997 
x15           0.0  1.0  1200 
x16           0.0  1.0  1062 
x17           0.0  1.0  1725 
x18           0.0  1.0  1795 
cg1           0.0  1.0  2866 
cg2           0.0  1.0  2303 
cg3           0.0  1.0  3296 
cg4           0.0  1.0  1847 
cg5           0.0  1.0  2789 
cg6           0.0  1.0  1509 
cg7           0.0  1.0  3820 
cg8           0.0  1.0  1824 
nimtaavg      0.0  1.0  2835 
tlmta         0.0  1.0  1127 
exretavg      0.1  1.0  2083 
sigma         0.0  1.0  1810 
cashmta       0.1  1.0   661 
mean_PPD      0.0  1.0  4163 
log-posterior 0.3  1.0   723 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

And I also used projpred to select variables:

Family: binomial 
Link function: logit 

Formula: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
    x12 + x13 + x14 + x15 + x16 + x17 + x18 + cg1 + cg2 + cg3 + 
    cg4 + cg5 + cg6 + cg7 + cg8 + nimtaavg + tlmta + exretavg + 
    sigma + cashmta
Observations: 21342
Projection method: traditional
CV method: LOO search not included 
Search method: l1, maximum number of terms 15
Number of clusters used for selection: 1
Number of clusters used for prediction: 10

Selection Summary:
 size solution_terms elpd.loo   se  diff diff.se
    0           <NA>   -328.4 41.4 -79.5    17.2
    1            x14   -272.6 34.5 -23.8     9.6
    2            x12   -272.9 34.9 -24.0     9.1
    3       exretavg   -259.7 33.4 -10.9     6.1
    4             x4   -245.4 31.6   3.4     4.3
    5            x18   -243.9 31.4   4.9     4.4
    6            x16   -242.7 31.3   6.2     4.4
    7            x15   -242.4 31.3   6.4     4.5
    8             x9   -238.0 30.9  10.8     3.8
    9        cashmta   -238.5 31.0  10.3     3.7
   10            cg6   -235.8 30.5  13.0     3.7
   11             x8   -235.8 30.6  13.1     3.6
   12            x17   -235.7 30.6  13.1     3.4
   13             x7   -234.7 30.4  14.1     3.5
   14             x6   -234.2 30.3  14.6     3.2
   15          sigma   -233.8 30.2  15.1     3.2

Maybe the small number of y == 1 cases compared to the large sample in my dataset caused these results?
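(A quick check of that imbalance, assuming the 0/1 outcome is the column y of the modelling data MM; a sketch.)

table(MM$y)   # counts of 0s and 1s
mean(MM$y)    # proportion of y == 1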

Just some food for thought based on my experience with the projpred package. It’s a fantastic feature-selection tool, although you may want to play around with the HS prior, as it can pretty aggressively regularize your estimates. This is generally a feature, not a bug, but it may be why it doesn’t match your intuitions about the expected outcome (although I would default to trusting it, as sometimes the regular features of our data aren’t what we expected, and we are fooled by sparsity and irregularities). Typically, the number of rows/features won’t be a problem in a properly posed model run in Stan, other than possibly producing estimates with more uncertainty than you may expect to see. Do you have any more info on why you believe the answer you’re getting is ‘wrong’ or why the results aren’t true?
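For example, one knob to play with is the prior guess p0 behind global_scale: a larger p0 inflates tau0 and gives less aggressive shrinkage. A rough sketch, reusing the names from your earlier horseshoe post:

p0 <- 15  # assumption: allow more relevant predictors a priori
tau0 <- p0 / (D - p0) * sigma / sqrt(n)
prior_loose <- hs(df = 1, global_df = 1, global_scale = tau0)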

Thanks for your reply!
I agree with your suggestions; I will make some comparisons with other models!
When I use the cv_varsel() function and set validate_search = TRUE, it takes a long time to get a result.
I want to know: if I set validate_search = FALSE, can I trust the results?


No problem! My experience with Stan, and Bayesian modelling in general, is that if you have the time, it’s always worth doing the ‘full’ option over any approximation, whether that is MCMC vs. a variational approximation or, in this case, the full cross-validation. On the package homepage, they provide the following:

There are two functions for performing the variable selection: varsel() and cv_varsel(). In contrast to varsel(), cv_varsel() performs a cross-validation (CV) by running the search part with the training data of each CV fold separately (an exception is validate_search = FALSE, see ?cv_varsel and below) and running the evaluation part on the corresponding test set of each CV fold. Because of this CV, cv_varsel() is recommended over varsel(). Thus, we use cv_varsel() here. Nonetheless, running varsel() first can offer a rough idea of the performance of the submodels (after projecting the reference model onto them). A more principled projpred workflow is work under progress.

I think this is pretty solid advice to follow. If you’re just testing out several different model formulations, the varsel() option is fine, but for any sort of downstream or ‘final’ analysis, cv_varsel() is worth the wait to get the proper cross-validation.
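In code, that workflow looks roughly like this (a sketch, with fit standing in for your reference model):

vs <- varsel(fit, method = "l1", nterms_max = 15)   # quick, approximate look while iterating
plot(vs, stats = "elpd", deltas = TRUE)

cvvs <- cv_varsel(fit, method = "l1", nterms_max = 15,   # final run: cross-validate the search itself
                  validate_search = TRUE)
summary(cvvs)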

Best of luck.

Thanks for your reply!
I actually used cv_varsel() and set validate_search = TRUE.
But the result makes me confused.

Family: binomial 
Link function: logit 

Formula: y ~ x1 + x2 + x3 + x5 + x6 + x8 + x9 + x10 + x11 + x12 + x13 + 
    x15 + x16 + x17 + cg1 + cg2 + cg5 + cg6 + cg9 + nimtaavg + 
    exretavg + sigma + rsize + cashmta + mb
Observations: 11071
Projection method: traditional
CV method: LOO search included 
Search method: l1, maximum number of terms 15
Number of clusters used for selection: 1
Number of clusters used for prediction: 10

Selection Summary:
 size solution_terms elpd.loo   se diff diff.se
    0           <NA>   -352.9 39.5 -0.5     0.1
    1            x12   -352.7 39.5 -0.3     0.1
    2            x13   -352.8 39.5 -0.4     0.1
    3            x16   -352.8 39.5 -0.4     0.1
    4             x5   -352.7 39.5 -0.3     0.1
    5            x10   -352.7 39.5 -0.3     0.1
    6       exretavg   -352.7 39.4 -0.3     0.1
    7          sigma   -352.5 39.4 -0.1     0.1
    8             x6   -352.5 39.4 -0.1     0.1
    9             x8   -352.5 39.4 -0.1     0.1
   10            cg6   -352.5 39.4 -0.1     0.1
   11            cg9   -352.5 39.4 -0.1     0.1
   12            x15   -352.5 39.4 -0.1     0.1
   13             mb   -352.5 39.4 -0.1     0.1
   14            cg1   -352.5 39.4 -0.1     0.1
   15            cg5   -352.5 39.4 -0.1     0.1

And when I used suggest_size() I got a result of NA:

Warning message:
In suggest_size.vsel(cvvs2) :
  Could not suggest submodel size. Investigate plot.vsel() to identify if the search was terminated too early. If this is the case, run variable selection with larger value for `nterms_max`.
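(As the warning suggests, plotting the selection object is the first thing to check; a sketch, with cvvs2 being the cv_varsel() result above.)

# If the elpd curve is still flat (or falling) at nterms_max, the search may
# have stopped too early; if it is flat from size 0 onward, no submodel is
# improving on the intercept-only model.
plot(cvvs2, stats = "elpd", deltas = TRUE)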

I think you need to look at your model a bit more carefully - this suggests that there is no optimal number of features, that nothing actually beats a model with no variables; there is no predictive power out-of-bag whatsoever. I think you may want to start with a smaller model, and possibly without your sample data - simulations can really help uncover modelling problems.
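For example, a small simulation with a handful of truly non-zero coefficients can show whether the workflow recovers them before you trust it on the real data. A sketch (all names here are made up):

library(rstanarm)
library(projpred)

set.seed(1)
n <- 2000
p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
beta <- c(1.5, -1, 0.8, rep(0, p - 3))        # only x1-x3 truly matter
y <- rbinom(n, 1, plogis(-2 + X %*% beta))    # rare-ish outcome, like the real data
sim <- data.frame(y = y, X)

fit_sim <- stan_glm(y ~ ., data = sim, family = binomial(link = "logit"),
                    prior = hs(df = 1, global_df = 1, global_scale = 0.01),
                    seed = 1)
cvvs_sim <- cv_varsel(fit_sim, method = "l1", nterms_max = 10)
summary(cvvs_sim)   # should rank x1-x3 first with a clear elpd gain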