Logistic model selection in rstan

I want to do logistic regression in RStan. I have 35 predictor variables and about 18,000 observations, and I want to use RStan to select the best model.


First, I think the projpred package can help me find the best model. But when I use rstanarm, I get this error:

Error: cannot allocate vector of size 2.5 Gb

My code is:

library(rstanarm)

t_prior <- student_t(df = 7, location = 0, scale = 2.5)
post1 <- stan_glm(y ~ ., data = data,
                  family = binomial(link = "logit"),
                  prior = t_prior, prior_intercept = t_prior, QR = TRUE,
                  seed = 14124869)

data.csv (5.2 MB)
The left-hand column is the dummy variable y.

Any suggestions? Thanks~

Is your idea of a model just a subset of the 35 predictors? That’s a space of 2^35 possible models, so it’s a bit too large to use model comparison techniques on. You can use something like the horseshoe to estimate a single model with shrinkage and then truncate the small values. Or you can just leave them in, as they won’t do much and 35 variables isn’t too many to compute with (unless it’s a very high-performance setting, in which case you’re not using Stan anyway).

But that requires a different prior than the one you specify here. I don’t know if the horseshoe priors are an option for the stan_glm function, but then I’m not even sure which package that comes from (rstanarm?).
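If the hs() prior from rstanarm does work with stan_glm, a rough sketch of what I have in mind would be something like this (the 0.1 cutoff below is purely illustrative, not a recommendation):

library(rstanarm)

# Horseshoe prior shrinks the coefficients of irrelevant predictors toward zero
fit_hs <- stan_glm(y ~ ., data = data,
                   family = binomial(link = "logit"),
                   prior = hs(),   # regularized horseshoe with default settings
                   prior_intercept = student_t(df = 7, location = 0, scale = 2.5),
                   seed = 14124869)

# Keep the coefficients whose posterior medians stay away from zero and
# "truncate" the rest (0.1 is an arbitrary cutoff for illustration)
med <- coef(fit_hs)[-1]   # drop the intercept
names(med)[abs(med) > 0.1]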

I’m also unclear on why you say y is a dummy variable. Isn’t that the observed data?

Thank you very much for your reply! I found the mistake in my data: some variables were still stored as character (checked as in the sketch below).
I’m sorry for the term "dummy variable"; y is indeed the observed data.
The predictor variables are strongly correlated with each other, so the model has strong collinearity.
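A minimal sketch of that check (data is the data frame from the first post):

# Which columns are still stored as character instead of numeric/factor?
names(data)[sapply(data, is.character)]

# str() shows the storage type of every column at once
str(data)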
I then ran projection predictive variable selection; the result is below.
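The call was roughly the following (a sketch; post1 stands for the reference fit above, and the argument names, taken from projpred 2.x, are my guess at what produced this output):

library(projpred)

# L1 search up to 15 terms; LOO is used only to evaluate the selected
# submodels, the search itself is not cross-validated
cvvs <- cv_varsel(post1,
                  method = "L1",
                  nterms_max = 15,
                  validate_search = FALSE)

summary(cvvs, stats = "elpd")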


Family: binomial 
Link function: logit 

Formula: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
    x12 + x13 + x14 + x15 + x16 + x17 + x18 + cg1 + cg2 + cg3 + 
    cg4 + cg5 + cg6 + cg7 + cg8 + nimtaavg + tlmta + exretavg + 
    sigma + cashmta
Observations: 21342
Projection method: traditional
CV method: LOO search not included 
Search method: l1, maximum number of terms 15
Number of clusters used for selection: 1
Number of clusters used for prediction: 10

Selection Summary:
 size solution_terms elpd.loo   se  diff diff.se
    0           <NA>   -328.4 41.5 -84.3    16.9
    1            x14   -274.6 34.5 -30.5    10.4
    2       exretavg   -261.0 33.0 -16.9     7.5
    3             x4   -246.1 31.1  -2.1     5.9
    4            x18   -244.1 30.8  -0.1     5.8
    5            x16   -243.0 30.7   1.1     5.6
    6             x9   -240.2 30.4   3.9     4.4
    7          sigma   -238.9 30.5   5.2     4.4
    8            x15   -235.9 30.2   8.2     4.1
    9            cg6   -233.4 29.8  10.6     4.2
   10             x7   -232.4 29.5  11.7     4.2
   11            cg5   -232.2 29.5  11.9     4.3
   12             x8   -231.9 29.5  12.1     4.3
   13            cg8   -230.4 29.4  13.7     4.2
   14          tlmta   -227.3 29.0  16.8     3.8
   15             x3   -227.4 29.0  16.7     3.7

I also plotted the elpd and RMSE against the submodel size (the plotting calls are sketched below).
The suggested size is 3!
So I don’t think the result is right, but I can’t find the problem.
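The plotting and suggested-size calls were along these lines (continuing the sketch above; cvvs is assumed from before):

# elpd and RMSE as a function of submodel size
plot(cvvs, stats = c("elpd", "rmse"))

# projpred's heuristic suggestion for the submodel size
suggest_size(cvvs)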
I haven’t tried the horseshoe prior yet; once I have results, I will let you know!
Thanks again~

Hi~ I used the horseshoe prior, and it definitely takes more time to run!
Here are my prior and results:

D <- ncol(x)      # number of predictors (x is the predictor matrix)
n <- nrow(x)      # number of observations
p0 <- 7           # prior guess for the number of relevant variables
sigma <- 1 / sqrt(mean(y) * (1 - mean(y)))   # pseudo sigma for the binomial model
tau0 <- p0 / (D - p0) * sigma / sqrt(n)
prior_coeff <- hs(df = 1, global_df = 1, global_scale = tau0)  # tau ~ half-Cauchy(0, tau0^2)

post2 <- stan_glm(y ~ ., data = MM,
                  family = binomial(link = "logit"),
                  prior = prior_coeff, prior_intercept = t_prior,
                  seed = 14124869)
Model Info:
 function:     stan_glm
 family:       binomial [logit]
 formula:      y ~ .
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 21342
 predictors:   32

Estimates:
              mean   sd    10%   50%   90%
(Intercept)  -6.2    1.6  -8.3  -6.1  -4.3
x1           -0.1    0.3  -0.4   0.0   0.1
x2           -0.1    0.3  -0.5   0.0   0.1
x3           -0.1    0.5  -0.6   0.0   0.2
x4           -5.1    7.0 -15.9  -1.0   0.1
x5            0.0    0.0   0.0   0.0   0.0
x6            0.0    0.0   0.0   0.0   0.0
x7            0.0    0.0   0.0   0.0   0.1
x8           -0.4    0.9  -1.7  -0.1   0.2
x9           -0.6    0.4  -1.2  -0.6   0.0
x10           0.1    0.9  -0.5   0.0   0.7
x11           0.2    0.2   0.0   0.2   0.5
x12          -0.8    0.4  -1.3  -0.8  -0.2
x13           0.0    0.3  -0.2   0.0   0.4
x14          -3.2    4.0  -9.6  -0.9   0.1
x15           0.1    0.1   0.0   0.1   0.2
x16           1.0    1.5  -0.1   0.2   3.4
x17           0.1    0.1   0.0   0.1   0.3
x18          -0.6    1.4  -2.2   0.0   0.2
cg1          -0.1    0.5  -0.7   0.0   0.2
cg2          -0.2    0.5  -0.7   0.0   0.2
cg3           0.0    0.0   0.0   0.0   0.0
cg4          -0.7    1.8  -2.3   0.0   0.2
cg5           0.2    0.7  -0.2   0.0   1.0
cg6          -0.6    0.5  -1.2  -0.5   0.0
cg7           0.0    0.3  -0.3   0.0   0.3
cg8          -0.7    1.6  -2.6  -0.1   0.2
nimtaavg     -0.1    1.2  -0.7   0.0   0.5
tlmta         0.8    0.9   0.0   0.5   2.2
exretavg     -7.3    2.7 -10.6  -7.3  -3.9
sigma         0.4    0.7  -0.1   0.1   1.4
cashmta      -1.9    2.8  -6.3  -0.4   0.1

Fit Diagnostics:
           mean   sd   10%   50%   90%
mean_PPD 0.0    0.0  0.0   0.0   0.0  

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.0  1.0  1512 
x1            0.0  1.0  3034 
x2            0.0  1.0  3264 
x3            0.0  1.0  3246 
x4            0.3  1.0   589 
x5            0.0  1.0  4344 
x6            0.0  1.0  3058 
x7            0.0  1.0  2717 
x8            0.0  1.0  1430 
x9            0.0  1.0  1275 
x10           0.0  1.0  2878 
x11           0.0  1.0  1380 
x12           0.0  1.0  1098 
x13           0.0  1.0  3641 
x14           0.1  1.0   997 
x15           0.0  1.0  1200 
x16           0.0  1.0  1062 
x17           0.0  1.0  1725 
x18           0.0  1.0  1795 
cg1           0.0  1.0  2866 
cg2           0.0  1.0  2303 
cg3           0.0  1.0  3296 
cg4           0.0  1.0  1847 
cg5           0.0  1.0  2789 
cg6           0.0  1.0  1509 
cg7           0.0  1.0  3820 
cg8           0.0  1.0  1824 
nimtaavg      0.0  1.0  2835 
tlmta         0.0  1.0  1127 
exretavg      0.1  1.0  2083 
sigma         0.0  1.0  1810 
cashmta       0.1  1.0   661 
mean_PPD      0.0  1.0  4163 
log-posterior 0.3  1.0   723 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

And I also used projpred to select variables:

Family: binomial 
Link function: logit 

Formula: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
    x12 + x13 + x14 + x15 + x16 + x17 + x18 + cg1 + cg2 + cg3 + 
    cg4 + cg5 + cg6 + cg7 + cg8 + nimtaavg + tlmta + exretavg + 
    sigma + cashmta
Observations: 21342
Projection method: traditional
CV method: LOO search not included 
Search method: l1, maximum number of terms 15
Number of clusters used for selection: 1
Number of clusters used for prediction: 10

Selection Summary:
 size solution_terms elpd.loo   se  diff diff.se
    0           <NA>   -328.4 41.4 -79.5    17.2
    1            x14   -272.6 34.5 -23.8     9.6
    2            x12   -272.9 34.9 -24.0     9.1
    3       exretavg   -259.7 33.4 -10.9     6.1
    4             x4   -245.4 31.6   3.4     4.3
    5            x18   -243.9 31.4   4.9     4.4
    6            x16   -242.7 31.3   6.2     4.4
    7            x15   -242.4 31.3   6.4     4.5
    8             x9   -238.0 30.9  10.8     3.8
    9        cashmta   -238.5 31.0  10.3     3.7
   10            cg6   -235.8 30.5  13.0     3.7
   11             x8   -235.8 30.6  13.1     3.6
   12            x17   -235.7 30.6  13.1     3.4
   13             x7   -234.7 30.4  14.1     3.5
   14             x6   -234.2 30.3  14.6     3.2
   15          sigma   -233.8 30.2  15.1     3.2

Maybe the small number of (y == 1) cases relative to the large sample in my dataset caused these results?
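One way to check that imbalance (a minimal sketch; MM is the data frame used in the horseshoe fit):

# How rare are the y == 1 cases among the 21342 observations?
table(MM$y)
mean(MM$y)   # proportion of ones; this is also what enters the pseudo-sigma above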