I wanna do logistic regression in Rstan. I have 35 dependent variables. and about 18000 observations. I want use Rstan to select the best model!
First, I think (projpred) packages can help me find the best model. But when I use (stannarm), there some error:
Error: cannot allocate vector of size 2.5 Gb
my code is :
t_prior <- student_t(df = 7, location = 0, scale = 2.5)
post1 <- stan_glm(y ~ ., data = data,
family = binomial(link = "logit"),
prior = t_prior, prior_intercept = t_prior, QR=TRUE,
seed = 14124869)
data.csv (5.2 MB)
and the left column is dummy variables y.
Some suggestions? Thanks~
Is your idea of model just a subset of the 35 predictors? That’s a space of 2^35 possible models, so it’s a bit too large to use model comparison techniques on. You can use something like horseshoe to estimate a single model with shrinkage and then truncate the small values. Or you can just leave them as they won’t do much and 35 variables isn’t too many to compute with (unless it’s a very high-performance setting, in which case you’re not using Stan anyway).
But that requires a different prior than the one you specify here. I don’t know if the horseshoe priors are an option for the stan_glm
function, but then I’m not even sure which package that comes from (rstanarm?).
I’m also unclear on why you say y
is a dummy variable. Isn’t that the observed data?
Very thanks for your reply! I have found the mistakes of my data that have some variable is still character.
I’m sorry to use a "dummy variable ", the y is definitely the observed data.
The variables usually have strong correlations, the model have strong collinearity.
And I use the projection predictive variable selection. the result is
Family: binomial
Link function: logit
Formula: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
x12 + x13 + x14 + x15 + x16 + x17 + x18 + cg1 + cg2 + cg3 +
cg4 + cg5 + cg6 + cg7 + cg8 + nimtaavg + tlmta + exretavg +
sigma + cashmta
Observations: 21342
Projection method: traditional
CV method: LOO search not included
Search method: l1, maximum number of terms 15
Number of clusters used for selection: 1
Number of clusters used for prediction: 10
Selection Summary:
size solution_terms elpd.loo se diff diff.se
0 <NA> -328.4 41.5 -84.3 16.9
1 x14 -274.6 34.5 -30.5 10.4
2 exretavg -261.0 33.0 -16.9 7.5
3 x4 -246.1 31.1 -2.1 5.9
4 x18 -244.1 30.8 -0.1 5.8
5 x16 -243.0 30.7 1.1 5.6
6 x9 -240.2 30.4 3.9 4.4
7 sigma -238.9 30.5 5.2 4.4
8 x15 -235.9 30.2 8.2 4.1
9 cg6 -233.4 29.8 10.6 4.2
10 x7 -232.4 29.5 11.7 4.2
11 cg5 -232.2 29.5 11.9 4.3
12 x8 -231.9 29.5 12.1 4.3
13 cg8 -230.4 29.4 13.7 4.2
14 tlmta -227.3 29.0 16.8 3.8
15 x3 -227.4 29.0 16.7 3.7
And I plot the elpd and rmse:
The suggest size is 3 !
So I think the result is not true. But I can’t find the problem
The horseshoe prior I haven’t tried it , and If I have the results I will tell you!
very thanks~
Hi~ I use the horseshoe prior and it definitely uses more time to run!
here is my prior and result:
D <- ncol (x)
n <- nrow (x)
p0 <- 7 # prior guess for the number of relevant variables
sigma <- 1 / sqrt ( mean (y)*(1 - mean (y ))) # pseudo sigma
tau0 <- p0 /(D-p0) * sigma / sqrt (n)
prior_coeff <- hs(df =1, global_df =1, global_scale = tau0 ) # tau ~half - Cauchy (0, tau0 ^2)
post2 <- stan_glm(y ~ ., data = MM,
family = binomial(link = "logit"),
prior = prior_coeff,prior_intercept = t_prior,
seed = 14124869)
Model Info:
function: stan_glm
family: binomial [logit]
formula: y ~ .
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 21342
predictors: 32
Estimates:
mean sd 10% 50% 90%
(Intercept) -6.2 1.6 -8.3 -6.1 -4.3
x1 -0.1 0.3 -0.4 0.0 0.1
x2 -0.1 0.3 -0.5 0.0 0.1
x3 -0.1 0.5 -0.6 0.0 0.2
x4 -5.1 7.0 -15.9 -1.0 0.1
x5 0.0 0.0 0.0 0.0 0.0
x6 0.0 0.0 0.0 0.0 0.0
x7 0.0 0.0 0.0 0.0 0.1
x8 -0.4 0.9 -1.7 -0.1 0.2
x9 -0.6 0.4 -1.2 -0.6 0.0
x10 0.1 0.9 -0.5 0.0 0.7
x11 0.2 0.2 0.0 0.2 0.5
x12 -0.8 0.4 -1.3 -0.8 -0.2
x13 0.0 0.3 -0.2 0.0 0.4
x14 -3.2 4.0 -9.6 -0.9 0.1
x15 0.1 0.1 0.0 0.1 0.2
x16 1.0 1.5 -0.1 0.2 3.4
x17 0.1 0.1 0.0 0.1 0.3
x18 -0.6 1.4 -2.2 0.0 0.2
cg1 -0.1 0.5 -0.7 0.0 0.2
cg2 -0.2 0.5 -0.7 0.0 0.2
cg3 0.0 0.0 0.0 0.0 0.0
cg4 -0.7 1.8 -2.3 0.0 0.2
cg5 0.2 0.7 -0.2 0.0 1.0
cg6 -0.6 0.5 -1.2 -0.5 0.0
cg7 0.0 0.3 -0.3 0.0 0.3
cg8 -0.7 1.6 -2.6 -0.1 0.2
nimtaavg -0.1 1.2 -0.7 0.0 0.5
tlmta 0.8 0.9 0.0 0.5 2.2
exretavg -7.3 2.7 -10.6 -7.3 -3.9
sigma 0.4 0.7 -0.1 0.1 1.4
cashmta -1.9 2.8 -6.3 -0.4 0.1
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.0 0.0 0.0 0.0 0.0
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 1512
x1 0.0 1.0 3034
x2 0.0 1.0 3264
x3 0.0 1.0 3246
x4 0.3 1.0 589
x5 0.0 1.0 4344
x6 0.0 1.0 3058
x7 0.0 1.0 2717
x8 0.0 1.0 1430
x9 0.0 1.0 1275
x10 0.0 1.0 2878
x11 0.0 1.0 1380
x12 0.0 1.0 1098
x13 0.0 1.0 3641
x14 0.1 1.0 997
x15 0.0 1.0 1200
x16 0.0 1.0 1062
x17 0.0 1.0 1725
x18 0.0 1.0 1795
cg1 0.0 1.0 2866
cg2 0.0 1.0 2303
cg3 0.0 1.0 3296
cg4 0.0 1.0 1847
cg5 0.0 1.0 2789
cg6 0.0 1.0 1509
cg7 0.0 1.0 3820
cg8 0.0 1.0 1824
nimtaavg 0.0 1.0 2835
tlmta 0.0 1.0 1127
exretavg 0.1 1.0 2083
sigma 0.0 1.0 1810
cashmta 0.1 1.0 661
mean_PPD 0.0 1.0 4163
log-posterior 0.3 1.0 723
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
And I also use the projpred to select variables:
Family: binomial
Link function: logit
Formula: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
x12 + x13 + x14 + x15 + x16 + x17 + x18 + cg1 + cg2 + cg3 +
cg4 + cg5 + cg6 + cg7 + cg8 + nimtaavg + tlmta + exretavg +
sigma + cashmta
Observations: 21342
Projection method: traditional
CV method: LOO search not included
Search method: l1, maximum number of terms 15
Number of clusters used for selection: 1
Number of clusters used for prediction: 10
Selection Summary:
size solution_terms elpd.loo se diff diff.se
0 <NA> -328.4 41.4 -79.5 17.2
1 x14 -272.6 34.5 -23.8 9.6
2 x12 -272.9 34.9 -24.0 9.1
3 exretavg -259.7 33.4 -10.9 6.1
4 x4 -245.4 31.6 3.4 4.3
5 x18 -243.9 31.4 4.9 4.4
6 x16 -242.7 31.3 6.2 4.4
7 x15 -242.4 31.3 6.4 4.5
8 x9 -238.0 30.9 10.8 3.8
9 cashmta -238.5 31.0 10.3 3.7
10 cg6 -235.8 30.5 13.0 3.7
11 x8 -235.8 30.6 13.1 3.6
12 x17 -235.7 30.6 13.1 3.4
13 x7 -234.7 30.4 14.1 3.5
14 x6 -234.2 30.3 14.6 3.2
15 sigma -233.8 30.2 15.1 3.2
maybe the little (y==1) compare to large sample in my dataset caused that results?