Hey everyone,
I’m still at the beginning of learning Bayesian statistics and Stan, so please excuse me if something in my post or code makes little sense :) I’m sure my code is not the cleanest or most efficient possible, but I tried my best.
For a research project, we are trying to fit a linear regression model with a mixture prior on the coefficient vector. The aim of the project is to identify patterns in the coefficients and clusters of variables with a similar effect, so we want to estimate both the coefficients and the mu and sigma parameters of the mixture components. Further, we add an additional mixture component with its mean restricted to 0, so that variables without an effect are clustered into this component. Our original data set consists of around 3,700 observations and around 1,350 predictors on the same scale, plus a handful of additional control variables that we would exclude from the clustering. We assume that only a few hundred of the 1,350 predictors actually have a significant effect.
Our model is mainly based on Yengo, Jacques, Biernacki & Canouil (2016): “Variable Clustering in High-Dimensional Linear Regression: The R Package clere”, The R Journal, 8 (1), 92-106 and Moser, Lee, Hayes, Goddard, Wray & Visscher (2015), “Simultaneous Discovery, Estimation and Prediction Analysis of Complex Traits Using a Bayesian Mixture Model”, PLoS Genet, 11 (4), 1-22.
The first paper assumes different values for the mus and a common sigma across components. It is not a fully Bayesian framework but an empirical Bayes one, estimated via an EM algorithm, which we would extend to a fully Bayesian treatment. The second paper restricts the mu of every component to 0 and assumes different sigmas across components (i.e., only one sigma parameter is estimated, which is multiplied by a component-specific factor). We, however, are interested in both varying mus and varying sigmas.
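To make this concrete, the prior we have in mind for each coefficient $b_k$ is a $G$-component normal mixture whose last component has its mean pinned to zero:

$$
b_k \sim \sum_{g=1}^{G-1} \theta_g \, \mathcal{N}\!\left(\mu_g, \sigma_g^2\right) + \theta_G \, \mathcal{N}\!\left(0, \sigma_G^2\right), \qquad \sum_{g=1}^{G} \theta_g = 1,
$$

where the means $\mu_g$, scales $\sigma_g$, and mixture weights $\theta_g$ are all estimated. This corresponds to the Stan model below with $G = 3$.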
With some toy data, the first version of the model, with two components plus an additional component whose mu is restricted to 0, runs OK most of the time.
The only problem so far is that I sometimes face weird sampling problems after varying the toy data, i.e., after rerunning the code below without setting any seeds. After setting the seed manually, the model runs perfectly for some seed values, while for others the n_eff and Rhat values of the mixture parameters are awful.
I use RStudio with the rstan package on Windows 10. The session info is:
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_Germany.1252 LC_CTYPE=English_Germany.1252 LC_MONETARY=English_Germany.1252
[4] LC_NUMERIC=C LC_TIME=English_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.2 readr_1.3.1
[6] tidyr_0.8.3 tibble_2.1.3 tidyverse_1.2.1 rstan_2.19.2 ggplot2_3.2.0
[11] StanHeaders_2.18.1-10
loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 haven_2.1.1 lattice_0.20-38 colorspace_1.4-1 generics_0.0.2 vctrs_0.2.0
[7] stats4_3.6.1 loo_2.1.0 rlang_0.4.0 pkgbuild_1.0.3 pillar_1.4.2 glue_1.3.1
[13] withr_2.1.2 readxl_1.3.1 modelr_0.1.4 matrixStats_0.54.0 cellranger_1.1.0 munsell_0.5.0
[19] gtable_0.3.0 rvest_0.3.4 codetools_0.2-16 inline_0.3.15 callr_3.3.1 ps_1.3.0
[25] parallel_3.6.1 broom_0.5.2 Rcpp_1.0.2 scales_1.0.0 backports_1.1.4 jsonlite_1.6
[31] gridExtra_2.3 hms_0.5.0 stringi_1.4.3 processx_3.4.1 grid_3.6.1 cli_1.1.0
[37] tools_3.6.1 magrittr_1.5 lazyeval_0.2.2 crayon_1.3.4 pkgconfig_2.0.2 zeallot_0.1.0
[43] xml2_1.2.1 prettyunits_1.0.2 lubridate_1.7.4 assertthat_0.2.1 httr_1.4.0 rstudioapi_0.10
[49] R6_2.4.0 nlme_3.1-140 compiler_3.6.1
The code for the data and the model I run is below. I know that the (hyper-)priors are pretty weakly informative, but it works fine this way.
library(rstan)
library(tidyverse)
# Run a clusterwise-effects regression with a zero-mean component for null coefficients,
# a high number of regressors, and a separate sigma for each mixture component
options(max.print = 100000,
        scipen = 999999)
# Generate data
set.seed(1) # data seed; I varied this from set.seed(1) to set.seed(10)
n <- 500 #observations
p <- 300 #number of x variables excl. intercept
real_p <- 150 #number of "true" predictors which have an effect
mu_mix1 <- -5
mu_mix2 <- 3
sigma_mix1 <- 2
sigma_mix2 <- 1
sigma_mix3 <- 0.2
sigma <- 2
X <- matrix(rnorm(n*p, 10, 1), nrow=n, ncol=p)
X <- cbind(1, X) %>% data.matrix()
b <- c(5,
       rnorm(real_p/2, mu_mix1, sigma_mix1), # real_p/2 predictors with a negative effect around mu_mix1
       rnorm(real_p/2, mu_mix2, sigma_mix2), # real_p/2 predictors with a positive effect around mu_mix2
       rnorm(p-real_p, 0, sigma_mix3))       # all other predictors have an effect around 0
# implies theta proportions: component 1 = 0.25, component 2 = 0.25, component 3 = 0.5
y <- as.numeric(X %*% b + rnorm(n, 0, sigma))
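# quick sanity check of the proportions noted above (not in the original script):
round(c(real_p/2, real_p/2, p - real_p) / p, 2) # 0.25 0.25 0.50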
#Stan data
standata <- list(N = nrow(X),
                 K = ncol(X[,-1]),
                 Y = as.numeric(y),
                 X = X[,-1],
                 G = 3)
# Model in Stan
stan_code_regression <- "data {
  int<lower=1> N;  // number of observations
  vector[N] Y;     // response variable
  int<lower=1> K;  // number of predictors
  matrix[N, K] X;  // predictor matrix
  int<lower=1> G;  // number of groups
}
transformed data {
  matrix[N, K] Xc;    // centered version of X
  vector[K] means_X;  // column means of X before centering
  vector[N] Yc;       // centered version of Y
  real mean_Y;        // mean of Y before centering
  for (i in 1:K) {
    means_X[i] = mean(X[, i]);
    Xc[, i] = X[, i] - means_X[i];
  }
  mean_Y = mean(Y);
  Yc = Y - mean_Y;
}
parameters {
  vector[K] b;                                // coefficients
  simplex[G] theta;                           // theta for the mixture
  ordered[G-1] mu_mix;                        // mus for the mixture
  vector[G-1] mu_mix_location_raw;            // location for mu_mix
  vector<lower=0>[G-1] mu_mix_scale_raw;      // scale for mu_mix
  vector<lower=0>[G] sigma_mix_raw;           // sigma for the mixture
  vector<lower=0>[G] sigma_mix_location_raw;  // location for sigma_mix
  vector<lower=0>[G] sigma_mix_scale_raw;     // scale for sigma_mix
  real<lower=0> sigma_raw;                    // residual SD
  real<lower=0> sigma_location_raw;           // location for sigma
  real<lower=0> sigma_scale_raw;              // scale for sigma
}
transformed parameters {
  vector[G] sigma_mix;
  real sigma;
  vector[G-1] mu_mix_location;
  vector[G] sigma_mix_location;
  real sigma_location;
  vector[G-1] mu_mix_scale;
  vector[G] sigma_mix_scale;
  real sigma_scale;
  // non-centered parameterization
  // (1) locations
  mu_mix_location = 15 * mu_mix_location_raw;
  sigma_mix_location = 15 * sigma_mix_location_raw;
  sigma_location = 15 * sigma_location_raw;
  // (2) scales
  mu_mix_scale = 5 + 5 * mu_mix_scale_raw;
  sigma_mix_scale = 5 + 5 * sigma_mix_scale_raw;
  sigma_scale = 5 + 5 * sigma_scale_raw;
  // sigma_mix & sigma
  sigma_mix = sigma_mix_location + sigma_mix_scale .* sigma_mix_raw;
  sigma = sigma_location + sigma_scale * sigma_raw;
}
model {
  // likelihood
  target += normal_lpdf(Yc | Xc * b, sigma);
  // priors
  // coefficients vector: marginalize each b[k] over the mixture components
  for (k in 1:K) {
    vector[G] ps;
    ps[1] = log(theta[1]) + normal_lpdf(b[k] | mu_mix[1], sigma_mix[1]);
    ps[2] = log(theta[2]) + normal_lpdf(b[k] | mu_mix[2], sigma_mix[2]);
    ps[3] = log(theta[3]) + normal_lpdf(b[k] | 0, sigma_mix[3]);
    target += log_sum_exp(ps);
  }
  // residual sigma
  target += std_normal_lpdf(sigma_raw);
  // hyperpriors
  // (1) mu_mix
  // (1.1) mu_mix hyperprior
  target += normal_lpdf(mu_mix | mu_mix_location, mu_mix_scale);
  // (1.2) hyperpriors for mu_mix location and scale
  target += std_normal_lpdf(mu_mix_location_raw);
  target += std_normal_lpdf(mu_mix_scale_raw);
  // (2) sigma_mix
  // (2.1) non-centered sigma_mix hyperprior
  target += std_normal_lpdf(sigma_mix_raw);
  // (2.2) hyperpriors for sigma_mix location and scale
  target += std_normal_lpdf(sigma_mix_location_raw);
  target += std_normal_lpdf(sigma_mix_scale_raw);
  // (3) hyperprior for theta
  target += dirichlet_lpdf(theta | rep_vector(5, G));
  // (4) hyperpriors for sigma location and scale
  target += std_normal_lpdf(sigma_location_raw);
  target += std_normal_lpdf(sigma_scale_raw);
}"
# compile once, then sample from the compiled model (avoids the second compilation
# that stan(model_code = ...) would trigger after stan_model())
stan_model <- stan_model(model_code = stan_code_regression, model_name = "seed1")
fit_regression <- sampling(stan_model,
                           data = standata,
                           chains = 4,
                           cores = 4,
                           iter = 4000,
                           warmup = 2000,
                           control = list(adapt_delta = 0.95,
                                          stepsize = 0.01))
print(fit_regression)
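I’ve run the same model for 10 different seed values for the data generation. What I did amounts to the following sketch (gen_data() is an illustrative helper that just wraps the data-generating code from above; it is not part of my original script):

gen_data <- function(seed, n = 500, p = 300, real_p = 150) {
  set.seed(seed)
  X <- cbind(1, matrix(rnorm(n * p, 10, 1), nrow = n, ncol = p))
  b <- c(5,
         rnorm(real_p/2, -5, 2),    # component 1: mu_mix1, sigma_mix1
         rnorm(real_p/2, 3, 1),     # component 2: mu_mix2, sigma_mix2
         rnorm(p - real_p, 0, 0.2)) # component 3: effects around 0
  y <- as.numeric(X %*% b + rnorm(n, 0, 2))
  list(N = n, K = p, Y = y, X = X[, -1], G = 3)
}

max_rhat <- sapply(1:10, function(s) {
  fit <- sampling(stan_model, data = gen_data(s),
                  chains = 4, cores = 4, iter = 4000, warmup = 2000,
                  control = list(adapt_delta = 0.95, stepsize = 0.01))
  max(summary(fit)$summary[, "Rhat"], na.rm = TRUE)
})
max_rhat # values close to 1 mark the seeds that sample fine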
For set.seed(1) everything runs perfectly and the results look like this:
Inference for Stan model: seed1.
4 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=8000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta[1] 0.26 0.00 0.03 0.21 0.24 0.26 0.27 0.31 11394 1
theta[2] 0.25 0.00 0.03 0.20 0.23 0.25 0.27 0.30 11382 1
theta[3] 0.49 0.00 0.03 0.44 0.47 0.49 0.51 0.55 11648 1
mu_mix[1] -4.62 0.00 0.27 -5.10 -4.80 -4.63 -4.45 -4.05 5443 1
mu_mix[2] 2.95 0.00 0.13 2.66 2.86 2.96 3.04 3.20 7054 1
sigma_mix[1] 1.92 0.00 0.23 1.55 1.75 1.89 2.05 2.46 5811 1
sigma_mix[2] 0.96 0.00 0.12 0.76 0.88 0.94 1.02 1.22 5895 1
sigma_mix[3] 0.22 0.00 0.02 0.18 0.20 0.22 0.23 0.26 4437 1
lp__ -1649.09 0.43 18.01 -1685.80 -1661.09 -1648.58 -1636.74 -1615.01 1777 1
Samples were drawn using NUTS(diag_e) at Mon Aug 26 18:42:01 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
For set.seed(2) I get the following warning messages:
Warning messages:
1: The largest R-hat is 1.53, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
2: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
3: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
The results for the mixture parameters are awful:
Inference for Stan model: seed2.
4 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=8000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta[1] 0.31 0.06 0.09 0.21 0.25 0.27 0.35 0.49 2 3.57
theta[2] 0.23 0.03 0.05 0.12 0.20 0.25 0.27 0.30 3 2.08
theta[3] 0.46 0.03 0.05 0.35 0.43 0.47 0.50 0.54 3 1.69
mu_mix[1] -4.05 1.67 2.37 -5.87 -5.53 -5.32 -3.14 0.05 2 11.23
mu_mix[2] 2.94 0.04 0.15 2.67 2.84 2.94 3.03 3.27 14 1.09
sigma_mix[1] 1.58 0.57 0.82 0.17 1.12 1.94 2.12 2.47 2 4.91
sigma_mix[2] 0.95 0.08 0.15 0.60 0.86 0.96 1.05 1.23 4 1.39
sigma_mix[3] 1.39 1.45 2.06 0.18 0.20 0.22 1.20 5.42 2 12.09
lp__ -1703.37 15.77 27.91 -1763.05 -1721.83 -1698.18 -1683.01 -1659.04 3 1.64
Samples were drawn using NUTS(diag_e) at Mon Aug 26 18:26:55 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
For set.seed(3) it’s the same picture: the identical three warnings about R-hat, bulk ESS, and tail ESS, with the following results:
Inference for Stan model: seed3.
4 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=8000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta[1] 0.23 0.03 0.05 0.12 0.20 0.24 0.27 0.30 3 2.07
theta[2] 0.30 0.05 0.08 0.21 0.25 0.27 0.34 0.47 2 3.18
theta[3] 0.47 0.02 0.05 0.36 0.44 0.48 0.50 0.54 3 1.51
mu_mix[1] -5.20 0.17 0.32 -5.96 -5.36 -5.15 -4.98 -4.69 4 1.48
mu_mix[2] 2.27 0.94 1.33 -0.05 1.73 2.98 3.09 3.28 2 11.40
sigma_mix[1] 1.62 0.11 0.23 1.09 1.49 1.64 1.76 2.04 4 1.35
sigma_mix[2] 0.88 0.29 0.42 0.15 0.64 1.06 1.16 1.37 2 4.35
sigma_mix[3] 1.03 1.01 1.44 0.17 0.19 0.21 0.87 3.87 2 10.64
lp__ -1659.65 13.89 25.81 -1714.80 -1676.34 -1655.82 -1640.46 -1617.75 3 1.52
Samples were drawn using NUTS(diag_e) at Mon Aug 26 19:04:36 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
set.seed(4) runs fine again:
Inference for Stan model: seed4.
4 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=8000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta[1] 0.25 0.00 0.03 0.20 0.23 0.25 0.26 0.30 11996 1
theta[2] 0.26 0.00 0.03 0.21 0.24 0.25 0.27 0.31 13237 1
theta[3] 0.50 0.00 0.03 0.44 0.48 0.50 0.52 0.55 13928 1
mu_mix[1] -5.48 0.00 0.36 -6.12 -5.72 -5.49 -5.26 -4.70 7267 1
mu_mix[2] 2.88 0.00 0.13 2.61 2.79 2.88 2.96 3.12 8380 1
sigma_mix[1] 2.46 0.00 0.29 2.00 2.25 2.42 2.63 3.11 6580 1
sigma_mix[2] 1.01 0.00 0.11 0.83 0.93 1.00 1.07 1.26 7047 1
sigma_mix[3] 0.22 0.00 0.02 0.18 0.20 0.22 0.23 0.25 5621 1
lp__ -1659.39 0.41 17.68 -1694.90 -1671.06 -1658.98 -1647.36 -1625.58 1827 1
Samples were drawn using NUTS(diag_e) at Mon Aug 26 20:02:47 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
And so on and so forth. Is there anything I can do here? Of course I want the model to be robust against these small variations in the data, especially since, if fitting issues arise with our original data, I won’t be able to tell whether they are due to the Stan program I wrote or whether the data just happen to reflect a “wrong seed”.
I was actually able to fit the data above by varying the seed for the Stan program itself. E.g., with set.seed(2) in R and seed = 3 in Stan, the model runs perfectly.
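The only change is passing an explicit seed to rstan::sampling(), reusing the compiled model from above:

fit_regression <- sampling(stan_model,
                           data = standata,
                           seed = 3,
                           chains = 4, cores = 4, iter = 4000, warmup = 2000,
                           control = list(adapt_delta = 0.95, stepsize = 0.01))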
Inference for Stan model: seed2.
4 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=8000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta[1] 0.26 0.00 0.02 0.21 0.24 0.26 0.27 0.31 14650 1
theta[2] 0.26 0.00 0.02 0.21 0.24 0.26 0.27 0.31 13737 1
theta[3] 0.49 0.00 0.03 0.43 0.47 0.49 0.51 0.54 13892 1
mu_mix[1] -5.41 0.00 0.25 -5.89 -5.58 -5.42 -5.26 -4.90 7648 1
mu_mix[2] 2.91 0.00 0.13 2.65 2.83 2.91 3.00 3.15 7827 1
sigma_mix[1] 2.04 0.00 0.20 1.70 1.90 2.02 2.16 2.49 7251 1
sigma_mix[2] 1.01 0.00 0.10 0.83 0.93 1.00 1.07 1.23 7526 1
sigma_mix[3] 0.21 0.00 0.02 0.18 0.20 0.21 0.22 0.25 5489 1
lp__ -1691.32 0.43 18.24 -1728.78 -1703.09 -1690.84 -1678.95 -1656.87 1831 1
Samples were drawn using NUTS(diag_e) at Mon Aug 26 21:38:20 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Would there be anything that speaks against fitting the model multiple times with different random seeds in Stan until I achieve Rhat = 1?
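For concreteness, I mean something like the following (purely illustrative; refit_until_ok() is a hypothetical helper, and whether this is a sound workflow is exactly my question):

refit_until_ok <- function(model, data, max_tries = 5, rhat_tol = 1.01) {
  for (i in seq_len(max_tries)) {
    fit <- sampling(model, data = data,
                    seed = sample.int(.Machine$integer.max, 1),
                    chains = 4, cores = 4, iter = 4000, warmup = 2000,
                    control = list(adapt_delta = 0.95, stepsize = 0.01))
    # stop at the first run whose largest Rhat is below the tolerance
    if (max(summary(fit)$summary[, "Rhat"], na.rm = TRUE) < rhat_tol) return(fit)
  }
  warning("No run reached the Rhat tolerance.")
  fit
}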
Thanks
Dirk