Multimodality issues in regression model with mixture prior

Hey everyone,

I’m still at the beginning of learning Bayesian statistics and Stan, so please excuse me if something in my post or code makes little or no sense :) I’m pretty sure my code is not the cleanest or most efficient code possible, but I tried my best.

For a research project, we are trying to fit a linear regression model with a mixture prior on the coefficient vector. The aim of the project is to identify patterns in the coefficients and to find clusters of variables that have a similar effect. So we want to estimate both the coefficients and the mu and sigma parameters of the mixture components. In addition, we include one mixture component with its mean restricted to 0, so that variables without an effect are clustered into this component. Our original data set consists of around 3,700 observations and around 1,350 predictors on the same scale, plus a handful of additional control variables which we would exclude from the clustering. We assume that only a few hundred of the 1,350 predictors will actually have a significant effect.
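In symbols (using the same names as in the Stan code further down, so this is just my own way of writing it out), the model is a standard linear regression

y = X * b + epsilon,   epsilon ~ Normal(0, sigma),

with a mixture prior on each coefficient b_k:

p(b_k) = theta[1] * Normal(b_k | mu_mix[1], sigma_mix[1])
       + theta[2] * Normal(b_k | mu_mix[2], sigma_mix[2])
       + theta[3] * Normal(b_k | 0, sigma_mix[3]),

where theta is a simplex of mixture weights and the last component, with its mean fixed at 0, is supposed to absorb the variables without an effect.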

Our model is mainly based on Yengo, Jacques, Biernacki & Canouil (2016): “Variable Clustering in High-Dimensional Linear Regression: The R Package clere”, The R Journal, 8 (1), 92-106 and Moser, Lee, Hayes, Goddard, Wray & Visscher (2015), “Simultaneous Discovery, Estimation and Prediction Analysis of Complex Traits Using a Bayesian Mixture Model”, PLoS Genet, 11 (4), 1-22.
The first paper assumes different values for the mus and the same sigma for all components. It is not a fully Bayesian but an empirical Bayesian framework estimated via an EM algorithm, which we would like to extend to a fully Bayesian framework. The second paper restricts the mu of all components to 0 and assumes different sigmas for the components (i.e., only a single sigma parameter is estimated, which is multiplied by a factor that varies across components). However, we are interested in both varying mus and varying sigmas.

With some toy data, the first version of the model, with two components plus an additional component whose mu is restricted to 0, runs OK most of the time.
The only problem so far is that I sometimes run into weird sampling problems after varying the toy data, i.e., after rerunning the code below without setting any seed. After setting the seed manually, the model runs perfectly for some seed values, while for others the n_eff and Rhat values are awful for the mixture parameters.

I use RStudio with the rstan package on Windows 10. The session info is the following:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_Germany.1252  LC_CTYPE=English_Germany.1252    LC_MONETARY=English_Germany.1252
[4] LC_NUMERIC=C                     LC_TIME=English_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0         stringr_1.4.0         dplyr_0.8.3           purrr_0.3.2           readr_1.3.1          
 [6] tidyr_0.8.3           tibble_2.1.3          tidyverse_1.2.1       rstan_2.19.2          ggplot2_3.2.0        
[11] StanHeaders_2.18.1-10

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5   haven_2.1.1        lattice_0.20-38    colorspace_1.4-1   generics_0.0.2     vctrs_0.2.0       
 [7] stats4_3.6.1       loo_2.1.0          rlang_0.4.0        pkgbuild_1.0.3     pillar_1.4.2       glue_1.3.1        
[13] withr_2.1.2        readxl_1.3.1       modelr_0.1.4       matrixStats_0.54.0 cellranger_1.1.0   munsell_0.5.0     
[19] gtable_0.3.0       rvest_0.3.4        codetools_0.2-16   inline_0.3.15      callr_3.3.1        ps_1.3.0          
[25] parallel_3.6.1     broom_0.5.2        Rcpp_1.0.2         scales_1.0.0       backports_1.1.4    jsonlite_1.6      
[31] gridExtra_2.3      hms_0.5.0          stringi_1.4.3      processx_3.4.1     grid_3.6.1         cli_1.1.0         
[37] tools_3.6.1        magrittr_1.5       lazyeval_0.2.2     crayon_1.3.4       pkgconfig_2.0.2    zeallot_0.1.0     
[43] xml2_1.2.1         prettyunits_1.0.2  lubridate_1.7.4    assertthat_0.2.1   httr_1.4.0         rstudioapi_0.10   
[49] R6_2.4.0           nlme_3.1-140       compiler_3.6.1

The code for the data and for the model I run is the following. I know that the (hyper)priors are pretty weakly informative, but it works fine this way.

library(rstan)
library(tidyverse)

# Run a clusterwise-effects regression with a high number of regressors, including coefficients equal to zero,
# and varying mu and sigma for each mixture component
options(max.print = 100000,
        scipen = 999999)

# Generate data

set.seed(1) #set.seed(1) – set.seed(10)
n <- 500 #observations
p <- 300 #number of x variables excl. intercept
real_p <- 150 #number of "true" predictors which have an effect

mu_mix1 <- -5
mu_mix2 <- 3
sigma_mix1 <- 2
sigma_mix2 <- 1
sigma_mix3 <- 0.2
sigma <- 2

X <- matrix(rnorm(n*p, 10, 1), nrow=n, ncol=p)
X <- cbind(1, X) %>% data.matrix()
b <- c(5, 
       rnorm(real_p/2, mu_mix1, sigma_mix1), #real_p/2 predictors have a negative effect around mu_mix1,
       rnorm(real_p/2, mu_mix2, sigma_mix2), #real_p/2 predictors have a positive effect around mu_mix2,
       rnorm(p-real_p, 0, sigma_mix3)) #all the other predictors have an effect around 0
#implies theta proportions: component 1 = 0.25, component 2 = 0.25, component 3 = 0.5

y <- as.numeric(X %*% b + rnorm(n, 0, sigma))


#Stan data
standata = list(N = nrow(X),
                K = ncol(X[,-1]),
                Y = as.numeric(y),
                X = X[,-1],
                G = 3)

#  Model in Stan


stan_code_regression="data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of predictors
  matrix[N, K] X;  // predictor matrix
  int<lower=1> G; // number of groups
}
transformed data {
  matrix[N, K] Xc;  // centered version of X
  vector[K] means_X;  // column means of X before centering
  vector[N] Yc; // centered version of Y
  real mean_Y;  // mean of Y before centering
  for (i in 1:K) {
    means_X[i] = mean(X[, i]);
    Xc[, i] = X[, i] - means_X[i];
  }
  mean_Y = mean(Y);
  Yc = Y - mean_Y;
}
parameters {
  vector [K] b;  // coefficients
  simplex [G] theta; // theta for the mixture
  ordered [G-1] mu_mix; // mus for the mixture
  vector [G-1] mu_mix_location_raw; // location for mu_mix
  vector <lower=0> [G-1] mu_mix_scale_raw; // scale for mu_mix
  vector <lower=0> [G] sigma_mix_raw; //sigma for the mixture
  vector <lower=0> [G] sigma_mix_location_raw; // location for sigma_mix
  vector <lower=0> [G] sigma_mix_scale_raw; // scale for sigma_mix
  real <lower=0> sigma_raw;  // residual SD
  real <lower=0> sigma_location_raw;  // location for sigma
  real <lower=0> sigma_scale_raw;  // scale for sigma
}
transformed parameters {
  vector[G] sigma_mix;
  real sigma;
  vector[G-1] mu_mix_location;
  vector[G] sigma_mix_location;
  real sigma_location;
  vector[G-1] mu_mix_scale;
  vector[G] sigma_mix_scale;
  real sigma_scale;
  
  //non centered parameterization
   
  //(1) locations
  
  mu_mix_location = 15 * mu_mix_location_raw;
  sigma_mix_location = 15 * sigma_mix_location_raw;
  sigma_location =  15 * sigma_location_raw;
  
  //(2) scales
  
  mu_mix_scale = 5 + 5 * mu_mix_scale_raw;
  sigma_mix_scale = 5 + 5 * sigma_mix_scale_raw;
  sigma_scale = 5 + 5 * sigma_scale_raw;
  
  
  
  //sigma_mix & sigma
  sigma_mix = sigma_mix_location + sigma_mix_scale .* sigma_mix_raw;
  sigma = sigma_location + sigma_scale * sigma_raw;
}
model {
  // likelihood
  target += normal_lpdf(Yc | Xc * b, sigma);
  
  
  
  //priors
  
  //coefficients vector
  for (k in 1:K) {
    vector [G] ps;
    ps[1] = log(theta[1]) + normal_lpdf(b[k] | mu_mix[1], sigma_mix[1]);
    ps[2] = log(theta[2]) + normal_lpdf(b[k] | mu_mix[2], sigma_mix[2]);
    ps[3] = log(theta[3]) + normal_lpdf(b[k] | 0, sigma_mix[3]);
    target += log_sum_exp(ps);
  }
  
  //residual sigma
  target += std_normal_lpdf(sigma_raw);


  
  //hyperpriors
  
  //(1) mu_mix
  
  //(1.1) mu_mix hyperprior
  target += normal_lpdf(mu_mix | mu_mix_location, mu_mix_scale);
  
  //(1.2) hyperpriors for mu_mix location and scale
  target += std_normal_lpdf(mu_mix_location_raw);
  target += std_normal_lpdf(mu_mix_scale_raw);
  
  //(2) sigma_mix
  
  //(2.1) non-centered sigma_mix hyperprior
  target += std_normal_lpdf(sigma_mix_raw);
  
  //(2.2) hyperpriors for sigma_mix location and scale
  target += std_normal_lpdf(sigma_mix_location_raw);
  target += std_normal_lpdf(sigma_mix_scale_raw);
  
  //(3) hyperprior for theta
  target += dirichlet_lpdf(theta | rep_vector(5, G));
  
  //(4) hyperpriors for sigma location and scale
  target += std_normal_lpdf(sigma_location_raw);
  target += std_normal_lpdf(sigma_scale_raw);
}"

stan_model <- stan_model(model_code = stan_code_regression,
                         model_name = "seed1")

fit_regression <- sampling(stan_model,
                           data = standata,
                           chains = 4,
                           cores = 4,
                           iter = 4000,
                           warmup = 2000,
                           control = list(adapt_delta = 0.95,
                                          stepsize = 0.01))

print(fit_regression)
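For comparing the runs across seeds, here is a small sketch of how the worst-case Rhat and n_eff can be pulled out of a fit programmatically (nothing fancy, just the stanfit summary):

# Worst-case convergence diagnostics for the fit above
# (summary() of a stanfit returns a matrix with n_eff and Rhat columns)
fit_summary <- summary(fit_regression)$summary
max(fit_summary[, "Rhat"], na.rm = TRUE)   # should be close to 1
min(fit_summary[, "n_eff"], na.rm = TRUE)  # should not be tiny relative to the number of draws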

I’ve run the same model for 10 different seed values. For set.seed(1) everything runs perfectly and the results look like the following:

Inference for Stan model: seed1.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

                              mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
theta[1]                      0.26    0.00  0.03     0.21     0.24     0.26     0.27     0.31 11394    1
theta[2]                      0.25    0.00  0.03     0.20     0.23     0.25     0.27     0.30 11382    1
theta[3]                      0.49    0.00  0.03     0.44     0.47     0.49     0.51     0.55 11648    1
mu_mix[1]                    -4.62    0.00  0.27    -5.10    -4.80    -4.63    -4.45    -4.05  5443    1
mu_mix[2]                     2.95    0.00  0.13     2.66     2.86     2.96     3.04     3.20  7054    1
sigma_mix[1]                  1.92    0.00  0.23     1.55     1.75     1.89     2.05     2.46  5811    1
sigma_mix[2]                  0.96    0.00  0.12     0.76     0.88     0.94     1.02     1.22  5895    1
sigma_mix[3]                  0.22    0.00  0.02     0.18     0.20     0.22     0.23     0.26  4437    1
lp__                      -1649.09    0.43 18.01 -1685.80 -1661.09 -1648.58 -1636.74 -1615.01  1777    1

Samples were drawn using NUTS(diag_e) at Mon Aug 26 18:42:01 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

For set.seed(2) I get the following warning message:

Warning messages:
1: The largest R-hat is 1.53, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
2: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
3: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

The results for the mixture parameters are awful:

Inference for Stan model: seed2.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

                              mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
theta[1]                      0.31    0.06  0.09     0.21     0.25     0.27     0.35     0.49     2  3.57
theta[2]                      0.23    0.03  0.05     0.12     0.20     0.25     0.27     0.30     3  2.08
theta[3]                      0.46    0.03  0.05     0.35     0.43     0.47     0.50     0.54     3  1.69
mu_mix[1]                    -4.05    1.67  2.37    -5.87    -5.53    -5.32    -3.14     0.05     2 11.23
mu_mix[2]                     2.94    0.04  0.15     2.67     2.84     2.94     3.03     3.27    14  1.09
sigma_mix[1]                  1.58    0.57  0.82     0.17     1.12     1.94     2.12     2.47     2  4.91
sigma_mix[2]                  0.95    0.08  0.15     0.60     0.86     0.96     1.05     1.23     4  1.39
sigma_mix[3]                  1.39    1.45  2.06     0.18     0.20     0.22     1.20     5.42     2 12.09
lp__                      -1703.37   15.77 27.91 -1763.05 -1721.83 -1698.18 -1683.01 -1659.04     3  1.64

Samples were drawn using NUTS(diag_e) at Mon Aug 26 18:26:55 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

For set.seed(3) it’s the same picture…

Warning messages:
1: The largest R-hat is 1.53, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
2: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
3: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

…with the following results:

Inference for Stan model: seed3.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

                              mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
theta[1]                      0.23    0.03  0.05     0.12     0.20     0.24     0.27     0.30     3  2.07
theta[2]                      0.30    0.05  0.08     0.21     0.25     0.27     0.34     0.47     2  3.18
theta[3]                      0.47    0.02  0.05     0.36     0.44     0.48     0.50     0.54     3  1.51
mu_mix[1]                    -5.20    0.17  0.32    -5.96    -5.36    -5.15    -4.98    -4.69     4  1.48
mu_mix[2]                     2.27    0.94  1.33    -0.05     1.73     2.98     3.09     3.28     2 11.40
sigma_mix[1]                  1.62    0.11  0.23     1.09     1.49     1.64     1.76     2.04     4  1.35
sigma_mix[2]                  0.88    0.29  0.42     0.15     0.64     1.06     1.16     1.37     2  4.35
sigma_mix[3]                  1.03    1.01  1.44     0.17     0.19     0.21     0.87     3.87     2 10.64
lp__                      -1659.65   13.89 25.81 -1714.80 -1676.34 -1655.82 -1640.46 -1617.75     3  1.52

Samples were drawn using NUTS(diag_e) at Mon Aug 26 19:04:36 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

With set.seed(4) the model runs fine again:

Inference for Stan model: seed4.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

                              mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
theta[1]                      0.25    0.00  0.03     0.20     0.23     0.25     0.26     0.30 11996    1
theta[2]                      0.26    0.00  0.03     0.21     0.24     0.25     0.27     0.31 13237    1
theta[3]                      0.50    0.00  0.03     0.44     0.48     0.50     0.52     0.55 13928    1
mu_mix[1]                    -5.48    0.00  0.36    -6.12    -5.72    -5.49    -5.26    -4.70  7267    1
mu_mix[2]                     2.88    0.00  0.13     2.61     2.79     2.88     2.96     3.12  8380    1
sigma_mix[1]                  2.46    0.00  0.29     2.00     2.25     2.42     2.63     3.11  6580    1
sigma_mix[2]                  1.01    0.00  0.11     0.83     0.93     1.00     1.07     1.26  7047    1
sigma_mix[3]                  0.22    0.00  0.02     0.18     0.20     0.22     0.23     0.25  5621    1
lp__                      -1659.39    0.41 17.68 -1694.90 -1671.06 -1658.98 -1647.36 -1625.58  1827    1

Samples were drawn using NUTS(diag_e) at Mon Aug 26 20:02:47 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

And so on and so forth. Is there anything I can do here? Of course I want my model to be robust against these small variations in the data, especially since, if I run into fitting issues with our original data, I won’t be able to tell whether they are due to the Stan program I wrote or whether the data just happen to reflect the “wrong seed”.

I actually could fit the data above by varying the seed passed to Stan. E.g., with set.seed(2) in R and seed = 3 in Stan, the model runs perfectly (see the sketched call and the output below):
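The only change compared to the call above is the additional seed argument (a sketch, not the exact script I ran):

set.seed(2)  # seed for generating the toy data in R
# ... regenerate X, b, y and standata exactly as above ...

fit_regression <- sampling(stan_model,
                           data = standata,
                           seed = 3,  # seed for the Stan sampler
                           chains = 4,
                           cores = 4,
                           iter = 4000,
                           warmup = 2000,
                           control = list(adapt_delta = 0.95,
                                          stepsize = 0.01))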

Inference for Stan model: seed2.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

                              mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
theta[1]                      0.26    0.00  0.02     0.21     0.24     0.26     0.27     0.31 14650    1
theta[2]                      0.26    0.00  0.02     0.21     0.24     0.26     0.27     0.31 13737    1
theta[3]                      0.49    0.00  0.03     0.43     0.47     0.49     0.51     0.54 13892    1
mu_mix[1]                    -5.41    0.00  0.25    -5.89    -5.58    -5.42    -5.26    -4.90  7648    1
mu_mix[2]                     2.91    0.00  0.13     2.65     2.83     2.91     3.00     3.15  7827    1
sigma_mix[1]                  2.04    0.00  0.20     1.70     1.90     2.02     2.16     2.49  7251    1
sigma_mix[2]                  1.01    0.00  0.10     0.83     0.93     1.00     1.07     1.23  7526    1
sigma_mix[3]                  0.21    0.00  0.02     0.18     0.20     0.21     0.22     0.25  5489    1
lp__                      -1691.32    0.43 18.24 -1728.78 -1703.09 -1690.84 -1678.95 -1656.87  1831    1

Samples were drawn using NUTS(diag_e) at Mon Aug 26 21:38:20 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Would there be anything that speaks against fitting the model multiple times with different random seeds in Stan until I achieve Rhat = 1?

Thanks
Dirk

OK, I’ve now run the model with init = 0 twice. The first run didn’t converge, the second run converged perfectly. So the problem shouldn’t be related to the initial values. Maybe this helps with narrowing down my problem :) Thanks for any ideas and help!
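For reference, the init = 0 runs were started roughly like this (a sketch; rstan also accepts a list or a function for init in case specific starting values would help):

# All parameters initialized at 0 on the unconstrained scale
fit_init0 <- sampling(stan_model,
                      data = standata,
                      init = 0,
                      chains = 4,
                      cores = 4,
                      iter = 4000,
                      warmup = 2000,
                      control = list(adapt_delta = 0.95,
                                     stepsize = 0.01))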

I don’t know the specific models, but it sounds like multimodality. For some starting points or random seeds all chains explore just one mode, but for others they explore more than one. If that’s the case, you can’t just report the results from the runs that appear to be fine, because in those runs you haven’t fully explored the posterior. The model must be fixed somehow.
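A quick way to see whether that is happening is to summarize each chain separately and check whether, e.g., the mixture locations disagree across chains. Something like this (untested sketch, using your parameter names):

# Per-chain posterior means: one row per chain, one column per parameter.
# Rows that differ strongly from each other point to chains stuck in different modes.
draws <- as.array(fit_regression, pars = c("mu_mix", "sigma_mix", "theta"))
round(apply(draws, c(2, 3), mean), 2)

Trace plots (e.g., traceplot(fit_regression, pars = "mu_mix")) usually make this visible as well.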

Thanks for your answer! Following the case study “Identifying Bayesian Mixture Models”, I had already defined the mus of the mixture components as an ordered vector. Unfortunately, that doesn’t really seem to solve the problem for me.

I also checked the posteriors in ShinyStan. For a model that didn’t converge, the multimodality is pretty obvious:

For a model that converged well, the posteriors look fine:

I then also tried estimating the third mu, which I had originally restricted to zero:

So instead of…

parameters {
  ...
  ordered [G-1] mu_mix; // mus for the mixture
  vector [G-1] mu_mix_location_raw; // location for mu_mix
  vector <lower=0> [G-1] mu_mix_scale_raw; // scale for mu_mix
  ...
}
model {
    ...
    for (k in 1:K) {
    vector [G] ps;
    ps[1] = log(theta[1]) + normal_lpdf(b[k] | mu_mix[1], sigma_mix[1]);
    ps[2] = log(theta[2]) + normal_lpdf(b[k] | mu_mix[2], sigma_mix[2]);
    ps[3] = log(theta[3]) + normal_lpdf(b[k] | 0, sigma_mix[3]);
    target += log_sum_exp(ps);
  }
   ...
}

…I’ve run:

parameters {
  ...
  ordered [G] mu_mix; // mus for the mixture
  vector [G] mu_mix_location_raw; // location for mu_mix
  vector <lower=0> [G] mu_mix_scale_raw; // scale for mu_mix
  ...
}
model {
    ...
    for (k in 1:K) {
    vector [G] ps;
    ps[1] = log(theta[1]) + normal_lpdf(b[k] | mu_mix[1], sigma_mix[1]);
    ps[2] = log(theta[2]) + normal_lpdf(b[k] | mu_mix[2], sigma_mix[2]);
    ps[3] = log(theta[3]) + normal_lpdf(b[k] | mu_mix[3], sigma_mix[3]);
    target += log_sum_exp(ps);
  }
   ...
}

I thought this might fix the problem because the previously restricted mu is now included in the ordered vector. Unfortunately, I still have the problem that the model sometimes converges and sometimes doesn’t.

I also tried defining the mu_mix_location hyperparameter as an ordered vector: the same problem. Furthermore, I directly defined the hyperprior for mu_mix as:

model {
    ...
      //(1.1) mu_mix hyperprior
  target += normal_lpdf(mu_mix | 0, 15);
   ...
}

… instead of using hyperpriors for the location and scale. The same problem :(