Trouble with Gaussian Mixture Model

I’m trying to fit a Gaussian mixture model following the approach here.

data {
  int D; // number of dimensions
  int K; // number of gaussians
  int N; // number of data
  vector[D] y[N]; // data
}

parameters {
  simplex[K] theta; // mixing proportions
  ordered[D] mu[K]; // mixture component means
  cholesky_factor_corr[D] L[K]; // cholesky factor of covariance
}

model {
  real ps[K];

  for (k in 1:K) {
    mu[k] ~ normal(0,3);
    L[k] ~ lkj_corr_cholesky(4);
  }

  for (n in 1:N) {

    for (k in 1:K) {
      //increment log probability of the gaussian
      ps[k] = log(theta[k]) + multi_normal_cholesky_lpdf(y[n] | mu[k], L[k]);
    }

    target += log_sum_exp(ps);
  }
}
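
To spell out the intent: for each observation the inner loop fills ps[k] with log(theta[k]) plus the component log density, so target += log_sum_exp(ps) adds log(sum_k theta[k] * MultiNormal(y[n] | mu[k], L[k] L[k]')), i.e. the mixture likelihood with the discrete component assignments marginalized out.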

I’ve been sanity checking on fake data as follows:

library(MASS)
library(rstan)
library(readr)
library(here)

n_per_clus <- 30
n_clus <- 3
d <- 4
sigma <- 0.1 * diag(d)

mu_1 <- rep(0, d)
mu_2 <- rep(3, d)
mu_3 <- rep(6, d)

clus_1 <- mvrnorm(n_per_clus, mu_1, sigma)
clus_2 <- mvrnorm(n_per_clus, mu_2, sigma)
clus_3 <- mvrnorm(n_per_clus, mu_3, sigma)

mixture_data <- list(
  y = rbind(clus_1, clus_2, clus_3),
  N = n_per_clus * n_clus,
  D = d,
  K = 3
)
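
As a quick visual check that the simulated clusters really are well separated (the colors just index the generating cluster):

pairs(mixture_data$y, col = rep(1:3, each = n_per_clus))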

gmm <- stan_model(here("stan", "gaussian_mixture_model.stan"))

fit <- sampling(
  gmm,
  data = mixture_data,
  chains = 3,
  cores = 2,
  iter = 1000,
  control = list(adapt_delta = 0.98)
)

print(fit)

which results in the following output:

Inference for Stan model: 5211a6688a83a26504506fd103be7703.
3 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=1500.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu[1,1] -0.16    0.00 0.10 -0.37 -0.22 -0.15 -0.09 -0.01   440 1.00
mu[1,2]  0.00    0.04 0.10 -0.19 -0.06 -0.01  0.05  0.24     6 1.16
mu[1,3]  0.07    0.03 0.09 -0.08  0.01  0.05  0.11  0.28     9 1.11
mu[1,4]  0.15    0.02 0.09  0.02  0.08  0.13  0.20  0.38    16 1.08
mu[2,1] -0.57    2.99 3.96 -7.02 -3.67 -1.60  4.40  4.64     2 2.47
mu[2,2]  0.94    2.12 2.86 -3.65 -1.45  0.29  4.42  4.66     2 2.25
mu[2,3]  2.10    1.47 2.22 -2.13  0.38  1.93  4.46  4.70     2 1.66
mu[2,4]  3.56    0.59 1.85 -0.45  2.36  4.29  4.61  7.04    10 1.10
mu[3,1]  1.96    2.93 3.76 -6.54 -1.49  4.38  4.51  4.70     2 3.25
mu[3,2]  2.74    2.05 2.67 -3.32  0.43  4.40  4.53  4.73     2 2.86
mu[3,3]  3.32    1.42 1.96 -1.46  2.13  4.42  4.55  4.76     2 2.11
mu[3,4]  4.07    0.58 1.35  0.33  4.25  4.51  4.63  6.09     6 1.19

Samples were drawn using NUTS(diag_e) at Sun Nov 25 19:10:44 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The Rhat values here are fairly concerning, and I can see label switching in the trace plots. I asked about this in the MSW Slack and was advised to impose an ordering on the means, which (I believe?) is already in place via the ordered[D] mu[K] line in the model code.

The other piece of advice I got was to initialize with kmeans or similar, so I tried:

gmm2 <- stan_model(here("stan", "gaussian_mixture_model2.stan"))

fit2 <- sampling(
  gmm2,
  data = mixture_data,
  chains = 3,
  cores = 2,
  iter = 1000,
  control = list(adapt_delta = 0.98),
  init = list(
    list(mu = rbind(mu_1, mu_2, mu_3)),
    list(mu = rbind(mu_1, mu_2, mu_3)),
    list(mu = rbind(mu_1, mu_2, mu_3))
  )
)

where this second model drops the ordered constraint on the means, because passing initial values for an ordered type was throwing an error about the wrong type. Even so, I still end up with:

Inference for Stan model: gaussian_mixture_model2.
3 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=1500.

            mean se_mean   sd    2.5%     25%     50%     75%
theta[1]    0.33    0.00 0.05    0.24    0.30    0.33    0.36
theta[2]    0.34    0.00 0.05    0.24    0.30    0.34    0.37
theta[3]    0.33    0.00 0.05    0.24    0.30    0.33    0.36
mu[1,1]    -0.04    0.01 0.17   -0.39   -0.16   -0.04    0.08
mu[1,2]     0.03    0.01 0.18   -0.31   -0.10    0.03    0.15
mu[1,3]     0.00    0.01 0.18   -0.35   -0.13    0.00    0.12
mu[1,4]     0.07    0.01 0.18   -0.28   -0.05    0.07    0.19
mu[2,1]     3.03    0.01 0.20    2.66    2.90    3.03    3.15
mu[2,2]     3.06    0.01 0.19    2.69    2.93    3.05    3.18
mu[2,3]     2.95    0.01 0.20    2.56    2.82    2.95    3.08
mu[2,4]     3.07    0.01 0.19    2.71    2.94    3.07    3.19
mu[3,1]     6.06    0.03 0.19    5.68    5.92    6.06    6.19
mu[3,2]     5.93    0.03 0.19    5.56    5.80    5.93    6.07
mu[3,3]     5.95    0.03 0.19    5.56    5.82    5.95    6.08
mu[3,4]     5.97    0.03 0.19    5.57    5.84    5.96    6.10
L[1,1,1]    1.00     NaN 0.00    1.00    1.00    1.00    1.00
L[1,1,2]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[1,1,3]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[1,1,4]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[1,2,1]   -0.01    0.48 0.81   -0.91   -0.83   -0.52    0.85
L[1,2,2]    0.57    0.00 0.12    0.39    0.49    0.55    0.63
L[1,2,3]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[1,2,4]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[1,3,1]    0.24    0.24 0.77   -0.91   -0.80    0.75    0.84
L[1,3,2]   -0.06    0.08 0.31   -0.58   -0.30   -0.13    0.22
L[1,3,3]    0.49    0.00 0.09    0.34    0.42    0.47    0.54
L[1,3,4]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[1,4,1]   -0.50    0.22 0.62   -0.92   -0.87   -0.81   -0.63
L[1,4,2]    0.24    0.05 0.24   -0.24    0.09    0.25    0.40
L[1,4,3]   -0.25    0.04 0.14   -0.55   -0.33   -0.24   -0.15
L[1,4,4]    0.40    0.00 0.07    0.28    0.35    0.39    0.44
L[2,1,1]    1.00     NaN 0.00    1.00    1.00    1.00    1.00
L[2,1,2]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[2,1,3]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[2,1,4]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[2,2,1]    0.80    0.00 0.09    0.59    0.78    0.83    0.86
L[2,2,2]    0.58    0.00 0.10    0.42    0.51    0.57    0.63
L[2,2,3]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[2,2,4]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[2,3,1]    0.82    0.00 0.10    0.62    0.79    0.84    0.87
L[2,3,2]    0.30    0.00 0.13    0.06    0.21    0.29    0.37
L[2,3,3]    0.46    0.00 0.08    0.33    0.40    0.45    0.50
L[2,3,4]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[2,4,1]    0.86    0.00 0.06    0.70    0.84    0.87    0.90
L[2,4,2]    0.32    0.00 0.12    0.10    0.24    0.31    0.39
L[2,4,3]    0.06    0.00 0.10   -0.14    0.00    0.06    0.13
L[2,4,4]    0.35    0.00 0.06    0.25    0.31    0.34    0.38
L[3,1,1]    1.00     NaN 0.00    1.00    1.00    1.00    1.00
L[3,1,2]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[3,1,3]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[3,1,4]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[3,2,1]    0.88    0.00 0.06    0.73    0.86    0.89    0.91
L[3,2,2]    0.47    0.00 0.09    0.33    0.41    0.46    0.51
L[3,2,3]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[3,2,4]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[3,3,1]   -0.33    0.65 0.80   -0.95   -0.91   -0.87    0.76
L[3,3,2]   -0.01    0.12 0.19   -0.32   -0.15   -0.04    0.11
L[3,3,3]    0.46    0.05 0.10    0.31    0.38    0.44    0.52
L[3,3,4]    0.00     NaN 0.00    0.00    0.00    0.00    0.00
L[3,4,1]   -0.25    0.60 0.78   -0.92   -0.87   -0.80    0.75
L[3,4,2]   -0.12    0.11 0.20   -0.48   -0.26   -0.14    0.01
L[3,4,3]    0.21    0.08 0.16   -0.12    0.11    0.20    0.31
L[3,4,4]    0.45    0.01 0.09    0.32    0.39    0.43    0.49
lp__     -389.50    1.86 5.25 -400.88 -393.05 -389.08 -385.65
           97.5% n_eff  Rhat
theta[1]    0.43  2059  1.00
theta[2]    0.44  1758  1.00
theta[3]    0.43  1973  1.00
mu[1,1]     0.29   816  1.00
mu[1,2]     0.36   800  1.00
mu[1,3]     0.34   832  1.00
mu[1,4]     0.43   872  1.00
mu[2,1]     3.44   728  1.00
mu[2,2]     3.45   814  1.00
mu[2,3]     3.35   824  1.00
mu[2,4]     3.47   803  1.00
mu[3,1]     6.41    43  1.05
mu[3,2]     6.31    54  1.04
mu[3,3]     6.30    35  1.05
mu[3,4]     6.32    48  1.04
L[1,1,1]    1.00   NaN   NaN
L[1,1,2]    0.00   NaN   NaN
L[1,1,3]    0.00   NaN   NaN
L[1,1,4]    0.00   NaN   NaN
L[1,2,1]    0.91     3  1.64
L[1,2,2]    0.88   645  1.00
L[1,2,3]    0.00   NaN   NaN
L[1,2,4]    0.00   NaN   NaN
L[1,3,1]    0.91    10  1.46
L[1,3,2]    0.54    14  1.20
L[1,3,3]    0.70   767  1.01
L[1,3,4]    0.00   NaN   NaN
L[1,4,1]    0.83     8  1.71
L[1,4,2]    0.71    22  1.14
L[1,4,3]    0.01    15  1.07
L[1,4,4]    0.57   961  1.00
L[2,1,1]    1.00   NaN   NaN
L[2,1,2]    0.00   NaN   NaN
L[2,1,3]    0.00   NaN   NaN
L[2,1,4]    0.00   NaN   NaN
L[2,2,1]    0.91   714  1.00
L[2,2,2]    0.81   859  1.00
L[2,2,3]    0.00   NaN   NaN
L[2,2,4]    0.00   NaN   NaN
L[2,3,1]    0.92   516  1.01
L[2,3,2]    0.56   898  1.00
L[2,3,3]    0.65   843  1.01
L[2,3,4]    0.00   NaN   NaN
L[2,4,1]    0.94   801  1.00
L[2,4,2]    0.57   859  1.00
L[2,4,3]    0.26  1185  1.00
L[2,4,4]    0.50  1498  1.00
L[3,1,1]    1.00   NaN   NaN
L[3,1,2]    0.00   NaN   NaN
L[3,1,3]    0.00   NaN   NaN
L[3,1,4]    0.00   NaN   NaN
L[3,2,1]    0.94   545  1.00
L[3,2,2]    0.68   797  1.00
L[3,2,3]    0.00   NaN   NaN
L[3,2,4]    0.00   NaN   NaN
L[3,3,1]    0.90     2 12.13
L[3,3,2]    0.42     3  1.44
L[3,3,3]    0.70     4  1.26
L[3,3,4]    0.00   NaN   NaN
L[3,4,1]    0.88     2  3.08
L[3,4,2]    0.32     3  1.33
L[3,4,3]    0.54     4  1.26
L[3,4,4]    0.66   183  1.02
lp__     -380.48     8  1.17

Samples were drawn using NUTS(diag_e) at Sun Nov 25 19:20:57 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

which improves things, but still hasn’t really gotten this working.
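
For completeness, an actual kmeans-based initialization (instead of cheating with the true means) would look roughly like this (fit_km is just a new name for the resulting fit):

# use the kmeans centers, sorted by their first coordinate, as the
# initial component means for every chain
km <- kmeans(mixture_data$y, centers = mixture_data$K, nstart = 25)
mu_init <- km$centers[order(km$centers[, 1]), ]

fit_km <- sampling(
  gmm2,
  data = mixture_data,
  chains = 3,
  cores = 2,
  iter = 1000,
  control = list(adapt_delta = 0.98),
  init = replicate(3, list(mu = mu_init), simplify = FALSE)
)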

Is there anything else I can do to make this sample more reliably? I’d really like to fit GMMs over a whole range of K (from about 5 to 75) and then use loo to determine the number of clusters, but if each individual fit is this finicky I’m not sure this is the right approach.

Exchangeable mixture models don’t work.

Many are familiar with the label switching problem – I wrote about it at https://betanalpha.github.io/assets/case_studies/identifying_mixture_models.html – but few are aware of the many weak identifiabilities that manifest in exchangeable mixture models, especially mixtures of Gaussians where the covariance of each component is being fit at the same time as the locations.
Even once you’ve accounted for label switching there are many configurations of the mixture components that will fit any finite data set. Some are disjoint, some are continuous and spread across wide subspaces in parameter space, but Stan will struggle to explore them all. Keep in mind that these issues are not unique to Stan – Stan just explores fast enough that it doesn’t let you ignore them as other tools might.

For example, multiple components can collapse onto each other and be responsible for a single cluster in the true data generating process in many, many different ways. At the same time one component can widen to cover multiple clusters, pushing other components out into the extremes where they’ll attempt to fit stray data. A characteristic property of the resulting fits is that they are highly sensitive to the initial conditions, which is straightforward to check empirically. Indeed it’s consistent with the behavior that you see here!
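
(A minimal version of that empirical check, sketched in R against the model above: refit from a handful of different random initializations and compare the recovered component means. If this degeneracy is present, the runs will disagree wildly.)

fits <- lapply(1:4, function(s) {
  sampling(gmm, data = mixture_data, chains = 1, iter = 1000,
           seed = s, init = "random",
           control = list(adapt_delta = 0.98), refresh = 0)
})
lapply(fits, function(f) summary(f, pars = "mu")$summary[, "mean"])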

Sadly, just as you think you’ve compensated for one of these pathologies with priors or constraints, you uncover the many more that have been hiding there the whole time. Unfortunately this is a general problem with attempts to model complex phenomena “non-parametrically” (Gaussian mixture models, Dirichlet processes, Bayesian neural networks, etc.). Unless you can arrange the many components very, very carefully, the model configuration space will be highly degenerate for any reasonable data set.

Note that non-exchangeable mixture models, where one mixes together different distributions responsible for different processes, do not suffer from these problems and are extremely powerful in practice.

Okay, so it seems like this approach is a bad idea. Do you have any recommendations on what to try next?

Impossible to say without more information about the application, but in general the more generative the model, the easier it will be to engineer a parameterization that admits interpretable models and facilitates efficient computation. In other words, instead of trying to model a complex phenomenon with a non-parametric model, try to break the model down into layers of orthogonal, approximate behavior.

Gotcha. This is informative. I think I was reading the mixture model documentation as being about problems with sampling, rather than the model itself. Thank you.

In practice these can’t really be distinguished – poor sampling problems almost always indicate some issue with the modeling, especially when using tools like HMC. Exchangeable mixture models are an example where there are known problems with the model itself that manifest in poor sampling behavior.

I mostly got it to work. There are times when L is still not identifiable even though mu is identified. The issue with your model was that the ordering of mu was the wrong way around: ordered[D] mu[K] orders the dimensions within each component, whereas what breaks the label switching is ordering the components within each dimension, i.e. ordered[K] mu_raw[D]. I switched that around and put a hyperprior on the location of the normal prior for mu (coded as p). I also transformed theta to an unconstrained space, which helps the estimation.

data {
  int D; // number of dimensions
  int K; // number of gaussians
  int N; // number of data
  vector[D] y[N]; // data
}

parameters {
  vector[K] theta_raw; // unconstrained mixing proportions (softmax-transformed below)
  ordered[K] mu_raw[D]; // component means, ordered across components within each dimension
  positive_ordered[K] p; // hyperprior locations for the component means
  cholesky_factor_corr[D] L[K]; // cholesky factor of covariance
}
transformed parameters {
  simplex[K] theta = softmax(theta_raw);
  vector[D] mu[K];

  // transpose mu_raw (D ordered vectors of length K) into the
  // component-major layout used by multi_normal_cholesky below
  for (d in 1:D) {
    for (k in 1:K) {
      mu[k, d] = mu_raw[d, k];
    }
  }
}
model {
  real ps[K];
  
  p ~ normal(0, 3);
  for (k in 1:K) {
    for (d in 1:D) mu[k, d] ~ normal(p[k], 3);
    L[k] ~ lkj_corr_cholesky(4);
  }
  theta_raw ~ normal(0.0, 0.5);

  
  for (n in 1:N) {
    
    for (k in 1:K) {
      //increment log probability of the gaussian
      ps[k] = log(theta[k]) + multi_normal_cholesky_lpdf(y[n] | mu[k], L[k]);
    }
    
    target += log_sum_exp(ps);
  }
}

The call from R:

fit <- rstan::stan('stan_help.stan',
            data = mixture_data,
            chains = 2,
            cores = 2,
            control = list(adapt_delta = 0.98, max_treedepth = 12),
            iter =  1000,
            seed = 221
)

Output

                 mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
theta_raw[1]   -0.01    0.01 0.30   -0.62   -0.20   -0.02    0.19    0.54   525 1.00
theta_raw[2]    0.02    0.01 0.31   -0.57   -0.21    0.03    0.22    0.63   710 1.00
theta_raw[3]    0.00    0.01 0.30   -0.58   -0.21    0.00    0.20    0.60   600 1.00
mu_raw[1,1]     0.04    0.01 0.18   -0.35   -0.08    0.05    0.16    0.39   320 1.00
mu_raw[1,2]     2.94    0.01 0.19    2.56    2.82    2.94    3.06    3.34   446 1.00
mu_raw[1,3]     5.99    0.01 0.19    5.62    5.87    5.99    6.11    6.36   540 1.00
mu_raw[2,1]     0.02    0.01 0.18   -0.33   -0.10    0.01    0.13    0.39   284 1.00
mu_raw[2,2]     2.96    0.01 0.20    2.53    2.83    2.96    3.09    3.33   444 1.00
mu_raw[2,3]     6.10    0.01 0.18    5.73    5.98    6.10    6.23    6.45   603 1.00
mu_raw[3,1]     0.03    0.01 0.18   -0.34   -0.10    0.03    0.15    0.38   292 1.01
mu_raw[3,2]     2.92    0.01 0.20    2.53    2.80    2.92    3.04    3.32   435 1.00
mu_raw[3,3]     6.04    0.01 0.19    5.67    5.90    6.04    6.16    6.41   504 1.00
mu_raw[4,1]    -0.04    0.01 0.18   -0.39   -0.16   -0.04    0.07    0.31   335 1.00
mu_raw[4,2]     2.94    0.01 0.20    2.54    2.81    2.95    3.08    3.32   408 1.00
mu_raw[4,3]     5.88    0.01 0.19    5.52    5.75    5.88    6.00    6.24   536 1.00
p[1]            0.86    0.02 0.68    0.03    0.32    0.70    1.27    2.51   798 1.00
p[2]            2.53    0.04 1.03    0.69    1.77    2.53    3.25    4.58   784 1.00
p[3]            5.06    0.04 1.20    2.92    4.23    5.04    5.84    7.55  1108 1.00
L[1,1,1]        1.00     NaN 0.00    1.00    1.00    1.00    1.00    1.00   NaN  NaN
L[1,1,2]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[1,1,3]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[1,1,4]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[1,2,1]       -0.73    0.05 0.36   -0.92   -0.88   -0.84   -0.77    0.67    59 1.03
L[1,2,2]        0.57    0.01 0.13    0.38    0.47    0.54    0.64    0.95   118 1.02
L[1,2,3]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[1,2,4]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[1,3,1]        0.01    0.17 0.69   -0.85   -0.73    0.16    0.71    0.86    16 1.09
L[1,3,2]       -0.14    0.05 0.33   -0.71   -0.38   -0.18    0.11    0.52    36 1.04
L[1,3,3]        0.62    0.00 0.12    0.43    0.54    0.61    0.69    0.89   815 1.00
L[1,3,4]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[1,4,1]       -0.03    0.21 0.73   -0.87   -0.76   -0.39    0.77    0.88    12 1.11
L[1,4,2]        0.11    0.06 0.32   -0.51   -0.15    0.17    0.35    0.68    29 1.05
L[1,4,3]       -0.21    0.02 0.22   -0.60   -0.35   -0.24   -0.10    0.26    81 1.02
L[1,4,4]        0.49    0.00 0.09    0.35    0.43    0.48    0.55    0.72   408 1.00
L[2,1,1]        1.00     NaN 0.00    1.00    1.00    1.00    1.00    1.00   NaN  NaN
L[2,1,2]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[2,1,3]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[2,1,4]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[2,2,1]        0.61    0.18 0.51   -0.88    0.70    0.79    0.84    0.90     8 1.24
L[2,2,2]        0.60    0.01 0.11    0.42    0.52    0.58    0.67    0.85   428 1.00
L[2,2,3]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[2,2,4]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[2,3,1]        0.90    0.00 0.04    0.79    0.88    0.91    0.93    0.95   601 1.00
L[2,3,2]        0.15    0.03 0.14   -0.20    0.09    0.17    0.23    0.40    22 1.08
L[2,3,3]        0.38    0.00 0.07    0.28    0.33    0.37    0.42    0.54   791 1.00
L[2,3,4]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[2,4,1]        0.68    0.18 0.49   -0.84    0.79    0.85    0.88    0.92     7 1.27
L[2,4,2]        0.31    0.01 0.14    0.08    0.23    0.30    0.39    0.60   528 1.01
L[2,4,3]        0.00    0.02 0.11   -0.23   -0.07    0.00    0.07    0.20    53 1.03
L[2,4,4]        0.41    0.00 0.07    0.30    0.36    0.40    0.45    0.58   872 1.00
L[3,1,1]        1.00     NaN 0.00    1.00    1.00    1.00    1.00    1.00   NaN  NaN
L[3,1,2]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[3,1,3]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[3,1,4]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[3,2,1]        0.41    0.19 0.68   -0.87    0.02    0.78    0.84    0.90    13 1.14
L[3,2,2]        0.60    0.01 0.12    0.43    0.52    0.59    0.67    0.91   506 1.01
L[3,2,3]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[3,2,4]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[3,3,1]       -0.28    0.56 0.75   -0.91   -0.85   -0.77    0.74    0.88     2 2.18
L[3,3,2]        0.05    0.18 0.29   -0.46   -0.19    0.09    0.26    0.53     3 1.31
L[3,3,3]        0.51    0.00 0.10    0.36    0.44    0.49    0.56    0.76   612 1.01
L[3,3,4]        0.00     NaN 0.00    0.00    0.00    0.00    0.00    0.00   NaN  NaN
L[3,4,1]       -0.39    0.42 0.70   -0.91   -0.86   -0.81    0.55    0.84     3 1.58
L[3,4,2]        0.02    0.14 0.26   -0.43   -0.18    0.01    0.22    0.52     3 1.19
L[3,4,3]        0.28    0.02 0.15   -0.02    0.19    0.27    0.37    0.59    48 1.02
L[3,4,4]        0.43    0.01 0.08    0.30    0.37    0.42    0.47    0.61    98 1.03
theta[1]        0.33    0.00 0.05    0.24    0.30    0.33    0.36    0.43  1396 1.00
theta[2]        0.34    0.00 0.05    0.25    0.30    0.34    0.37    0.44  1308 1.00
theta[3]        0.33    0.00 0.05    0.25    0.30    0.33    0.37    0.43  1269 1.00
mu[1,1]         0.04    0.01 0.18   -0.35   -0.08    0.05    0.16    0.39   320 1.00
mu[1,2]         0.02    0.01 0.18   -0.33   -0.10    0.01    0.13    0.39   284 1.00
mu[1,3]         0.03    0.01 0.18   -0.34   -0.10    0.03    0.15    0.38   292 1.01
mu[1,4]        -0.04    0.01 0.18   -0.39   -0.16   -0.04    0.07    0.31   335 1.00
mu[2,1]         2.94    0.01 0.19    2.56    2.82    2.94    3.06    3.34   446 1.00
mu[2,2]         2.96    0.01 0.20    2.53    2.83    2.96    3.09    3.33   444 1.00
mu[2,3]         2.92    0.01 0.20    2.53    2.80    2.92    3.04    3.32   435 1.00
mu[2,4]         2.94    0.01 0.20    2.54    2.81    2.95    3.08    3.32   408 1.00
mu[3,1]         5.99    0.01 0.19    5.62    5.87    5.99    6.11    6.36   540 1.00
mu[3,2]         6.10    0.01 0.18    5.73    5.98    6.10    6.23    6.45   603 1.00
mu[3,3]         6.04    0.01 0.19    5.67    5.90    6.04    6.16    6.41   504 1.00
mu[3,4]         5.88    0.01 0.19    5.52    5.75    5.88    6.00    6.24   536 1.00
lp__         -384.51    1.06 5.08 -395.27 -387.75 -384.16 -380.82 -375.99    23 1.06
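
To compare the recovered component means against the simulated truth, something like the following should work with the fit above:

mu_draws <- rstan::extract(fit, pars = "mu")$mu  # draws x K x D array
round(apply(mu_draws, c(2, 3), mean), 2)
# rows should sit near (0, 0, 0, 0), (3, 3, 3, 3) and (6, 6, 6, 6)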

Somewhat related followup: how can Stan beginners distinguish between a model that genuinely doesn’t sample well and a poor parameterization of a model? If you came across a model that you’d never seen before that was doing all sorts of nonsense, what kinds of things would you look for and what sort of diagnostic process would you follow?

I’d say “Statistics beginners”. The thing with Stan (and the underlying HMC implementation) is that (a) it samples fast enough for you to see when things go haywire and (b) it has specific diagnostics that tell you what’s broken, sorta (think tree_depth, divergence and BFMI warnings: they tell you different things). In my opinion there’s no substitute for careful conceptual (does the model make physical sense?), mathematical (is the model sound?) and computational (can I fit this model to fake data?) analysis.
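
For the computational side, rstan also exposes those diagnostics directly; on any fitted stanfit object (here called fit) a minimal check is something like:

rstan::check_hmc_diagnostics(fit)  # divergences, treedepth, E-BFMI in one pass
rstan::get_num_divergent(fit)
rstan::get_bfmi(fit)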

With time you learn to look for “obvious” things like scaling, linear dependency between parameters, etc. Anyway, my 2 pence.

Have you looked at the case studies that have “workflow” in the title? They’re on the mc-stan.org site.

Some practical advice on taming divergences (i.e. bad sampling)

Some advice on identifiability (i.e. one reason for poor parameterization)
