Convergence issues with over-fitting mixtures

Hi everyone,

I am trying to fit a Spare Finite Mixture (SFM) (Malsiner-Walli et al. 2016) of Normal distributions using Stan in R. The idea of the SFM approach is to use many components but shrink the weights of superfluous components to zero. I am testing the model with the Galaxy dataset.

The problem is that the MCMC algorithm does not converge when some of the components are close to zero or the lower bound. I get the warning “Warning: There were X divergent transitions after warmup”. It works fine with 2 or 3 components but breaks done with more than that. Alternatively it is possible to use many components by forcing them to have similar weights (by setting e_0 large in the code below). But this is not interesting.

Note that parameters’ Rhats are mostly always bad because of label switching. We don’t mind label-switching in itself because we are only interested in the mixture, not its components. To assess convergence we can look at a graph of the mixture at each iteration and assess the Rhat of lp__ . Ordering one of the parameters does not solve the issue of divergent transitions.

The Stan code for the mixture is :

data {
  int<lower=1> K;          // number of mixture components
  int<lower=1> N;          // number of data points
  real y[N];               // observations
  real<lower=0> e0;        // prior - number of components
  real<lower=0> a0;        // prior - number of components
  real<lower=0> A0;        // prior - number of components
  real b0;                 // prior - mean
  real<lower=0> B0;        // prior - mean
  real<lower=0> c0;        // prior - variance 
  real<lower=0> g0;        // prior - variance 
  real<lower=0> G0;        // prior - variance 
  int<lower = 0, upper = 1> SFM;
  int<lower = 0, upper = 1> hier;
  int<lower = 0, upper = 1> conj;
}

parameters {
  simplex[K] theta_0;   
  vector[K] mu;
  vector<lower=0>[K] sigmaSQ;
  vector<lower=0>[SFM ? 1 : 0] alpha;   // parameter mixing proportions
  vector<lower=0>[hier ? K : 0] C0;      // see https://discourse.mc-stan.org/t/if-else-statement-inside-parameter-block/13937/3
}

transformed parameters {
  vector<lower=0>[K] sigma;                // scales of mixture components
  
  simplex[K] theta;
  theta = 0.1 + (1-0.1*K)*theta_0;
  
  for (i in 1:K) {
    sigma[i] = sqrt(sigmaSQ[i]);
  }
  
}

model {
  vector[K] log_theta;
  vector[K] lps;
  
  for (k in 1:K) {
    if (hier) {
      C0[k] ~ gamma(g0, G0);
      sigmaSQ[k] ~ inv_gamma(c0, C0[k]);
    } else {
      sigmaSQ[k] ~ inv_gamma(c0, g0);
    }
    
    if (conj) {
      mu[k] ~ normal(b0, B0*sigmaSQ[k]);
    } else {
      mu[k] ~ normal(b0, B0);
    }
  }
  
  if (SFM) {
    alpha[1] ~ gamma(a0, A0);
    theta_0 ~ dirichlet(rep_vector(alpha[1], K));
  } else {
    theta_0 ~ dirichlet(rep_vector(e0, K));
  }
  
  log_theta = log(theta);
  
  for (n in 1:N) {
    lps = log_theta;
    for (k in 1:K) {
      lps[k] += normal_lupdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
} 

in Rmarkdown with {stan output.var=“normal”}

The R code is

y = multimode::galaxyrg

library(rstan)
K = 4

iter = 2000
data = y

fit <- sampling(
  object = normal,  # Stan program
  data = list(K = K,
              N = length(data),
              y = data,
              a0 = 10,
              A0 = 10*K,
              b0 = median(data),
              B0 = (max(data) - min(data))^2,
              c0 = 2.5,
              e0 = 1/K,
              g0 = 0.5,
              G0 = 100*0.5/2.5/(max(data) - min(data))^2,#0.5/(sd(y)^2/2),
              hier = F,
              conj = F,
              SFM = F
  ),
  control = list(adapt_delta = 0.999),
  # control = list(adapt_delta = 0.999, stepsize = 0.001, max_treedepth = 20),
  chains = 4,          
  warmup = iter/2,       
  iter = iter,         
  cores = 4,             
  refresh = iter/10           
)

I have tried to put a lower bound on the mixture weights by setting

simplex[K] theta_0; 

in the parameter block and

simplex[K] theta;
theta = 0.1 + (1-0.1*K)*theta_0;

in the transformed parameters block, but but this does not appear to help. And I have also tried

control = list(adapt_delta = 0.999, stepsize = 0.001, max_treedepth = 20),

but this makes estimation a lot slower and does not solve the problem aither.

A bit of background :
I am the maintainer of BayesMultiMode, an R package for exploring multimodality using Bayesian mixture methods. Currently the package only supports a (shifted) Poisson distribution. We are trying to extend the package to continuous mixtures and thought Stan would be convenient for that. I have created a new branch with Stan code incorporated here.

I wonder if you have any suggestion that could help improve convergence when modelling many components ?

Thanks for reading!

Beyond that, I’d guess you have problems with identifiability with the weights and the parameters for distribution means, that may explain poor performance and have some rapid deterioration with the increase in the number of components, even if there wasn’t any label switches.