Convergence issues with brms mixture models

Hi there,

I’ve recently started trying to fit mixture models (using brms) to account for bimodality in my response distribution. The posterior predictive distribution fits the data reasonably well with a mixture of two Gaussian distributions (at least better than a standard non-mixture model), but I am now getting convergence issues where before I had no problems whatsoever getting my models to converge. The models also take much longer to run than I’m used to (one model literally took roughly 24 hours).

Could anyone explain why I am seeing these convergence issues with mixture models when I wasn’t having any before, and suggest a way to fix them? I am a relative beginner with brms, and even more so with mixture models.

Please see below for my code:

mix <- mixture(gaussian, gaussian)

prior <- c(prior(normal(-2,10), Intercept, dpar = mu1),
               prior(normal(7,10), Intercept, dpar = mu2),
               prior(normal(0,10), b, dpar = mu1),
               prior(normal(0,10), b, dpar = mu2),
               prior(cauchy(0,.5), sd, dpar = mu1),
               prior(cauchy(0,.5), sd, dpar = mu2))

mixture_model <- brm(bf(formula = accuracy ~ drug +
                          (1 | sub) +
                          (1 | item)),
                     data = dat, 
                     family = mix,
                     warmup = 1000, iter = 5000, 
                     cores = parallel::detectCores(),
                     chains = 4, control = list(adapt_delta = .99), 
                     prior = prior, sample_prior = TRUE,
                     save_pars = save_pars(all = TRUE))

And these are the convergence warnings I’m getting:

Warning messages:
1: Rows containing NAs were excluded from the model. 
2: There were 5820 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. 
3: There were 8002 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
4: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
https://mc-stan.org/misc/warnings.html#bfmi-low 
5: Examine the pairs() plot to diagnose sampling problems
6: The largest R-hat is 2.62, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat 
7: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess 
8: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess

And below for some system specs:
OS: MacOS Monterey 12.3 (M1 Mac)
R version 4.3.1
brms version 2.16.3

Highly appreciate any kind of advice on this!

Could you check (and post) your trace plots? I suspect that your model has trouble separating the modes without priors to push them apart. You might see some chains converge on one mode and some on the other.
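In case it helps, here is a minimal way to pull trace plots, assuming your fitted object is called `mixture_model` as in your code above (parameter names like `b_mu1_Intercept` are the default brms naming for the component intercepts; double-check against `variables()`/`parnames()` on your fit):

```r
library(brms)
library(bayesplot)

# Default diagnostic plots (densities + traces) for the main parameters:
plot(mixture_model)

# Or trace plots for just the two component intercepts:
mcmc_trace(as.array(mixture_model),
           regex_pars = "b_mu[12]_Intercept")
```

If the chains are label-switching or each chain sits on a different mode, it will be obvious here: individual chains look stationary but sit at different levels.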

Hi, thanks for your reply. Exactly what you’re describing seems to be happening: every time it’s a different chain that doesn’t converge, and only very rarely does the model converge at all. It’s just not reliable.

See below for an example trace plot:

Would you mind explaining what you mean by ‘problems with separating the modes without priors to push them apart’?

Thanks a lot!

My guess is that the sd of your intercept priors is way too wide. It allows both intercept parameters to cover both modes. So my first step would be to reduce the sd so that the mixture components don’t plausibly overlap with both modes.
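Concretely, something along these lines (the sd values are purely illustrative; the right values depend on the scale of your outcome and how far apart the modes are):

```r
# Narrower intercept priors so each component stays near "its" mode;
# sd = 1 here is an illustration, not a recommendation:
prior <- c(prior(normal(-2, 1), Intercept, dpar = mu1),
           prior(normal(7, 1),  Intercept, dpar = mu2),
           prior(normal(0, 10), b,  dpar = mu1),
           prior(normal(0, 10), b,  dpar = mu2),
           prior(cauchy(0, .5), sd, dpar = mu1),
           prior(cauchy(0, .5), sd, dpar = mu2))
```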


Thanks a lot for that tip, that does indeed seem to have solved my problem! The chains now mix perfectly.

Nice. Just as a small illustration, this is what those priors looked like. You can see that there is a large area where they overlap, i.e. both mixture components explore that space and might catch a mode on their journey.
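You can reproduce that picture in a couple of lines of base R by plotting the two intercept prior densities on top of each other:

```r
# Plot the two intercept priors to see how much they overlap.
x <- seq(-40, 40, length.out = 1000)
plot(x, dnorm(x, mean = -2, sd = 10), type = "l", col = "blue",
     xlab = "Intercept", ylab = "density")
lines(x, dnorm(x, mean = 7, sd = 10), col = "red")
# With sd = 10, normal(-2, 10) and normal(7, 10) are nearly
# indistinguishable, so either component can reach either mode.
```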

Just for fun, you might want to fit one of those mixture models with the sample_prior = "only" option and look at the pp_check output to see how the priors you specified translate onto the outcome scale. I would guess that they produce unreasonably large values.
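A sketch of that prior predictive check, reusing the `mix`, `prior`, and `dat` objects from the original post:

```r
# Prior predictive check: sample from the priors only, ignoring the
# likelihood, then compare the implied outcomes to the observed data.
prior_fit <- brm(accuracy ~ drug + (1 | sub) + (1 | item),
                 data = dat, family = mix, prior = prior,
                 sample_prior = "only",
                 chains = 4, iter = 2000)

# Overlay draws from the prior predictive distribution on the data:
pp_check(prior_fit)
```

If the simulated outcomes span a wildly larger range than your actual accuracy values, that is a sign the priors are far too diffuse.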


Thank you again very much for your help and for the illustration. It seems that both setting smaller SDs for the mixture components’ intercept priors and specifying the correct proportions of mixture components 1 and 2 (using theta) helped convergence in my case, although I’m a bit surprised that I had to set the SDs as low as 0.1 to get convergence, when the plot suggests I should be able to use larger SD values and still discriminate the two distributions.