Convergence issues with gamma mixture models in brms: intercepts keep "swapping"

There is a closely related post, but I am specifically looking for help with gamma mixture models.

I was able to achieve convergence on a gamma mixture model by setting lb and ub on the intercept priors so that the intercepts didn’t “swap” places. I tried to do this without setting hard bounds on the parameters but no matter how tight I made the sd on the intercept priors, they still managed to swap places unless I set the hard bound to prevent them from swapping.

However, the model that I got to converge is just a simpler version of the real model I want to fit, which also includes a random effect in addition to the fixed treatment effect. Note that all the fixed and random effects act only on the mu parameter of the gamma distribution; the shape parameter is common across both treatments. I don’t know if this is appropriate or if I should change that.

This is the simpler model with only a fixed effect that converged:

gammamix2_prior <- c(
  prior(normal(log(0.3), 1), dpar = 'mu1', class = 'Intercept', lb = log(.01), ub = log(.5)),
  prior(normal(log(0.8), 1), dpar = 'mu2', class = 'Intercept', lb = log(.5), ub = log(10)),
  prior(normal(0, 0.1), dpar = 'mu1', class = 'b'),
  prior(normal(0, 0.1), dpar = 'mu2', class = 'b'),
  prior(gamma(1, 1), class = 'shape1'),
  prior(gamma(1, 1), class = 'shape2')
)

fit_gammamix2_area_nore <- brm(bf(area ~ treatment), 
                               data = dat_long, family = mixture(Gamma(link = 'log'), Gamma(link = 'log')), prior = gammamix2_prior,
                               chains = 3, iter = 3000, warmup = 2000, seed = 3567)

And here is the model with an additional random effect. This one didn’t converge. Note I added additional iterations and increased adapt_delta to attempt to deal with divergent transitions.

fit_gammamix2_area <- brm(bf(area ~ treatment + (1 | sample:treatment)), 
                          data = dat_long, family = mixture(Gamma(link = 'log'), Gamma(link = 'log')), prior = gammamix2_prior,
                          chains = 3, iter = 5000, warmup = 4000,
                          control = list(adapt_delta = 0.95), seed = 5567)

Here is what the data look like. First, here are the distributions for the treatments, ignoring the random effect of sample, overlaid with the fitted distributions and 95% credible intervals from the model, to show that the 2-component gamma mixture is a good fit (this fit does not include the random term). Incidentally, I did try a simpler fit with only a single gamma distribution (unimodal, not a mixture model), but the fit was very poor, so I am committed to making the mixture model work if possible.

[image: observed and fitted density curves by treatment]
(thick lines are observed density curves; thin lines and shaded regions are model fits with 95% credible intervals)

And here’s what the individual samples’ distributions look like, to show what kind of random variation we are dealing with.

[image: density curves for the individual samples]

Is there any additional prior specification that might help, or some other tricks? Thanks in advance.

To add some additional context: I can see from the trace plot that the posterior for both theta parameters is bimodal, with modes at roughly zero and one. Some chains are getting stuck at zero and some are getting stuck at one. So it looks like it isn’t really fitting a mixture model with reasonable proportions of both components. I see the default prior for theta is dirichlet(1). How should I modify this prior to put more weight on thetas closer to 0.5, and low or zero weight on thetas closer to 0 and 1?

I have tried setting dirichlet(10) prior on theta, and a tight prior of exponential(10) on the standard deviation of the random intercepts for mu1 and mu2, to keep the sampler from trying to make those random effects too big. Together, those things seem to have helped a lot. It often seems to be the case that I have to tighten up those priors way more than I first thought … maybe that is because I have poor intuition on what reasonable priors should be?
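For reference, here is roughly what those additional priors look like in code (a sketch; `gammamix2_prior_re` is a hypothetical name, and the sd priors assume the random-intercept model above):

```r
# Sketch of the additional priors described above, appended to gammamix2_prior:
# a dirichlet(10) prior pulling the mixing proportions toward 0.5, and tight
# exponential(10) priors on the random-intercept sds for mu1 and mu2.
gammamix2_prior_re <- c(
  gammamix2_prior,
  prior(dirichlet(10), class = 'theta'),
  prior(exponential(10), class = 'sd', dpar = 'mu1'),
  prior(exponential(10), class = 'sd', dpar = 'mu2')
)
```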

Just a small thing: your priors might still make it hard for the model to differentiate the modes for the intercepts. I plotted them here:


So my question is why you use such a wide standard deviation and then truncate it, instead of using more distinct priors. Wide in the sense that the standard deviation covers the entire distance between the two prior means and, given the tail behavior of the normal distribution, even beyond that. I would be curious to see the trace plots, especially without the bounds.

the shape parameter is common across both treatments. I don’t know if this is appropriate or if I should change that.

If you have a reason to think that the shape parameter is a function of your data then you could model it as well.

And here is the model with an additional random effect. This one didn’t converge.

How many observations do you have per sample:treatment? If there are not enough, MCMC might struggle to estimate the sd of the groups.

I can see from the trace plot that the posterior for both theta parameters is bimodal with modes of roughly zero and one

That sounds to me like MCMC thinks only one of the components is necessary. That might, however, be a symptom of the convergence problems, since both mixture components may be too similar.

Finally, I want to ask whether you have a theory of where the two mixture parts come from. Are there different processes at work that you can’t explicitly add to a non-mixture model? If so, maybe develop both mixture parts individually, look at their joint priors (via sample_prior = "only"), and then fit the mixture model.
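A prior-predictive check along those lines might look like this (a sketch, reusing the data and prior names from earlier in the thread; `prior_fit` is a hypothetical name):

```r
# Sketch: sample from the priors only (no conditioning on the data) to see
# what the joint prior implies before fitting the full mixture.
prior_fit <- brm(
  bf(area ~ treatment),
  data = dat_long,
  family = mixture(Gamma(link = 'log'), Gamma(link = 'log')),
  prior = gammamix2_prior,
  sample_prior = 'only',
  chains = 1, iter = 1000, seed = 1
)
pp_check(prior_fit, ndraws = 50)  # compare prior predictive draws to the data
```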


@scholz Thanks a lot for the thoughtful comments. I agree that the modes are very close together, so the model is having a hard time separating them. I ended up getting the model to converge on one dataset by decreasing the standard deviation of the intercept priors to 0.25. But I am still struggling to get a model to converge on a different dataset.

To answer some of your questions,

If you have a reason to think that the shape parameter is a function of your data then you could model it as well.

Can you please help with the formula syntax to model the shape parameter as a function of treatment as well, in case I want to pursue that?

How many observations do you have per sample:treatment ? If there are not enough, mcmc might struggle to estimate the sd of the groups.

There are roughly 200 observations per sample:treatment, which seems adequate to me, but maybe not for a mixture distribution?

Finally I want to ask if you have a theory where the two mixture parts come from. Are there different processes at work that you can’t explicitly add to a non-mixture model?

I see your point here. I am a little fuzzy on the underlying theory, but it seems like there are good reasons to assume bimodality. These are droplets coming out of a spray nozzle. Because of surface tension, the droplets tend to cluster around certain sizes, and they can either coalesce or break up, leading to multiple “peaks” in the distribution. And when I did get the gamma mixture to converge for one dataset and used LOO cross-validation to compare it to a single-component gamma, the difference was >50 with an SE of 10, giving some evidence that the mixture model is well supported.

Anyway, thanks again for your help so far. If there’s any way you can help with the syntax for the variable shape parameter, that would be very helpful.

I ended up getting the model to converge on the other dataset. I kept reducing the standard deviation on the prior distribution for the mu1 and mu2 intercepts, first to 0.25, then to 0.1, then all the way down to 0.05, at which point it finally converged. I think that is reasonable, especially because the inference I really care about is the treatment effect, not the intercepts.

There is a nice brms vignette about that here
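In the meantime, here is a sketch of what the formula could look like (untested; note that the gamma(1, 1) priors on class 'shape1'/'shape2' in gammamix2_prior would no longer apply once the shapes get their own formulas, so it is omitted here):

```r
# Sketch: model each component's shape as a function of treatment as well.
# Distributional parameters of mixture components get numeric suffixes in
# brms, so the two shapes are 'shape1' and 'shape2' (log link by default,
# so their coefficients are on the log scale).
fit_gammamix2_shape <- brm(
  bf(area ~ treatment,
     shape1 ~ treatment,
     shape2 ~ treatment),
  data = dat_long,
  family = mixture(Gamma(link = 'log'), Gamma(link = 'log')),
  chains = 3, iter = 5000, warmup = 4000, seed = 5567
)
```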

That sounds like plenty.

While it might be a hard read, here is some additional material regarding mixtures from Michael Betancourt:
