Possible to specify which datapoints belong to each distribution in mixture model?

I am building a double-gaussian mixture model. I know the variable (2-level factor) which explains the existence two separate distributions. How can I specify this information in the model?

In the example below, as in my real data, the data y are split between two gaussian distributions (83% and 17% respectively), and each distribution aligns with a factor level in var. The real (non-simulated) data patterns like this, so the simulation isn’t perfect but hopefully sufficient for this question:

My model also contains 7 predictors variables (3 in the example below). In my real model, as I add the final predictors (6+7), the BULK_Ess decrease substantially, Rhats rise, and theta proportions move away closer to 0.5/0.5. I think this might be solved by adding the information about which factor level contributes to each distribution.

Is there a way to tell the model that mu1 is strictly associated factor level 1 of var and mu2 with factor level 2 of the same variable?

Apologies if this is in the documentation and I missed it! Also, apologies if I’m on the completely wrong track. Many thanks in advance.

# Simulated data
var1 <- rep(1, 830)
y1 <- rnorm(830, 13, 22)
data1 <- data.frame(y1, var1)
data1 <- rename(data1, c("y1"="y", "var1"="var"))

var2 <- rep(2, 170)
y2 <- rnorm(170, 43, 22)
data2 <- data.frame(y2, var2)
data2 <-rename(data2, c("y2"="y", "var2"="var"))

sim_data <- rbind(data1, data2)
sim_data$var <- as.factor(sim_data$var)
sim_data$expl1 <- rnorm(1000, 30, 3)
sim_data$expl2 <- rnorm(1000, 118, 21)
sim_data$expl3 <- rnorm(1000, 65, 19)

sim_data %>%
  ggplot(aes(x = y, fill = var))+
  geom_histogram(binwidth = 1)+
  facet_grid(var~.)

# Example model 
mix <- mixture(gaussian, gaussian)
formula = bf(y ~ 1 + expl1 + expl2 + expl3)

test <- brm(formula = formula,
        data = sim_data,
        family = mix,
        chains = 1)

summary(test)
pp_check(test)

  • Operating System: Mac OS 10.15.8
  • brms Version: 2.17.0

If you know the two categories (moving = 1 and moving = 2), that correspond to the two gaussian components than you don’t have to use a mixture model. You can just model them as two groups with a different prior normal(mu[i], sd[i]) for i = 1,2. I am sorry, I am not really familiar how to do that in brms but it is equivalent to a hierarchical model where the mean and sd are allowed to vary.

In my real model, as I add the final predictors (6+7), the BULK_Ess decrease substantially, Rhats rise, and theta proportions move away closer to 0.5/0.5. I think this might be solved by adding the information about which factor level contributes to each distribution.

One possible explanation for this is that the predictors can predict the moving groups pretty well. So, after adjusting for the predictors there is no residual variation that can be explained by the mixture model (i.e. the two components are equivalent). This would explain the .5/.5. This would also explain the sampling problems because the model basically becomes unidentified. If this is the explanation for the results I quoted, then in the approach with two known groups, I would expect mu[1] and mu[2] to be approximately equal.

1 Like

@ryssamoffat: +1 what @stijn said. This is called a distributional model and can be done in brms: Estimating Distributional Models with brms

1 Like

Thank you both very much!

It was not my original intention to include Moving as a variable in the model, but it became clear that this would be necessary after the data were collected.

I fit the recommended distributional model, but found the disadvantage to be that there is only one estimate per parameter, rather than an estimate per parameter per distribution (which in my case provides valuable information).

formula = bf(simDif ~ 1 + Moving + a + b + c + d + e + f + (1|ID), sigma ~ Moving)

To resolve this, I’ve resorted to splitting the data according to the variable Moving which explains the two distributions, and have modelled each smaller dataset separately. I’m satisfied with this solution, but would be interested to know if you might have any other suggestions.

Thanks again!

Do you mean you want to know the sd’s for the two groups?

I think brms should have estimated two parameters sigma_Intercept and sigma_moving on the log scale and I think you can get the estimates by exponentiating the parameters if I read the vignette correctly.

Or do you mean that you want to allow a different effect across the two groups for a, b, c, d, e, and f? If so something like this is roughly equivalent to splitting the data.

formula = bf(simDif ~ Moving * (1 + a + b + c + d + e + f )+ (1|ID), sigma ~ Moving)

You can derive the distribution-specific parametets from the intercept and the coefficient for moving. Or, remove the intercept from your model specification by coding “0+Moving” (that’s a zero). Then you’ll get two estimates (one for each distribution)

Thank you both!