Using a single mixture weight parameter in a multivariate mixture model

I am working with data where around 30% of observations are measured incorrectly and are 2x larger than they should have been recorded. I want to use a mixture model to identify these observations to recover their actual value.

# simulate data
# (MacOS 11.1 and brms 2.14.4)
library(tidyverse)
set.seed(1234)
dat <- tibble(x = 1 + rbinom(300, 1, 0.69)) %>%
  mutate(y1 = x*rnorm(300, 2.5),
         y2 = x*rnorm(300, 5.5)
  )

As x is a hidden variable, what I have so far is:

# fit and check brms model
library(brms)
options(mc.cores = parallel::detectCores())
prior <- c(
  prior(normal(2, 1), class = Intercept, dpar = mu1, resp = y1),
  prior(normal(4, 2), class = Intercept, dpar = mu2, resp = y1),
  prior(normal(5, 1), class = Intercept, dpar = mu1, resp = y2),
  prior(normal(10, 2), class = Intercept, dpar = mu2, resp = y2)
)
fit1 <- brm(mvbind(y1, y2) ~ 1,
            dat, family = mixture(gaussian, nmix = 2),
            prior = prior,
            chains = 8)
summary(fit1)
pp_check(fit1, resp = "y1")
pp_check(fit1, resp = "y2")

From this model, I can get an estimate of each observation being in the incorrectly coded group with pp_mixture on both y1 and y2.

Currently the model creates two independent parameters for the mixture weights theta1 and theta2. However, whilst I don’t know the exact mixture weight, I know that the value of theta1 and theta2 are the same. I would like to either equate the two (theta1 = theta2) or model only one theta parameter. How can I set this in the model?

1 Like

Unfortunately, this is not supported and might never be, because it introduces some difficult issues - I discussed this a bit more broadly at Specifying joint mixture model

Your actual use case, where you know that both components are larger at the same time actually avoids the particular problems (and even better if you actually know exactly the amount of increase) but there is no straightforward way to tell this to brms. I think that you should be able to achieve this particular case with the custom_family feature.

A sketch of the code (I didn’t test it, please double check) - it is a bit hacky, so feel free to ask for clarifications if it is hard to follow. The biggest hack is that we treat this as a univariate custom distribution and abuse the vreal addition term to let us specify the second response. Second, we enforce global ordering by modelling the mixture as a mean of the first component and the positive shift for the second component. For simplicity we assume the same sigma for all components, but the extension to varying should be straightforward.

my_mix <- custom_family(
  "my_mix", dpars = c("mu1", "shift1", "mu2", "shift2", "sigma", "theta"),
  links = c("identity", "log", "identity", "log", "log", "logit"), 
  lb = c(NA, 0, NA, 0, 0, 0),
  ub = c(NA, NA, NA, NA, NA, 1),
  type = "real", vars = "vreal1[n]"
)

stan_funs <- "
  real my_mix_lpdf(real y1, real mu1, real shift1, real mu2, real shift2, real sigma, real theta, real y2) {
     real lp1 = log_mix(theta, normal_lpdf(y1 | mu1, sigma), normal_lpdf(y1 | mu1 + shift1, sigma));
     real lp2 = log_mix(theta, normal_lpdf(y2 | mu2, sigma), normal_lpdf(y2 | mu2 + shift2, sigma));
     return lp1 + lp2;
  }
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

fit <- brm(
  y1 | vreal(y2) ~ 1, data = dat,
  family = my_mix, stanvars = stanvars
)

Best of luck with your model and feel free to ask for clarifications (this stuff can be quite intimidating)

1 Like

Hi Martin, thank you so much for your thoughtful reply and code sketch.

For compatability with brms, I changed mu1 to mu and changed shift1, mu2 and shift2 to shiftone, mutwo and shifttwo respectively. This gives:

my_mix <- custom_family(
  "my_mix", dpars = c("mu", "shiftone", "mutwo", "shifttwo", "sigma", "theta"),
  links = c("identity", "log", "identity", "log", "log", "logit"), 
  lb = c(NA, 0, NA, 0, 0, 0),
  ub = c(NA, NA, NA, NA, NA, 1),
  type = "real", vars = "vreal1[n]"
)

stan_funs <- "
  real my_mix_lpdf(real y1, real mu, real shiftone, real mutwo, real shifttwo, real sigma, real theta, real y2) {
     real lp1 = log_mix(theta, normal_lpdf(y1 | mu, sigma), normal_lpdf(y1 | mu + shiftone, sigma));
     real lp2 = log_mix(theta, normal_lpdf(y2 | mutwo, sigma), normal_lpdf(y2 | mutwo + shifttwo, sigma));
     return lp1 + lp2;
  }
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

Running the final block,

fit <- brm(
  y1 | vreal(y2) ~ 1, data = dat,
  family = my_mix, stanvars = stanvars
)

gives an error saying: Error: Currently 'dirichlet' is the only valid prior for simplex parameters. Do you have a suggestion on how to add an appropriate dirichlet prior for this model?

1 Like

This is potentially a bug in brms, or just an undocumented piece of itnerface. The error message is triggered by the following condition: prior$class[i] %in% c("simo", "theta", "sbhaz").

I.e. brms thinks anything called theta has to be simplex. Renaming the mixing parameter resolves the issue for me.