Specifying a mixture model

Please also provide the following information in addition to your question:

  • Operating System: OSX 10.14.5
  • brms Version: 2.8.0

Hi,

I have data on the measurements of diameter (here in mm) of a skin test that look this this:

library(tidyverse)
library(brms)
#> Loading required package: Rcpp
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> Loading 'brms' package (version 2.8.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').

set.seed(1234)

test <- data.frame(result = c(rpois(500, 0.25), rpois(150, 10)))

test %>%
  ggplot() +
  geom_bar(aes(x=result))

Created on 2019-07-15 by the reprex package (v0.3.0)

There are two underlying processes to be estimated:

  1. Between 0 and around 4, most people probably don’t have the disease, although some will
  2. At 5 or above, most people probably do have the disease, although some people with a smaller result may in fact have disease

I have been reading around mixture models as implemented in brms, but have been struggling to specify correctly for my data. I have tried the following, with a message about not mixing families:

library(brms)
#> Loading required package: Rcpp
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> Loading 'brms' package (version 2.8.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
mix <- mixture(poisson, gaussian)
#> Error: Cannot mix families with real and integer support.

prior <- c(
  prior(poisson(0.25), Intercept, dpar = mu1),
  prior(gaussian(10, 4), Intercept, dpar = mu2)
)

fit <- brm(bf(result ~ 1), 
            data = test,
            family = mix, 
           prior = prior,
           chains = 1)
#> Error in validate_formula.brmsformula(formula, data = data, family = family, : object 'mix' not found

pp_check(fit)
#> Error in pp_check(fit): object 'fit' not found

Created on 2019-07-15 by the reprex package (v0.3.0)

Thanks.

1 Like

I’m not an expert and fairly new to using library(brms) myself. Have you tried making a mixture family that’s both Poisson’s? Your example data was generated using 2 Poissons, so since (in this example) you are privy to the underlying process it would make sense to do that.

Thanks so much - yes, I did try that too, but without success:

library(brms)
#> Loading required package: Rcpp
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> Loading 'brms' package (version 2.8.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
mix <- mixture(poisson, poisson)
#> Setting order = 'mu' for mixtures of the same family.

prior <- c(
  prior(poisson(0.25), Intercept, dpar = mu1),
  prior(poisson(10), Intercept, dpar = mu2)
)

fit <- brm(bf(result ~ 1), 
            data = test,
            family = mix, 
           prior = prior,
           chains = 1)
#> Error in isTRUE(attr(data, "brmsframe")): object 'test' not found

pp_check(fit)
#> Error in pp_check(fit): object 'fit' not found

Created on 2019-07-15 by the reprex package (v0.3.0)

Gives the error message:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

No matches for: 

  poisson_lpdf(real, real)

Function poisson_lpdf not found.
  error in 'modelb44f45f46c52_fileb44f6d3f6de' at line 29, column 51
  -------------------------------------------------
    27:   vector[N] mu2 = temp_mu2_Intercept + rep_vector(0, N);
    28:   // priors including all constants
    29:   target += poisson_lpdf(temp_mu1_Intercept | 0.25);
                                                          ^
    30:   target += poisson_lpdf(temp_mu2_Intercept | 10);
  -------------------------------------------------

Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname,  : 
  failed to parse Stan model 'fileb44f6d3f6de' due to the above error.
 SYNTAX ERROR, MESSAGE(S) FROM PARSER:

No matches for: 

  poisson_lpdf(real, real)

Function poisson_lpdf not found.
  error in 'modelb44f71642f97_fileb44f6d3f6de' at line 29, column 51
  -------------------------------------------------
    27:   vector[N] mu2 = temp_mu2_Intercept + rep_vector(0, N);
    28:   // priors including all constants
    29:   target += poisson_lpdf(temp_mu1_Intercept | 0.25);
                                                      ^
    30:   target += poisson_lpdf(temp_mu2_Intercept | 10);
 -------------------------------------------------

Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname,  : 
  failed to parse Stan model 'fileb44f6d3f6de' due to the above error.

This looks strange. Do you have a full reproducible example for me?

Not OP, but I was testing his code and have a reprex demonstrating his exact problem:

library(tidyverse)
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.9.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').

set.seed(1234)

test <- data.frame(result = c(rpois(500, 0.25), rpois(150, 10)))

mix <- mixture(poisson, poisson)
#> Setting order = 'mu' for mixtures of the same family.

prior <- c(
    prior(poisson(0.25), class = "Intercept", dpar = "mu1"),
    prior(poisson(10), class = "Intercept", dpar = "mu2")
)

fit.brm <- brm(bf(result ~ 1), 
           data = test,
           family = mix, 
           prior = prior,
           chains = 1)
#> SYNTAX ERROR, MESSAGE(S) FROM PARSER:
#> 
#> No matches for:
#> 
#>   poisson_lpdf(real, real)
#> 
#> Function poisson_lpdf not found.
#>   error in 'model744555c541_file74455ad5a546' at line 29, column 52
#>   -------------------------------------------------
#>     27:   vector[N] mu2 = temp_mu2_Intercept + rep_vector(0, N);
#>     28:   // priors including all constants
#>     29:   target += poisson_lpdf(temp_mu1_Intercept | 0.25);
#>                                                            ^
#>     30:   target += poisson_lpdf(temp_mu2_Intercept | 10);
#>   -------------------------------------------------
#> 
#> Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname,  : 
#>   failed to parse Stan model 'file74455ad5a546' due to the above error.
#> SYNTAX ERROR, MESSAGE(S) FROM PARSER:
#> No matches for:
#> 
#>   poisson_lpdf(real, real)
#> 
#> Function poisson_lpdf not found.
#>   error in 'model744533d0b466_file74455ad5a546' at line 29, column 52
#>   -------------------------------------------------
#>     27:   vector[N] mu2 = temp_mu2_Intercept + rep_vector(0, N);
#>     28:   // priors including all constants
#>     29:   target += poisson_lpdf(temp_mu1_Intercept | 0.25);
#>                                                            ^
#>     30:   target += poisson_lpdf(temp_mu2_Intercept | 10);
#>   -------------------------------------------------
#> 
#> Error in stanc(model_code = paste(program, collapse = "\n"), model_name = model_cppname, : failed to parse Stan model 'file74455ad5a546' due to the above error.

Created on 2019-07-15 by the reprex package (v0.3.0)

At the risk of stating the obvious, the Stan code should probably be poisson_lpmf instead of _lpdf. Hope this helps!

Ah, now that I look at this code, indeed, why did you specify a poison prior for the intercepts? This can’t possibly work as parameters in stan are continuous while the poisson is a discrete distribution.

1 Like

Okay - that makes sense. I guess I am still struggling to conceptualise how to specify priors to reflect my understanding of the data here. I think my difficultly is in ensuring that the first distribution doesn’t fall below zero, which wouldn’t be possible in the data:

library(brms)
#> Loading required package: Rcpp
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> Loading 'brms' package (version 2.8.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
test <- data.frame(result = c(rpois(500, 0.25), rpois(150, 10)))

mix <- mixture(gaussian, gaussian)
#> Setting order = 'mu' for mixtures of the same family.

prior <- c(
  prior(normal(0, 2), Intercept, dpar = mu1),
  prior(normal(0, 2), Intercept, dpar = mu2)
)

fit <- brm(bf(result ~ 1), 
            data = test,
            family = mix, 
           prior = prior,
           chains = 1)
#> Compiling the C++ model
#> Start sampling
#> 
#> SAMPLING FOR MODEL '7b517954ea2a96396e16f5dec78c77c2' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 0.000286 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.86 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 1.04822 seconds (Warm-up)
#> Chain 1:                1.2112 seconds (Sampling)
#> Chain 1:                2.25942 seconds (Total)
#> Chain 1:

summary(fit)
#>  Family: mixture(gaussian, gaussian) 
#>   Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity; theta1 = identity; theta2 = identity 
#> Formula: result ~ 1 
#>    Data: test (Number of observations: 650) 
#> Samples: 1 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 1000
#> 
#> Population-Level Effects: 
#>               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> mu1_Intercept     0.20      0.02     0.17     0.24       1479 1.00
#> mu2_Intercept     9.04      0.30     8.43     9.62       1197 1.00
#> 
#> Family Specific Parameters: 
#>        Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> sigma1     0.40      0.01     0.38     0.43       1241 1.00
#> sigma2     3.56      0.22     3.17     4.02        958 1.00
#> theta1     0.74      0.02     0.71     0.78       1201 1.00
#> theta2     0.26      0.02     0.22     0.29       1201 1.00
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
#> is a crude measure of effective sample size, and Rhat is the potential 
#> scale reduction factor on split chains (at convergence, Rhat = 1).
pp_check(fit)
#> Using 10 posterior samples for ppc type 'dens_overlay' by default.

Created on 2019-07-16 by the reprex package (v0.3.0)

I am struggling to work out how to specify this prior, and grateful for any suggestions.

Thanks

What you are looking for is not a prior but a likelihood (specified via the family argument). In this case, a mixture of two poisson distributions. Use mixture(poisson, poisson) instead of mixture(gaussian, gaussian).

Thanks - I have tried that, but get an error message:

Error: Family 'mixture(poisson, poisson)' requires response greater than or equal to 0.

But didn’t you create data according to a poisson distribution or is your actual data structured differently?

You are correct. In trying to get working, I had changed my test data to see if that works. With this specified as:

library(tidyverse)
library(brms)
#> Loading required package: Rcpp
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> Loading 'brms' package (version 2.8.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').

set.seed(1234)

test <- data.frame(result = c(rpois(500, 0.25), rpois(150, 10)))

test %>%
  ggplot() +
  geom_bar(aes(x=result))


mix <- mixture(poisson, poisson)
#> Setting order = 'mu' for mixtures of the same family.

prior <- c(
  prior(cauchy(0,5), Intercept, dpar = mu1),
  prior(normal(0,2), Intercept, dpar = mu2)
)

fit <- brm(bf(result ~ 1), 
            data = test,
            family = mix, 
           prior = prior,
           chains = 1)
#> Compiling the C++ model
#> Start sampling
#> 
#> SAMPLING FOR MODEL '46b36408f139ed6444538aceb75a3d47' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 0.000296 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.96 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 4.23044 seconds (Warm-up)
#> Chain 1:                3.74466 seconds (Sampling)
#> Chain 1:                7.9751 seconds (Total)
#> Chain 1:

summary(fit)
#>  Family: mixture(poisson, poisson) 
#>   Links: mu1 = log; mu2 = log; theta1 = identity; theta2 = identity 
#> Formula: result ~ 1 
#>    Data: test (Number of observations: 650) 
#> Samples: 1 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup samples = 1000
#> 
#> Population-Level Effects: 
#>               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> mu1_Intercept    -1.47      0.10    -1.66    -1.29        286 1.00
#> mu2_Intercept     2.26      0.03     2.21     2.31       1060 1.00
#> 
#> Family Specific Parameters: 
#>        Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
#> theta1     0.77      0.02     0.73     0.80        441 1.00
#> theta2     0.23      0.02     0.20     0.27        441 1.00
#> 
#> Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
#> is a crude measure of effective sample size, and Rhat is the potential 
#> scale reduction factor on split chains (at convergence, Rhat = 1).
pp_check(fit)
#> Using 10 posterior samples for ppc type 'dens_overlay' by default.

Created on 2019-07-16 by the reprex package (v0.3.0)

This now works. Thanks so much for your help - has really helped my understanding here. Much appreciated.