Help obtaining the probability each data point arose under each mixture component

Hi,

I’ve been trying to understand whether a finite mixture model will help my project, so I began with Paul Bürkner’s helpful but brief vignette on mixture modelling in brms (Finite Mixture Families in brms — mixture • brms).

However my project is a clustering problem, so I’m particularly interested in obtaining the probability each data point arose under the different mixture components.

In this particular case, I’m modelling changes in people’s test scores, where scores are collected repeatedly over time, and I’m interested in identifying the people whose trajectories improve.

A reproducible example is shown below:

library(tidyverse)
library(brms)

# create a data set from two different linear eqns (clusters):  
# y ~ -1 + 2*x
# y ~ 1 - 2*x
# 
# Persons are identified by w, and each cluster is a set of unique persons 

set.seed(1) # for replicability
x = rnorm(100)
data.frame(
  y = rnorm(100, -1 + 2*x), # mu_I = -1, mu_x = +2
  x = x,
  w = sample(LETTERS[1:10], 100, replace = TRUE) # persons A to J
) -> d1

x = rnorm(100)
data.frame(
  y = rnorm(100, 1 - 2*x), # mu_I = +1, mu_x = -2
  x = x,
  w = sample(LETTERS[11:20], 100, replace = TRUE) # persons K to T
) -> d2

dat <- bind_rows(d1, d2) 

# Set up the model
f <- bf(y ~ 1 + x + (1|w))
mix <- mixture(gaussian, gaussian)

# Informed priors to help sampling in this example:
prior <- c(
  prior(normal(-1, 3), Intercept, dpar = mu1),
  prior(normal(1, 3), Intercept, dpar = mu2),
  prior(normal(2, 3), b, dpar = mu1),
  prior(normal(-2, 3), b, dpar = mu2)
)


fit <- brm(formula = f, 
           data = dat,
           family = mix,
           prior = prior,
           chains = 2,
           seed = 1)

summary(fit)

#  Family: mixture(gaussian, gaussian) 
#   Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity; 
# theta1 = identity; theta2 = identity 
# Formula: y ~ 1 + x + (1 | w) 
#    Data: dat (Number of observations: 200) 
#   Draws: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
#          total post-warmup draws = 2000
# 
# Group-Level Effects: 
#   ~w (Number of levels: 20) 
#                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sd(mu1_Intercept)     0.22      0.18     0.01     0.66 1.00      638      861
# sd(mu2_Intercept)     0.61      0.35     0.06     1.44 1.02      189      363
# 
# Population-Level Effects: 
#               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# mu1_Intercept    -1.17      0.15    -1.46    -0.87 1.00      763      980
# mu2_Intercept     0.92      0.24     0.36     1.35 1.00      454      208
# mu1_x             1.94      0.12     1.69     2.18 1.00     1443      758
# mu2_x            -1.94      0.14    -2.19    -1.64 1.00      995      766
# 
# Family Specific Parameters: 
#        Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma1     0.89      0.09     0.71     1.08 1.00      903     1108
# sigma2     1.06      0.10     0.88     1.27 1.00     1060     1184
# theta1     0.47      0.05     0.37     0.57 1.00      839      646
# theta2     0.53      0.05     0.43     0.63 1.00      839      646
# 
# Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
# and Tail_ESS are effective sample size measures, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).

Is there a way to identify which mixture component each person (w) came from? I hope this question makes sense. I’ve read comments in the Stan User’s Guide that make me think it is possible (5.5 Inferences supported by mixtures | Stan User’s Guide), and I’m hoping something similar can be done as a post-estimation step with brms…

Thanks,

Rich


Sorry about the delay in getting to your questions. I saw somewhere a nice bit of code for doing just this but I totally did not bookmark the site. I’ll keep digging around. Hopefully someone else here might have a faster answer.
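While I keep looking, one thing that may be worth trying: if I remember right, brms ships a post-processing function `pp_mixture()` that returns the posterior probability that each observation was generated by each mixture component. A sketch of how I’d use it on your fitted model (untested against your exact fit, and the per-person aggregation at the end is my own rough heuristic, not anything built into brms):

```r
library(brms)
library(dplyr)

# pp_mixture() returns (with the default summary = TRUE) an
# N x E x K array: N observations, E summary statistics
# (Estimate, Est.Error, credible-interval bounds), K components.
pp <- pp_mixture(fit)

# Posterior mean P(component k | y_i) for each observation i:
head(pp[, "Estimate", ])

# A rough-and-ready per-person summary: average the observation-level
# probabilities for component 1 within each person w.
dat %>%
  mutate(p1 = pp[, "Estimate", 1]) %>%
  group_by(w) %>%
  summarise(p_component1 = mean(p1)) %>%
  arrange(desc(p_component1))
```

One caveat on the last step: averaging per-observation probabilities is only a heuristic. A more principled per-person membership probability would combine the component likelihoods across all of a person’s observations within each posterior draw (along the lines of the Stan User’s Guide section you linked), but the average is often a reasonable first look.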
