Hi,

I’ve been trying to understand whether a finite mixture model will help my project, so I began with Paul Bürkner’s helpful but brief vignette on mixture modelling in brms (Finite Mixture Families in brms — mixture • brms).

However, my project is a clustering problem, so I’m particularly interested in obtaining the probability that each data point arose under each of the mixture components.

In this particular case, I’m modelling changes in people’s test scores, where test scores are collected repeatedly over time and I’m interested in identifying the people whose trajectory improves.

A reproducible example is shown below:

```
library(tidyverse)
library(brms)

# Create a data set from two different linear eqns (clusters):
# y ~ -1 + 2*x
# y ~ 1 - 2*x
#
# Persons are identified by w, and each cluster is a set of unique persons
set.seed(1) # for reproducibility

x <- rnorm(100)
d1 <- data.frame(
  y = rnorm(100, -1 + 2 * x), # mu_I = -1, mu_x = +2
  x = x,
  w = sample(LETTERS[1:10], 100, replace = TRUE) # persons A to J
)

x <- rnorm(100)
d2 <- data.frame(
  y = rnorm(100, 1 - 2 * x), # mu_I = +1, mu_x = -2
  x = x,
  w = sample(LETTERS[11:20], 100, replace = TRUE) # persons K to T
)

dat <- bind_rows(d1, d2)

# Set up the model
f <- bf(y ~ 1 + x + (1 | w))
mix <- mixture(gaussian, gaussian)

# Informed priors to help sampling in this example:
prior <- c(
  prior(normal(-1, 3), Intercept, dpar = mu1),
  prior(normal(1, 3), Intercept, dpar = mu2),
  prior(normal(2, 3), b, dpar = mu1),
  prior(normal(-2, 3), b, dpar = mu2)
)

fit <- brm(
  formula = f,
  data = dat,
  family = mix,
  prior = prior,
  chains = 2,
  seed = 1
)
summary(fit)
summary(fit)
# Family: mixture(gaussian, gaussian)
# Links: mu1 = identity; sigma1 = identity; mu2 = identity; sigma2 = identity;
# theta1 = identity; theta2 = identity
# Formula: y ~ 1 + x + (1 | w)
# Data: dat (Number of observations: 200)
# Draws: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
# total post-warmup draws = 2000
#
# Group-Level Effects:
# ~w (Number of levels: 20)
# Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sd(mu1_Intercept) 0.22 0.18 0.01 0.66 1.00 638 861
# sd(mu2_Intercept) 0.61 0.35 0.06 1.44 1.02 189 363
#
# Population-Level Effects:
# Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# mu1_Intercept -1.17 0.15 -1.46 -0.87 1.00 763 980
# mu2_Intercept 0.92 0.24 0.36 1.35 1.00 454 208
# mu1_x 1.94 0.12 1.69 2.18 1.00 1443 758
# mu2_x -1.94 0.14 -2.19 -1.64 1.00 995 766
#
# Family Specific Parameters:
# Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma1 0.89 0.09 0.71 1.08 1.00 903 1108
# sigma2 1.06 0.10 0.88 1.27 1.00 1060 1184
# theta1 0.47 0.05 0.37 0.57 1.00 839 646
# theta2 0.53 0.05 0.43 0.63 1.00 839 646
#
# Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
# and Tail_ESS are effective sample size measures, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).
```

Is there a way to identify which mixture component each person (`w`) came from? I hope this question makes sense. I’ve read comments in the Stan User’s Guide which make me think it is possible (5.5 Inferences supported by mixtures | Stan User’s Guide), and I’m hoping something similar can be done with a post-estimation procedure in brms…
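In case it helps clarify what I’m after, here is a sketch of the kind of post-estimation step I have in mind. I believe brms provides `pp_mixture()`, which (if I understand its documentation) returns per-observation posterior probabilities of membership in each mixture component; averaging those within each person `w` would then give a per-person probability. Please treat the exact return shape (an observations × summary-statistics × components array when `summary = TRUE`) as my assumption:

```r
# Sketch only -- assumes `fit` and `dat` from the example above, and that
# pp_mixture() returns an N x E x K array (N observations, E summary
# statistics such as "Estimate", K mixture components).
probs <- pp_mixture(fit)

# Posterior mean probability that each observation came from component 1
p1 <- probs[, "Estimate", 1]

# Average within persons to get a per-person membership probability
person_probs <- aggregate(p1, by = list(w = dat$w), FUN = mean)

# Persons with mean P(component 1) > 0.5 would be assigned to cluster 1
person_probs[person_probs$x > 0.5, "w"]
```

Is something along these lines the right approach, or is there a more principled way to aggregate observation-level membership probabilities to the person level when the grouping factor `(1 | w)` is in the model?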

Thanks,

Rich