Model weight of mixture components

I have data on the time a system was active in each state (e.g., 0, 1, and 2). The aggregate time distribution is plotted in Fig A. My goal is to model the time distribution for each state and the weight of each component (Fig. B).

The data was generated as:

library('tidyverse')

y0 <- rlnorm(100, -1, 0.5)
s0 <- rep(0, 100)

y1 <- rlnorm(500, 2, 0.5)
s1 <- rep(1, 500)

y2 <- rlnorm(250, 1, 0.5)
s2 <- rep(2, 250)

d <- tibble(y = c(y0, y1, y2),
            s = c(s0, s1, s2)) %>%
  mutate(s = as.factor(s))

I used a log-normal regression to estimate the parameters of each state duration distribution as:

m <- brm(y ~ 0 + s, 
         data=d, 
         family = lognormal())

that was successful to retrieve the real parameters (s0 ~ -1, s1~2, s2~1):

 Family: lognormal 
  Links: mu = identity; sigma = identity 
Formula: y ~ 0 + s 
   Data: d (Number of observations: 850) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
s0    -0.99      0.05    -1.09    -0.89 1.00     4400     3006
s1     1.98      0.02     1.94     2.02 1.00     4886     3177
s2     1.00      0.03     0.94     1.05 1.00     4142     2551

Family Specific Parameters (in Fig. C each component is normalized as a PDF): 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.49      0.01     0.46     0.51 1.00     4618     3164

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Now, I would like to estimate the mix of the three components. In Fig. D I weighted each component by the proportion w of data for each state I have in the dataset.

w <- d %>%
  group_by(s) %>%
  summarise(count = n()) %>%
  mutate(tot = sum(count)) %>%
  mutate(w = count/tot)

#  s     count   tot     w
#  <fct> <int> <int> <dbl>
# 1 0       100   850 0.118
# 2 1       500   850 0.588
# 3 2       250   850 0.294

Is this a good way to get the mix weight? How could I instead estimate this parameter from the data I have? I was thinking I could use a binomial regression like:

brm(count | trials(tot) ~ 0 + s, data = w,
          family = binomial(), file = 'mix.rds')

but I am not sure this is the way to go. Any ideas?

1 Like

I don’t think I understand exactly what you are trying to achieve in the end. Does the real data you are trying to analyze only contain three rows, with the total times for each state? Or does it contain just durations but the state is unknown? Or something else? The approach would IMHO differ noticeably between those options.

Maybe what you are trying to achieve is a mixture model which can be written with mixture (Finite Mixture Families in brms — mixture • brms)?

Best of luck with your model!

Thanks for the input! I am trying to achieve something like a mixture model, with the important detail that I already know what data point belongs to each component (Figure B). I do not think you can assign a point to a component when fitting mixture models, am I correct? So, what I am trying to do is 1) to fit each component independently and get the probability distribution for each state (Figure C), and 2) figure out the mixing weight of each state distribution (Figure D). I think I got the first part right, but I struggle to understand how to estimate the weight. What I tried—but I think is the wrong way—is to estimate the weight as a binomial regression based on the number of data points for each state in the dataset.

So if I understand you correctly, your problem could actually be stated as 4 completely separate inference problems:

  1. Estimate the proportion of data points that go into each component - since you have 3 components, this is IMHO well handled directly fitting an intercept-only multinomial model on a single-row dataset with just the total counts of data points in each component (see Example with family "multinomial" for a simple example)

  2. For each component separately fit the parameters of the lognormal.

If this is right, i.e. if you see no way how those separate problems could share information, you can just fit 4 separate models (your current solution fits 3 completely independent means, but one shared sigma - this might or might not be desirable).

You are correct, mixture models are designed for the case where you don’t know which component the data point belongs to.

1 Like

Lovely, this makes sense. Thanks!