Bayesian Mixture Model

Hi everyone, I am trying to run a Bayesian mixture model to find latent subgroups in my longitudinal panel data (150,000 observations). I am using the model below:

data {
  int<lower=0> N;  // Number of observations
  int<lower=1> K;  // Number of latent subgroups
  int<lower=1> patient_id[N];  // Patient IDs
  vector[N] age_months;  // Age in months
  int<lower=0,upper=1> allergy[N];  // Allergy status (0 or 1)
}

parameters {
  real alpha[K];  // Intercept for each subgroup
  real beta[K];  // Slope for age for each subgroup
  real<lower=0> sigma[K];  // Standard deviation for each subgroup
  simplex[K] theta;  // Mixing proportions
  simplex[K] z[N];  // Latent subgroup membership probabilities
}

model {
  // Priors
  for (k in 1:K) {
    alpha[k] ~ normal(0, 5);
    beta[k] ~ normal(0, 1);
    sigma[k] ~ cauchy(0, 2);
  }
  theta ~ dirichlet(rep_vector(1.0, K));
  
  // Likelihood
  for (n in 1:N) {
    real ps[K];
    for (k in 1:K) {
      ps[k] = log(theta[k]) + normal_lpdf(allergy[n] | alpha[k] + beta[k] * age_months[n], sigma[k]);
    }
    target += log_sum_exp(ps);
  }
}

And I use the following R code with 'rstan' to run the model:

stan_data <- list(
  N = nrow(my_data),  # Number of observations
  K = 3, #latent subgroups
  patient_id = my_data$patient_id_numeric,
  age_months = my_data$age_months_formatted,
  allergy = my_data$peanut_allergy
)



fit <- stan(file = 'C:/Folder/stan_model.stan', data = stan_data, iter = 1000, chains = 4)

However, it takes forever. Whatever I tried, I wasn't able to get a result. It estimates around 5 hours to complete the task, but I waited for over 12 hours with no result. I wonder if I am doing something wrong. I would appreciate any advice.

Welcome to the Stan Discourse.

I bet it is because you're treating the observation-level membership probabilities (z) as parameters but then never using them, so you have N × K parameters that don't influence the likelihood. I would just remove that line. You can always calculate those values in the generated quantities block.
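For example, something along these lines in generated quantities would recover the membership probabilities from your existing likelihood (a sketch, reusing your current declarations):

generated quantities {
  simplex[K] z[N];  // posterior membership probabilities, computed rather than sampled
  for (n in 1:N) {
    vector[K] lps;
    for (k in 1:K) {
      lps[k] = log(theta[k]) + normal_lpdf(allergy[n] | alpha[k] + beta[k] * age_months[n], sigma[k]);
    }
    z[n] = softmax(lps);  // normalize the K log weights into probabilities
  }
}

Just be aware that saving N × K quantities per draw makes for a very large output object with 150,000 observations, so you may prefer to compute these in R after fitting instead.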

I would also recommend making the lower bound for sigma some arbitrary amount above zero. These sorts of mixture models can converge towards infinity as one cluster narrows in on a single point and the standard deviation converges to 0.

Finally, Stan moved to a new array syntax that you’ll want to use to avoid future issues. Instead of real alpha[K]; you would write array[K] real alpha;
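Putting the last two suggestions together, the parameters block might look something like this (the 0.001 bound is just an arbitrary small value, not a recommendation):

parameters {
  array[K] real alpha;                // intercept for each subgroup (new array syntax)
  array[K] real beta;                 // slope for age for each subgroup
  array[K] real<lower=0.001> sigma;   // lower bound nudged above zero
  simplex[K] theta;                   // mixing proportions
  // z removed; recover it in generated quantities if needed
}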

Hope that helps. Good luck!


To @simonbrauer's useful answer, I'll just add a note about where you write that it "estimates around 5 hours to complete the task". I assume you're referring to rstan's message that 1000 transitions using 10 leapfrog steps per transition would take XXX seconds. Note that this is not an estimate of how long the task will take, but a heuristic that lets you guess at how long it might take. In reality the number of leapfrog steps per transition is likely to be much higher than 10, perhaps by an order of magnitude or more.


In addition to the long sampling time, you may also face a non-identifiability problem (Gelman calls it "aliasing"), which can manifest as a posterior that appears converged yet is widely dispersed (very wide credible intervals).
More specifically, in mixture models this is the label-switching problem. Simply put, the likelihood is a weighted sum of the likelihoods of the latent groups, so exchanging the group labels does not change it. As K grows, the number of equivalent posterior modes grows as K!.
One solution is to impose an ordering restriction on the labels, for example requiring the mixing proportions (simplex[K] theta in your case) to be monotonically increasing or decreasing. Of course such restrictions have obvious limitations. Other solutions include post hoc relabelling.
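One way to express such a restriction in Stan is to place the ordering on the component intercepts rather than on theta (an ordered simplex is awkward to declare directly); a sketch:

parameters {
  ordered[K] alpha;                   // increasing intercepts pin the labels down
  array[K] real beta;
  array[K] real<lower=0.001> sigma;
  simplex[K] theta;
}

The rest of the model can stay the same, since alpha[k] is still indexed the same way.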


I didn't notice your sample size initially. I often work with similar models on closer to 1,500 cases, which can take 30 minutes or more depending on the situation, so a run time scaled up by a factor of 100 is not so surprising.

A few more options to consider. You can compress your data by passing only one row for each unique combination of age and allergy and including a weight that records how many rows were collapsed into it. This gets less helpful as you add more predictors, but it would make a huge difference here: even if every age from 0 to 100 years were represented, you'd only have 2,400 rows.
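For example, assuming my_data has the columns used in your stan_data list, something like this with dplyr would build the compressed data set (the compressed and weight names are just illustrative):

library(dplyr)

# One row per unique (age, allergy) cell, plus a count of how many
# original rows were collapsed into it.
compressed <- my_data %>%
  count(age_months_formatted, peanut_allergy, name = "weight")

stan_data <- list(
  N          = nrow(compressed),
  K          = 3,
  age_months = compressed$age_months_formatted,
  allergy    = compressed$peanut_allergy,
  weight     = compressed$weight
)

On the Stan side you would then declare the weight (e.g. vector[N] weight; in the data block) and multiply each row's contribution by it, i.e. target += weight[n] * log_sum_exp(ps);, which gives the same log density as the uncompressed data.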

Another possibility would be to write a separate vectorized log-likelihood function that reuses calculations as much as possible.

Note how you're calculating log(theta[k]) N × K times when you could just pre-calculate it K times. You also have several pieces of the normal likelihood that depend only on sigma; you could pre-calculate those ahead of time too. I don't know how much this would speed things up, but it might make a difference.
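For instance, just hoisting log(theta) out of the loop would look like this (a sketch of the model block only):

model {
  vector[K] log_theta = log(theta);  // K logs computed once per evaluation, instead of N × K times

  // ... priors as in the original model ...

  for (n in 1:N) {
    vector[K] ps;
    for (k in 1:K) {
      ps[k] = log_theta[k] + normal_lpdf(allergy[n] | alpha[k] + beta[k] * age_months[n], sigma[k]);
    }
    target += log_sum_exp(ps);
  }
}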

Thank you for your suggestions. My dataset is a longitudinal panel at the individual level. Each row already represents a unique combination of patient_id, monthly_age, and allergy_status. So, for around 1000 individuals and 150 months, I have around 150,000 observations.

Sorry for the confusion; just to clarify: the model as it is currently written doesn't depend on patient_id, so as far as this model is concerned the unique cells are defined by age_months and allergy. If you end up adding something like person-level intercepts, that will no longer be the case.

Thank you, I see it now. I think it makes sense to compress the data the way you described. It indeed makes a huge difference.