Recovering the class index in a mixture model using generated quantities

Dear Stan users,

I want to fit a two-component mixture model to a sample of antibody count data, assuming that individual antibody counts arise from a mixture of two subpopulations: one for individuals still susceptible to the infection, and one for individuals who are immune due to past infection or vaccination.

My code is attached below. My problem is how to generate, in the generated quantities block, the class index Z for each individual, which tells me to which mixture component each individual is assigned based on his or her antibody count.
I suspect that the current way of calculating Z, shown below, is not correct. My estimate of \theta, the vector of mixture weights, is (0.25, 0.75) for the susceptible and immune components, respectively. When I generate Z using the categorical_rng function, all individuals are assigned to component 2, since its mixture weight is greater than 0.5.

Any suggestions on how to correctly estimate Z?

Thanks in advance,
Emanuele

data {
  int<lower=0> K;        // number of mixture components
  int<lower=0> N;        // number of data points
  vector[N] logod;           // observations
  
}

parameters {
  simplex[K] theta;          // mixing weights of the components
  ordered[K] mu;             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
  
}

model {
  
  // Priors
  mu ~ normal(0, 10);
  sigma ~ normal(0,1);
  theta ~ dirichlet(rep_vector(1,K)); 
  
  // Log joint density f(data, parameters)
  for (n in 1:N){
    target += log_sum_exp(log(theta[1])
                          + normal_lpdf(logod[n] | mu[1], sigma[1]),
                          log(theta[2]) 
                          + normal_lpdf(logod[n] | mu[2], sigma[2]));
  }
  
}
generated quantities{
  vector[N] log_lik; // log-likelihood for calculation of LOO
  int<lower=1> Z[N];          // group index

  for (n in 1:N){
    for(k in 1:K){
      log_lik[n] = log(theta[k]) + normal_lpdf(logod[n] | mu[k], sigma[k]);      
    }
    Z[n] = categorical_rng(theta);
  }
  
}

That doesn’t sound right. categorical_rng does not just select the maximum probability category.
If \theta is (0.25,0.75) then one-in-four individuals are assigned to the first category.
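For instance, simulating draws with those weights (a quick numeric check in Python rather than Stan, with `random.choices` standing in for categorical_rng; seed and sample size are arbitrary):

```python
import random

random.seed(1)
theta = [0.25, 0.75]  # the mixture weights from the post
# random.choices plays the role of categorical_rng here
draws = random.choices([1, 2], weights=theta, k=100_000)
frac_cat1 = draws.count(1) / len(draws)
print(frac_cat1)  # close to 0.25, not 0
```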

Anyway, the mixture weight theta is essentially a prior on the category assignment, so categorical_rng(theta) samples the category of a new, unobserved individual.
You can recover the categories of the observed individuals by applying Bayes’ theorem and sampling from the posterior assignment probabilities. This is straightforward: the posterior probability of each component is its weighted likelihood, normalized so the probabilities sum to one.

generated quantities{
  vector[N] log_lik; // log-likelihood for calculation of LOO
  int<lower=1, upper=K> Z[N]; // group index

  for (n in 1:N) {
    vector[K] ll;
    for(k in 1:K) {
      ll[k] = log(theta[k]) + normal_lpdf(logod[n] | mu[k], sigma[k]);      
    }
    log_lik[n] = log_sum_exp(ll);
    Z[n] = categorical_rng(exp(ll - log_lik[n])); // normalized probabilities, sum to 1
  }
}
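If it helps, the same normalization can be checked numerically outside Stan. A small Python sketch of the computation inside the loop (the parameter values here are made up for illustration, not from a fitted model):

```python
import math

# Hypothetical values, for illustration only (not from a fitted model):
theta = [0.25, 0.75]   # mixture weights
mu = [0.0, 2.0]        # component means
sigma = [1.0, 1.0]     # component sds
y = 0.5                # one observation (log OD)

def normal_lpdf(x, m, s):
    # log density of Normal(m, s) at x
    return -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((x - m) / s) ** 2

# log of weighted likelihood for each component
ll = [math.log(theta[k]) + normal_lpdf(y, mu[k], sigma[k]) for k in range(2)]
log_lik = math.log(sum(math.exp(v) for v in ll))  # log_sum_exp
post = [math.exp(v - log_lik) for v in ll]        # posterior P(Z = k | y), sums to 1
print(post)
```

The vector `post` is what gets passed to categorical_rng in the generated quantities block above.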

Thank you very much for showing me how to adjust my code to correctly compute the group index of each observation, and for the explanation.