Generated quantities for mixture models

Hello,
I’ve been trying to fit some mixture models (Gaussian and Poisson), but I am unsure how to write the generated quantities block. I have looked at some use cases on Discourse and tried to adapt the code in the Stan manual, but have not had meaningful results.
I have put the code for my Gaussian mixture model below; I would greatly appreciate some help with writing the “generated quantities” block.

mix_model <- "
data {
  int<lower=0> N;
  vector[N] y;
}

parameters {
  ordered[2] mu;
  real<lower=0> sigma[2];
  real<lower=0, upper=1> theta;
}

model {
  sigma ~ normal(0, 2);
  mu[1] ~ normal(70, 10);
  mu[2] ~ normal(110, 10);
  theta ~ beta(5, 5);
  for (n in 1:N) {
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu[1], sigma[1]),
                      normal_lpdf(y[n] | mu[2], sigma[2]));
  }
}
"
Please let me know if I can provide more information/data etc to make my question clearer.

Additionally, a question about the Stan manual (version 2.17.0): on page 194, for a mixture model with K mixture components, the following code is given for the parameters block:

parameters {
  simplex[K] theta;          // mixing proportions
  ordered mu[K];             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
}

I think the [K] should come right after “ordered”, because Stan threw an error with the declaration as written. Unless I am wrong, some clarification would help.
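For reference, the declaration that compiles for me (assuming the intent is K ordered location parameters) is:

parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
}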

Well, what is it that you want to compute, exactly? What quantities do you want? Posterior membership probabilities? Log likelihoods?


Sorry, I missed that: I want to generate replicated data to check my model.

To understand this better and to generalize to K mixture components, I am trying to operationalize the model block on page 194 of the Stan manual.

When I use:
model {
  real log_theta[K] = log(theta); // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    real lps[K] = log_theta;
    for (k in 1:K)
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    target += log_sum_exp(lps);
  }
}
Stan gives me a base type mismatch error on the line with the “cache log calculation” comment. I then use for loops to calculate log_theta and lps, and this works, so my model code looks like this:

model {
  real log_theta[K];
  real lps[K];
  for (k in 1:K) {
    log_theta[k] = log(theta[k]); // cache log calculation
  }
  sigma ~ lognormal(0, 2);
  mu[1] ~ normal(70, 10);
  mu[2] ~ normal(110, 10);
  for (n in 1:N) {
    for (k in 1:K) {
      lps[k] = log_theta[k];
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
}
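Looking at the error again, I suspect the mismatch is because log() of a simplex returns a vector rather than a real array, so declaring log_theta and lps as vectors might make the manual's one-line version compile. A sketch of what I mean:

model {
  vector[K] log_theta = log(theta); // log() of a simplex returns a vector
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    vector[K] lps = log_theta;      // per-observation component contributions
    for (k in 1:K)
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    target += log_sum_exp(lps);
  }
}

I would appreciate confirmation that this is the right reading of the error.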

Any help understanding how to operationalize the original code in the manual would be much appreciated. I have attached my code (.R file with Stan code in it) and data.

data_long.csv (7.8 KB)

discourse_1.R (969 Bytes)

Here is how I understand your desired model: you believe you have normally distributed outcomes Y, conditional on some unobserved discrete latent class taking on K levels (here, 2). Different observations are allowed to come from different underlying classes. The only information about class membership comes from the outcome itself, not any covariates. You want to obtain posterior estimates of the population class membership probabilities theta as well as the means in those two classes, mu_1 and mu_2 with mu_1 < mu_2.
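In other words (assuming I have read your code correctly), each observation's marginal density is a weighted sum of the component densities,

p(y_n | theta, mu, sigma) = sum over k of theta_k * Normal(y_n | mu_k, sigma_k),

which is exactly what log_mix (for K = 2) and log_sum_exp compute on the log scale.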

For the generated quantities, is it that you want to simulate N totally new observations based on what you have learned from your data set? In that case, for each new observation I would draw a component identifier comp taking on levels 1, …, K, and then draw a new outcome ytilde from the distribution applicable to that component. Your new simulated data should have component proportions near your estimates of theta. Because you know the true class membership for these simulated observations, you can look at the outcome distribution either conditional on component or overall.

I don’t know much about the actual estimation of these models, but the code below is how I would do it. There may be a more efficient way. I added a prior_only option in the data block (0 if you want posterior predictive, 1 if you want prior predictive). That feature will let you examine the implications of your prior specification.

data {
  int<lower=1> K; // number of mixture components
  int<lower=1> N; // number of data points
  real y[N]; // observations
  int<lower=0,upper=1> prior_only; // =1 to get data from prior predictive
}
parameters {
  simplex[K] theta; // mixing proportions
  ordered[K] mu; // locations of mixture components
  vector<lower=0>[K] sigma; // scales of mixture components
}
transformed parameters {
  vector[K] log_theta = log(theta);
}
model {
  // contributions from each component
  // these get overwritten for each observation
  vector[K] lps; 
  
  // priors
  sigma ~ lognormal(0, 2);
  mu[1] ~ normal(70, 10);
  mu[2] ~ normal(110, 10);
  // alternatively, a single common prior for all components:
  // mu ~ normal(80, 10);
  
  // individual likelihoods, as sum of component contributions
  for (n in 1:N) {
    for (k in 1:K) {
      lps[k] = log_theta[k] + normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    // add the observation likelihood contribution if not prior-only
    if (prior_only == 0) {
      target += log_sum_exp(lps);  
    }
  }
}
generated quantities {
  // we will generate a component identifier "comp"
  // then draw "ytilde" using correct component mu and sigma
  int<lower=1,upper=K> comp[N];
  vector[N] ytilde;
  
  // can draw N new observations or as many as you want
  for (n in 1:N) {
    comp[n] = categorical_rng(theta);
    ytilde[n] = normal_rng(mu[comp[n]], sigma[comp[n]]);
  }
}


This is awesome.
Thanks a lot.
There are still some problems with my own model specification, but this is great to start with.

@yab, I don’t understand your question. Can you give some more context for when this type of structure would arise?