Generated quantities for mixture models


#1

Hello,
I’ve been trying to fit some mixture models (Gaussian and Poisson), and I am unsure how to write the generated quantities block. I have seen some use cases here on Discourse and have tried to adapt the code in the Stan manual, but I have not had meaningful results.
I put the code for my Gaussian mixture model below; I would greatly appreciate some help with writing the “generated quantities” block.

mix_model <- "
data {
int<lower = 0> N;
vector[N] y;
}

parameters {
ordered[2] mu;
real<lower=0> sigma[2];
real<lower=0, upper=1> theta;
}

model {
sigma ~ normal(0, 2);
mu[1] ~ normal(70, 10);
mu[2] ~ normal(110,10);
theta ~ beta(5, 5);
for (n in 1:N){
target += log_mix(theta,
normal_lpdf(y[n] | mu[1], sigma[1]),
normal_lpdf(y[n] | mu[2], sigma[2]));
}
}
"
Please let me know if I can provide more information, data, etc. to make my question clearer.

Additionally, a question about the Stan manual (version 2.17.0):
on page 194, for a mixture model with K mixture components, the following code is given for the parameters block:

parameters {
  simplex[K] theta; // mixing proportions
  ordered mu[K]; // locations of mixture components
  vector<lower=0>[K] sigma; // scales of mixture components
}

I think the [K] should come right after “ordered”, because Stan threw an error with the code as written. Unless I am wrong, some clarification would help.
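In other words, I believe the declaration should look like the sketch below (the corrected parameters block as I understand it):

parameters {
  simplex[K] theta; // mixing proportions
  ordered[K] mu; // locations of mixture components
  vector<lower=0>[K] sigma; // scales of mixture components
}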


#2

Well, what is it that you want to compute, exactly? What quantities do you want? Posterior membership probabilities? Log likelihoods?


#3

Sorry, I missed that. I want to generate replicated data to check my model.


#4

In trying to understand this better and to generalize to K mixture components, I am trying to operationalize the model block on page 194 of the Stan manual.

When I use:

model {
  real log_theta[K] = log(theta); // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    real lps[K] = log_theta;
    for (k in 1:K)
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    target += log_sum_exp(lps);
  }
}
Stan gives me a base type mismatch error on the line marked “cache log calculation”. I then use for loops to calculate log_theta and lps, and this works, so my model code looks like this:

model {
  real log_theta[K];
  real lps[K];
  for (k in 1:K) {
    log_theta[k] = log(theta[k]); // cache log calculation
  }
  sigma ~ lognormal(0, 2);
  mu[1] ~ normal(70, 10);
  mu[2] ~ normal(110, 10);
  for (n in 1:N) {
    for (k in 1:K) {
      lps[k] = log_theta[k];
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
}

Any help understanding how to operationalize the original code in the manual would be much appreciated. I have attached my code (.R file with Stan code in it) and data.

data_long.csv (7.8 KB)

discourse_1.R (969 Bytes)


#5

Here is how I understand your desired model: you believe you have normally distributed outcomes Y, conditional on some unobserved discrete latent class taking on K levels (here, 2). Different observations are allowed to come from different underlying classes. The only information about class membership comes from the outcome itself, not from any covariates. You want to obtain posterior estimates of the population class membership probabilities theta as well as the means in those two classes, mu_1 and mu_2, with mu_1 < mu_2.

For the generated quantities, is it that you want to simulate N totally new observations based on what you have learned from your data set? In that case, for each new observation I would draw a component identifier comp taking on levels 1, …, K, then draw a new outcome ytilde from the distribution applicable to that component. Your new simulated data should have component proportions near your estimates of theta. Because you know the true class membership for these simulated observations, you can look at the outcome distribution conditional on component or overall.

I don’t know much about the actual estimation of these models, but the code below is how I would do it. There may be a more efficient way. Note that log_theta is declared as vector[K] in transformed parameters: log(theta) returns a vector, which is why assigning it to a real array (real log_theta[K]) gave you the base type mismatch. I also added a prior_only option in the data block (0 if you want the posterior predictive, 1 if you want the prior predictive). That option lets you examine the implications of your prior specification.

data {
  int<lower=1> K; // number of mixture components
  int<lower=1> N; // number of data points
  real y[N]; // observations
  int<lower=0,upper=1> prior_only; // =1 to get data from prior predictive
}
parameters {
  simplex[K] theta; // mixing proportions
  ordered[K] mu; // locations of mixture components
  vector<lower=0>[K] sigma; // scales of mixture components
}
transformed parameters {
  vector[K] log_theta = log(theta);
}
model {
  // contributions from each component
  // these get overwritten for each observation
  vector[K] lps; 
  
  // priors
  sigma ~ lognormal(0, 2);
  mu[1] ~ normal(70, 10);
  mu[2] ~ normal(110, 10);
  // mu ~ normal(80, 10); // alternative: a single common prior for both components
  
  // individual likelihoods, as sum of component contributions
  for (n in 1:N) {
    for (k in 1:K) {
      lps[k] = log_theta[k] + normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    // add the observation likelihood contribution if not prior-only
    if (prior_only == 0) {
      target += log_sum_exp(lps);  
    }
  }
}
generated quantities {
  // we will generate a component identifier "comp"
  // then draw "ytilde" using correct component mu and sigma
  int<lower=1,upper=K> comp[N];
  vector[N] ytilde;
  
  // can draw N new observations or as many as you want
  for (n in 1:N) {
    comp[n] = categorical_rng(theta);
    ytilde[n] = normal_rng(mu[comp[n]], sigma[comp[n]]);
  }
}
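
If instead you wanted posterior membership probabilities for the observed data points (the other quantity I mentioned above), here is a sketch of how that could look as an alternative generated quantities block, applying Bayes' rule to the component log densities (it reuses log_theta from transformed parameters):

generated quantities {
  // posterior probability that each observed y[n] came from component 1
  vector[N] prob_comp1;
  for (n in 1:N) {
    vector[K] lp;
    for (k in 1:K) {
      lp[k] = log_theta[k] + normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    // normalize on the log scale, then exponentiate
    prob_comp1[n] = exp(lp[1] - log_sum_exp(lp));
  }
}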


#6

This is awesome.
Thanks a lot.
There are still some problems with my own model specification, but this is great to start with.


#7

What can we do if we need to classify the observed data Y[N] based on theta? Say, if the value of theta is greater than 0.6 the observation goes in the second group, and if it is less than or equal to 0.6 it goes in the first group. Put another way, is it possible to get Y[n], mu[1], mu[2], and theta side by side? Example:

Y     mu[1]   theta
12    11.9    0.4
13    13.6    0.2
21    19.6    0.8
29    28.9    0.7
...

#8

@yab, I don’t understand your question. Can you give some more context for when this type of structure would arise?


#9

Based on the above code we can estimate mu[1] and mu[2] within the mixture of Gaussian and Poisson components, so we get the mean of the first and second group together with the mixing probability. But what if we want the probability and the estimated mean at the observation level, that is, each observation gets its own estimated theta, mu[1], and mu[2]? We would then say that each observation belongs to the first group with probability theta[1] and to the second group with probability theta[2]. Therefore, we should have two mu's and a theta of dimension N (with the second theta = 1 - theta). Once we have theta, we can categorize each observation based on this value (for example, below 0.6 we assign it to group 1). I have tried the code below, with some errors.

data {
  int<lower=0> N; // number of data points
  int<lower=1> y[N]; // observations
  matrix[N,4] x; // covariates
  real<lower=0> mi[N];
}

transformed data {
  real<lower=0> Ami;
  Ami = mean(mi);
}

parameters {
  real<lower=0> ci; // any positive number added to mu[1] to get mu[2]
  vector[4] beta;
  real alpha;
  vector<lower=0, upper=1>[N] pi; // mixing proportions; need to have pi's at observation level
}

transformed parameters {
  vector<lower=0>[N] mu[2]; // locations of mixture components; need mu's for each observation
  mu[1] = exp(alpha + x*beta);
  mu[2] = exp((alpha + x*beta) + ci);
}

model {
  // priors
  beta ~ normal(0, 1e1);
  alpha ~ normal(0, 1e1);
  ci ~ normal(0, 1e1);
  pi ~ beta(mi, Ami); // how can we change to a Dirichlet distribution using mi and mci?

  // likelihood
  for (i in 1:N)
    target += log_mix(pi,
                poisson_lpmf(y[i] | mu[1]) - log1m_exp(poisson_lpmf(0 | mu[1])),
                poisson_lpmf(y[i] | mu[2]) - log1m_exp(poisson_lpmf(0 | mu[2])));
}

When I run the above code, I get this error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
No matches for: 
  log_mix(vector, real, real)
Available argument signatures for log_mix:
  log_mix(real, real, real)
  log_mix(real[], real[])
  log_mix(real[], vector)
  log_mix(real[], row vector)
  log_mix(real[], vector[])
  log_mix(real[], row vector[])
  log_mix(vector, real[])
  log_mix(vector, vector)
  log_mix(vector, row vector)
  log_mix(vector, vector[])
  log_mix(vector, row vector[])
  log_mix(row vector, real[])
  log_mix(row vector, vector)
  log_mix(row vector, row vector)
  log_mix(row vector, vector[])
  log_mix(row vector, row vector[])

  error in 'modeld8755c72_ff542624806624524b7b715fc8665283' at line 56, column 82
  -------------------------------------------------
    54:       target += log_mix(pi,
    55:                   poisson_lpmf(y[i] | mu[1]) - log1m_exp(poisson_lpmf(0 | mu[1])),
    56:                   poisson_lpmf(y[i] | mu[2]) - log1m_exp(poisson_lpmf(0 | mu[2])));
                                                                                         ^
    57:   
  -------------------------------------------------
Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model 'ff542624806624524b7b715fc8665283' due to the above error.
In addition: Warning message:
In file.remove(tf) :
  cannot remove file 'C:\Users\521-04\AppData\Local\Temp\Rtmpc35uis\filed82a886c34', reason 'Permission denied'

I understand the problem here: log_mix expects a real mixing proportion and will not accept a vector. But I need the mixing proportion at the observation level, so how can I handle this?
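
Since log_mix(real, real, real) is among the available signatures, would indexing pi (and the mu vectors) per observation inside the loop be the right fix? A sketch of what I mean, assuming mu[1] and mu[2] should also be indexed by i:

  // likelihood with observation-level mixing proportion pi[i]
  for (i in 1:N)
    target += log_mix(pi[i],
                poisson_lpmf(y[i] | mu[1][i]) - log1m_exp(poisson_lpmf(0 | mu[1][i])),
                poisson_lpmf(y[i] | mu[2][i]) - log1m_exp(poisson_lpmf(0 | mu[2][i])));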

FYI: this was also posted under https://discourse.mc-stan.org/t/mixture-bayesian-poisson-regression-model/7663/25. When I change the mixing proportion to real<lower=0, upper=1> pi the model runs properly, but then I cannot get the mixing proportion at the observation level.