Mixture models: how to get P(data | cluster)


the answer could be as obvious as saving

neg_binomial_2_lpmf(… | …, …)

in the transformed parameters and use it into the model, so you would have a measure of the likelihood to of one data point of belonging to a cluster. Then I would soft_max.

Is this correct?

What about the generated quantities approach in the documentation. I didn;t really get it.

data {
int<lower = 0> N_tilde;
vector[N_tilde] y_tilde;
generated quantities {
vector[N_tilde] log_p_y_tilde;
for (n in 1:N_tilde)
= log_mix(lambda,
normal_lpdf(y_tilde[n] | mu[1], sigma[1])
normal_lpdf(y_tilde[n] | mu[2], sigma[2]));


I sort of get what you’re asking but could you be more specific about what you want and what your model is? People will waste their time trying to answer this in the current form.


my mixture model is

data {
	int<lower=1> G;
  int<lower=0> y[G]; // data. gene mupression (integer counts) 
  real num_zeros;

parameters {
real<lower=0> mu1; // prior location
real<lower=mu1> mu2;
real mu_zero[2];

real<lower=0> si[2];
real<lower=0, upper=1> lambda;
real<lower=0, upper=1> lambda_zero;

model {
target += log_mix(lambda_zero,bernoulli_logit_lpmf(0 | mu_zero[1]),
	                    bernoulli_logit_lpmf(0 | mu_zero[2])) * num_zeros;
target += log_mix(lambda_zero,bernoulli_logit_lpmf(1 | mu_zero[1]),
	                    bernoulli_logit_lpmf(1 | mu_zero[2])) * G;
for(g in 1:G){
	target += log_mix(lambda,
		neg_binomial_2_lpmf(y[g] | mu1, si[1]), //NON
	 	neg_binomial_2_lpmf(y[g] | mu2, si[2]) //EXP
		si ~ cauchy(0, 2.5);
		mu1 ~ gamma(2, 1);
		mu2 ~ gamma(5, 0.002);
		lambda ~ beta(2,2);
		lambda_zero ~ beta(2,2);
		mu_zero ~ normal(0,1);
generated quantities{
	int y_hat_1[G];
	int y_hat_2[G];
	for(g in 1:G) y_hat_1[g] = neg_binomial_2_rng(mu1, si[1]);
	for(g in 1:G) y_hat_2[g] = neg_binomial_2_rng(mu2, si[2]);

Now I would like to get the probability that y[g] belongs to cluster neg_binomial_2 1 and 2

It’s lambda. Your model assumes the mixture probability is homogenous over each draw, so it’s the same probability for each datum.

But a probability for y[1] | ckuster[1] for a datum that is at the centre of the cluster, is not higher than for a datum that is far away of such cluster?

wouldn’t it be log(lambda) + neg_binomial_2_lpmf(y[g] | mu1, si[1])

and then I can at least compare the log likelihoods between data?

@betanalpha’s still jetlagged. What he gave you is the prior probability before observing y. If we let z[g] in {1, 2} be the component from which y[g] is drawn, then conditioning on y[g], we have

once you’ve observed y, the probability that y[g] was drawn from component 1 is

p1 =  lambda * neg_binomial_2(y[g] | mu1, si[1])

p2 = (1 - lambda) * neg_binomial_2(y[g] | mu2, si[2])

Pr[z[g] = 1 | y[g], lambda, mu, si]  =  p1 / (p1 + p2)

To do this with arithmetic stability, you want to put everything on the log scale and use log1m(lambda) instead of log(1 - lambda), so it should look like this:

generated quantities {
  real<lower=0, upper=1> log_Pr[G];
  for (g in 1:G) {
    real lp1 = log(lambda) + neg_binomial_2_lpmf(y[g] | mu1, si[1]);
    real lp2 = log1m(lambda) + neg_binomial_2_lpmf(y[g] | mu2, si[2]);
    log_Pr[g] = lp1 - log_sum_exp(lp1, lp2);

Thanks Bob, and Betan


should I use generated quantities in a 2 steps model with parameters passed as data, or should I calculate generated quantities in the prediction model itself?


Until we figure out how to allow generated quantities to run after sampling, it all has to go in the same model. The generated quantities are very fast to execute as we don’t need derivatives or multiple steps per iteration.

As to the earlier question, if you want to evaluate the likelihood, that’s what you already coded up for the mixture—the likelihood of the data marginalizes out the latent discrete responsibility parameter.

So just to make sure,

I can save directly my neg_binomial_2_lpmf(y[g] | mu1, si[1]) into generated quantities? For later calculation of the cluster assignment.

In any case

Isn’t a bit more complicated because of the zero-inflation. I have lambda and lambda 0. My goal however is get the probability of cluster 2

neg_binomial_2_lpmf(y[g] | mu2, si[2])

vs. anything else (cluster 1 + zero inflation).

P.S. I didn’t get the sentence

The discrete parameter that determines the mixture component is often called the “responsibility”. It’s an index for which component you drew from.

You could do the same calculation for the zero cases, and they’ll all be the same so you only need to do it once.

Sorry to bother you, but is this plausible? P(cluster 2 | … )

	for (g in 1:G) {
		if(y[g]==0) {
			real lp0 = log(lambda_zero) + bernoulli_logit_lpmf(0 | mu_zero[1]);
			real lp1 = log(lambda_zero) + bernoulli_logit_lpmf(1 | mu_zero[2]) + log(lambda) + neg_binomial_2_lpmf(y[g] | mu1, si[1]);
	        real lp2 = log(lambda_zero) + bernoulli_logit_lpmf(1 | mu_zero[2]) + log1m(lambda) + neg_binomial_2_lpmf(y[g] | mu2, si[2]);
	        log_Pr[g] = lp2 - log_sum_exp(lp0, lp1, lp2);
            real lp0 = 0;
	        real lp1 = log1m(lambda_zero) + bernoulli_logit_lpmf(1 | mu_zero[2]) + log(lambda) + neg_binomial_2_lpmf(y[g] | mu1, si[1]);
	        real lp2 = log1m(lambda_zero) + bernoulli_logit_lpmf(1 | mu_zero[2]) + log1m(lambda) + neg_binomial_2_lpmf(y[g] | mu2, si[2]);
	        log_Pr[g] = lp2 - log_sum_exp(lp0, lp1, lp2);


Just one note,

real<lower=0, upper=1> log_Pr[G];

should probably read

real log_Pr[G];

since it is a log probability

You’re right—originally started with probabilities. You could add real<upper=0>, but it’s not necessary—it only does consistency checking.