# Mixture Models for sets of observations

I have a conceptual question that I’m sure is a basic misunderstanding on my part, but I’ve been searching for the better part of a day now and haven’t found a resource that helps me understand my error. I’m a new Stan user, coming from JAGS/BUGS.

My long-term goal is to construct a model where a sequence of observations is generated by a mixture of two distinct processes. All observations in the sequence are assumed to come from one process or the other, unlike the examples in the manual where each observation could come from either process.

I started with a simple Gaussian mixture model, and I’m puzzled by the behavior that I’m seeing.

```stan
data {
  int<lower=0> N;
  vector[N] y;
}

parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> sigma;
}

model {
  target += log_mix(theta,
                    normal_lpdf(y | 1000, sigma),
                    normal_lpdf(y | 0, sigma));

  theta ~ beta(1, 1);
  sigma ~ normal(0, 10);
}
```

The data are

```r
N <- 100
y <- rnorm(N, 0, 1)
```

What I’d expect is that the posterior samples for `theta` would be mostly values close to 0, since this data set is far more likely under `normal_lpdf(y | 0, sigma)` than under `normal_lpdf(y | 1000, sigma)`. Instead I get only a weak preference for lower values.

I thought that adding more data (increasing `N`) should change the posterior estimates for `theta`, but the estimates seem invariant to those changes. They also seem mostly invariant to the distance between the two component means: changing the `1000` to `10` didn’t make a noticeable difference, and even `0.5` didn’t seem to have an effect.

Am I conceptually misunderstanding the effect that adding more data to the model should have? Or am I implementing this wrong?

Appreciate the help!

`theta` is the probability that an entire observation sequence comes from one mixture component rather than the other. You only have one sequence, so the estimate of `theta` is vague.

Think of it like this: before seeing the data you know nothing about `theta`, so it’s uniform. After the first data point you know `theta` is probably closer to zero than one, because that point is very close to the mean of the second process. But you already know the second data point comes from the same process, so seeing that it is also near that mean tells you nothing new, and so on for the whole sequence. Each independent sequence contributes only a single Bernoulli draw’s worth of information about `theta`.
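You can see this quantitatively. With a uniform prior, the marginal posterior for a single sequence is p(theta | y) ∝ theta·L1 + (1 − theta)·L0, where L1 and L0 are the whole-sequence likelihoods under each component. When L0 dominates, this reduces to p(theta | y) ∝ (1 − theta): a gentle linear tilt toward zero with posterior mean 1/3, however large `N` is. A grid sketch in Python (not Stan; the seed and grid resolution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=100)  # one sequence from the mean-0 process

def seq_loglik(y, mu, sigma=1.0):
    """Log-likelihood of the whole sequence under one Gaussian component."""
    return np.sum(-0.5 * ((y - mu) / sigma) ** 2
                  - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))

ll1 = seq_loglik(y, 1000.0)  # component with mean 1000
ll0 = seq_loglik(y, 0.0)     # component with mean 0

# p(theta | y) ∝ theta * exp(ll1) + (1 - theta) * exp(ll0);
# evaluate on the log scale to avoid underflow.
theta = np.linspace(1e-6, 1.0 - 1e-6, 10001)
log_post = np.logaddexp(np.log(theta) + ll1, np.log1p(-theta) + ll0)
post = np.exp(log_post - log_post.max())
post /= post.sum()

print((theta * post).sum())  # posterior mean ≈ 1/3, not ≈ 0
```

Even though `ll0` exceeds `ll1` by millions of nats, the posterior mean only moves from 1/2 to 1/3 — exactly the “weak preference for lower values” you observed.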
EDIT:
Or, the complete data generating process is

```r
N <- 100
theta <- rbeta(1, 1, 1)
m <- 1000 * rbinom(1, 1, theta)
y <- rnorm(N, m, 1)
```

Increasing `N` can help you estimate `m` but that still doesn’t tell you much about `theta`.
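As a quick check of that claim (a Python sketch, not from the thread; `sigma` is assumed known at its true value of 1, and the helper name is made up): computing the posterior mean of `theta` on a grid for increasingly long sequences leaves it pinned near 1/3.

```python
import numpy as np

def theta_posterior_mean(N, seed=0):
    """Posterior mean of theta (uniform prior) for ONE sequence of length N,
    evaluated on a grid; sigma is fixed at its true value of 1."""
    rng = np.random.default_rng(seed)
    y = rng.normal(0.0, 1.0, size=N)  # whole sequence from the mean-0 process
    ll = lambda mu: np.sum(-0.5 * (y - mu) ** 2 - 0.5 * np.log(2.0 * np.pi))
    theta = np.linspace(1e-6, 1.0 - 1e-6, 10001)
    # p(theta | y) ∝ theta * L1 + (1 - theta) * L0, on the log scale
    log_post = np.logaddexp(np.log(theta) + ll(1000.0),
                            np.log1p(-theta) + ll(0.0))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return (theta * post).sum()

for N in (10, 100, 1000):
    print(N, theta_posterior_mean(N))  # ≈ 1/3 for every N
```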

So what would be the correct model specification if I want `theta` to represent the posterior probability that *this* data sequence was generated by process A rather than process B, instead of the probability that any given sequence is generated by A or B?

Posterior probabilities are not parameters in the model. You calculate them using Bayes’ theorem.
See the mixture model section of the user guide.

The numbers needed for that are just the same likelihoods that go in `target +=`.
The normalization is easily handled by the `softmax` function.

```stan
model {
  target += log_mix(prior,
                    normal_lpdf(y | 1000, sigma),
                    normal_lpdf(y | 0, sigma));
  ...
}
generated quantities {
  vector[2] lp;         // softmax() takes a vector, not a real array
  vector[2] posterior;
  // log_mix(...) above equals log_sum_exp(lp) here,
  // but we need the individual components
  lp[1] = log(prior) + normal_lpdf(y | 1000, sigma);
  lp[2] = log1m(prior) + normal_lpdf(y | 0, sigma);
  posterior = softmax(lp);
}
```
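For intuition outside Stan, here is the same Bayes-theorem calculation in Python (a sketch; the function name `component_posterior` and its defaults are my own, not from the Stan program):

```python
import numpy as np

def component_posterior(y, prior, mus=(1000.0, 0.0), sigma=1.0):
    """Posterior probability that the WHOLE sequence y came from each
    component: softmax of (log prior weight + sequence log-likelihood)."""
    y = np.asarray(y)
    lp = np.empty(2)
    for k, mu in enumerate(mus):
        w = prior if k == 0 else 1.0 - prior  # prior weight of component k
        ll = np.sum(-0.5 * ((y - mu) / sigma) ** 2
                    - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
        lp[k] = np.log(w) + ll
    # softmax on the log scale: subtract the max first for stability
    lp -= lp.max()
    p = np.exp(lp)
    return p / p.sum()

rng = np.random.default_rng(1)
print(component_posterior(rng.normal(0.0, 1.0, 100), prior=0.5))
# → [0. 1.]  (all posterior mass on the mean-0 component)
```

Subtracting the max before exponentiating is the same trick `softmax`/`log_sum_exp` use internally; without it, both `exp(lp[k])` terms underflow to zero here.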