Mixture Models for sets of observations

I have a conceptual question that I’m sure is a basic misunderstanding on my part, but I’ve been searching for the better part of a day now and haven’t found a resource that helps me understand my error. I’m a new Stan user, coming from JAGS/BUGS.

My long-term goal is to construct a model where a sequence of observations is generated by a mixture of two distinct processes. All observations in the sequence are assumed to come from one process or the other, unlike the examples in the manual where each observation could come from either process.

I started with a simple Gaussian mixture model, and I’m puzzled by the behavior that I’m seeing.

data {
  int<lower=0> N;
  vector[N] y;

parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> sigma;

model {
  target += log_mix(
    normal_lpdf(y | 1000, sigma),
    normal_lpdf(y | 0, sigma)
  theta ~ beta(1,1);
  sigma ~ normal(0,10);

The data are

N <- 100
y <- rnorm(N, 0, 1)

What I’d expect from the model is that the posterior samples for theta would be mostly values close to 0, since this set of data is far more likely under normal_lpdf(y | 0, sigma) than normal_lpdf(y | 1000, sigma). Instead I get a weak preference for lower values.

I thought that varying N, and adding more data, should change the posterior estimates for theta but the estimates seem to be invariant to those changes. The estimates also seem to be mostly invariant to the distance between the two means of the two generating Gaussians. Setting 1000 to 10 didn’t make a noticeable difference. Even setting it to 0.5 didn’t seem to have an effect.

Am I conceptually misunderstanding the effect that adding more data to the model should have? Or am I implementing this wrong?

Appreciate the help!

theta is the probability that an entire observation sequence comes from one mixture component rather than the other. You only have one sequence so estimate of theta is vague.
Think of it like this: Before seeing the data you don’t know anything about theta so it’s uniform. After seeing the first data point you know theta is probably closer to zero than one because the data point is very close to the mean of the second process. But you already know that the second data point is going to come from the same process so seeing that it also is near the same mean doesn’t tell you anything new, and so on for the whole sequence. You only get one sample from theta per independent sequence.
Or, the complete data generating process is

N <- 100
theta <- rbeta(1,1,1)
m = 1000*rbinom(1,1,theta)
y <- rnorm(N, m, 1)

Increasing N can help you estimate m but that still doesn’t tell you much about theta.

Thanks you @nhuurre, that’s super helpful for breaking my conceptual roadblock.

So what would be the correct model specification if I want theta to represent the posterior probability that this data sequence is generated by process A or process B, rather than the posterior probability that any given sequence will be generated by A or B?

Posterior probabilities are not parameters in the model. You calculate them using Bayes’ theorem.
See the mixture model section of the user guide.

The numbers needed for that are just the same likelihoods that go in target +=.
The normalization is easily handled by the softmax function.

model {
  target += log_mix(
    normal_lpdf(y | 1000, sigma),
    normal_lpdf(y | 0, sigma)
generated quantities {
  real lp[2]; 
  real posterior[2];
  // log_mix(...) above is equal to log_sum_exp(lp) here
  // but we need individual components
  lp[1] = log(prior) + normal_lpdf(y | 1000, sigma);
  lp[2] = log1m(prior) + normal_lpdf(y | 0, sigma);
  posterior = softmax(lp);