I have a conceptual question that I’m sure is a basic misunderstanding on my part, but I’ve been searching for the better part of a day now and haven’t found a resource that helps me understand my error. I’m a new Stan user, coming from JAGS/BUGS.
My long-term goal is a model in which an entire sequence of observations is generated by one of two distinct processes: all observations in the sequence come from one process or the other. This is unlike the mixture examples in the manual, where each observation is independently assigned to a component (for contrast, I've sketched that per-observation version after my model below).
I started with a simple Gaussian mixture model, and I’m puzzled by the behavior that I’m seeing.
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> theta;  // mixing weight on the mean-1000 component
  real<lower=0> sigma;           // shared scale for both components
}
model {
  // vectorized normal_lpdf sums over all N observations, so log_mix
  // mixes the joint likelihood of the whole sequence, not of each point
  target += log_mix(theta,
                    normal_lpdf(y | 1000, sigma),
                    normal_lpdf(y | 0, sigma));
  theta ~ beta(1, 1);
  sigma ~ normal(0, 10);  // half-normal, given the lower bound
}
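For contrast, my understanding of the manual-style per-observation mixture (which I'm explicitly not after) would look something like this, with each y[n] independently assigned to a component:

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> sigma;
}
model {
  theta ~ beta(1, 1);
  sigma ~ normal(0, 10);
  // per-observation mixing: each y[n] can come from either component
  for (n in 1:N)
    target += log_mix(theta,
                      normal_lpdf(y[n] | 1000, sigma),
                      normal_lpdf(y[n] | 0, sigma));
}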
The data are generated in R as

N <- 100
y <- rnorm(N, 0, 1)  # every draw comes from the mean-0 component
What I'd expect from this model is that the posterior samples for theta would be concentrated near 0, since this data set is far more likely under normal_lpdf(y | 0, sigma) than under normal_lpdf(y | 1000, sigma). Instead I get only a weak preference for lower values of theta.
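For concreteness, here's the back-of-the-envelope check behind that expectation (in R, assuming sigma = 1 for illustration):

# joint log likelihood of the whole sequence under each component
sum(dnorm(y, mean = 1000, sd = 1, log = TRUE))  # astronomically negative, on the order of -5e7
sum(dnorm(y, mean = 0,    sd = 1, log = TRUE))  # roughly -140 for N = 100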
I thought that varying N, i.e. adding more data, should sharpen the posterior for theta, but the estimates seem invariant to those changes. They also seem largely invariant to the separation between the two component means: changing the 1000 to 10 made no noticeable difference, and even setting it to 0.5 didn't seem to have an effect.
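In case it matters, this is roughly how I've been checking the invariance to N (a sketch, assuming the model above is saved as mixture.stan, a hypothetical filename, and fit with rstan):

library(rstan)

# refit for several sample sizes and compare the posterior mean of theta
for (N in c(10, 100, 1000)) {
  y <- rnorm(N, 0, 1)
  fit <- stan(file = "mixture.stan", data = list(N = N, y = y),
              chains = 4, iter = 2000)
  print(c(N = N, theta_mean = mean(extract(fit)$theta)))
}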
Am I conceptually misunderstanding the effect that adding more data to the model should have? Or am I implementing this wrong?
Appreciate the help!