Hi Stan Community,
I’m a new Stan user, first time poster, so apologies if this is a very basic/stupid question.
I’m trying to specify a model to recover the parameters of some simulated data: I’m simulating data from 2 normal distributions, which are mixed together based on some proportion “lambda” (0<lambda<1), which varies by sample (the parameters of the normals do not change).
I’m simulating the data with the following R code:
library(rstan)
set.seed(12345)
# the number of smaples
N <- 2000
# create a simulated vector of proportions between 0 and 1
lambda <- (1:(N+1)) / (N+1)
lambda <- lambda[1:N]
# "real" values of the parameters we will try to recover.
mu1_real <- 0
mu2_real <- 10
sigma1_real <- 2
sigma2_real <- 3
# create simulated data, mixing values drawn from these two normal distributions
y <- (lambda * rnorm(N, mu1_real, sigma1_real)) + ((1-lambda) * rnorm(N, mu2_real, sigma2_real))
stan.fit <- stan(file="estimate_normals.stan",
data = list(lambda=lambda,
y=y,
N=N
),
chains=3,
iter=1000,
init=0);
summary(stan.fit)$summary
a <- extract(stan.fit)
Currently, the model I’m building in Stan to recover the parameters of the two normal distributions looks like this:
data
{
int<lower=0> N; // number of cases
vector[N] y; // outcome
real lambda[N]; // The proportions (0<lambda<1)
}
parameters
{
real mu[2]; // estimated means
real<lower=0> sigma[2]; // estimated noise
}
model
{
// create mixture model.
for(i in 1:N)
{
target += log_mix(lambda[i], normal_lpdf(y[i] | mu[1], sigma[1]), normal_lpdf(y[i] | mu[2], sigma[2]));
// the following commented line produces equivalent results...
//target += log_sum_exp(log(lambda[i]) + normal_lpdf(y[i] | mu[1], sigma[1]), log(1 - lambda[i]) + normal_lpdf(y[i] | mu[2], sigma[2]));
}
}
I think what I’m trying to do here is a little weird, as in most of the examples in the manual, the proportion lambda is a parameter, and in my case, it is data.
The recovered parameters end up being quite a bit off from the simulated values, i.e.:
mu[1] 2.318119
mu[2] 7.621173
sigma[1] 2.054273
sigma[2] 2.500897
I thought someone might have a quick fix for where I’ve gone wrong?
PS: I know that I could combine these 2 normal distributions into one normal and estimate these parameters that way, but I’m aiming to fit this type of mixture model with more complex distributions (which I don’t think could be easily combined) and I’m using the normal distribution as a toy example to start with…
Thanks so much!