Estimating parameters of two normal distributions, mixed at varying proportions

Hi Stan Community,

I’m a new Stan user and first-time poster, so apologies if this is a very basic/stupid question.

I’m trying to specify a model to recover the parameters of some simulated data: I simulate data from 2 normal distributions, mixed together according to some proportion “lambda” (0 < lambda < 1) that varies by sample (the parameters of the normals themselves do not change).

I’m simulating the data with the following R code:

library(rstan)
set.seed(12345)

# the number of samples
N <- 2000

# create a simulated vector of proportions between 0 and 1
lambda <- (1:(N+1)) / (N+1)
lambda <- lambda[1:N]

# "real" values of the parameters we will try to recover.
mu1_real <- 0
mu2_real <- 10
sigma1_real <- 2
sigma2_real <- 3

# create simulated data: each y[i] is a weighted sum of one draw from each normal
y <- lambda * rnorm(N, mu1_real, sigma1_real) +
  (1 - lambda) * rnorm(N, mu2_real, sigma2_real)

stan.fit <- stan(file="estimate_normals.stan", 
                 data = list(lambda=lambda, 
                             y=y,
                             N=N
                 ), 
                 chains=3, 
                 iter=1000, 
                 init=0);
summary(stan.fit)$summary
a <- extract(stan.fit)

Currently, the model I’m building in Stan to recover the parameters of the two normal distributions looks like this:

data {
  int<lower=0> N;                   // number of cases
  vector[N] y;                      // outcome
  real<lower=0, upper=1> lambda[N]; // the proportions (0 < lambda < 1)
}
parameters {
  real mu[2];             // estimated means
  real<lower=0> sigma[2]; // estimated noise
}
model {
  // mixture likelihood with a known mixing proportion per observation
  for (i in 1:N) {
    target += log_mix(lambda[i],
                      normal_lpdf(y[i] | mu[1], sigma[1]),
                      normal_lpdf(y[i] | mu[2], sigma[2]));

    // the following commented line produces equivalent results...
    // target += log_sum_exp(log(lambda[i]) + normal_lpdf(y[i] | mu[1], sigma[1]),
    //                       log(1 - lambda[i]) + normal_lpdf(y[i] | mu[2], sigma[2]));
  }
}

I think what I’m trying to do here is a little weird: in most of the examples in the manual, the proportion lambda is a parameter, whereas in my case it is data.

The recovered parameters end up being quite a bit off from the simulated values, i.e.:

mu[1]        2.318119
mu[2]        7.621173
sigma[1]     2.054273
sigma[2]     2.500897

Does anyone have a quick fix for where I’ve gone wrong?

PS: I know that I could combine these 2 normal distributions into one normal and estimate these parameters that way, but I’m aiming to fit this type of mixture model with more complex distributions (which I don’t think could be easily combined) and I’m using the normal distribution as a toy example to start with…

Thanks so much!

Lambda should be of size 1 (e.g. 0.5), i.e. 50% of observations come from one distribution and 50% come from the other, unless I am missing something.

This is not a mixture of two normals but just a linear combination. The idea of a mixture is that each observation comes from one of the two distributions, not both. For example:

# draw a component indicator for each observation: 1 with probability lambda[i]
mixture <- rbinom(N, 1, lambda)
y <- rep(NA, N)
for (i in 1:N){
  if (mixture[i] == 0){
    y[i] <- rnorm(1, 0, 2)   # component 1: N(mu1_real, sigma1_real)
  } else {
    y[i] <- rnorm(1, 10, 3)  # component 2: N(mu2_real, sigma2_real)
  }
}
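Data simulated this way is a true mixture: each y[i] comes from exactly one of the two components. With lambda known, a log_mix likelihood like the one you wrote should then recover the parameters (just note that log_mix gives weight lambda[i] to the first component, while this simulation gives probability lambda[i] to the second).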

@stemangiola is right that it’s going to be difficult to estimate a model like that when you have a different lambda for each observation. You have N lambdas + 2 mus + 2 sds = N + 4 parameters for N observations.


Hi stemangiola and stijn,

Thanks for the helpful replies. Sooo it seems I’m quite off base in thinking this is a mixture model, when it’s actually a linear combination. Thanks for the insights!!

Do you know if there is any approach that can allow me to estimate these underlying distributions in such a linear combination? Or examples out there?

Am I right in thinking that this would need to be specified as a new “custom distribution” altogether? I think that if lambda is quite variable (which it is in the example), it should be possible to somehow estimate these parameters.

Sorry if this is such a basic question; I’m very new to this!

Yes, I was thinking @steph_ren was confusing a convolution with a mixture.

If you have a linear combination of, let’s say, 2 groups:

observation_A = number_1 * proportionA + number_2 * (1 - proportionA)
observation_B = number_1 * proportionB + number_2 * (1 - proportionB)
observation_C = number_1 * proportionC + number_2 * (1 - proportionC)
observation_D = number_1 * proportionD + number_2 * (1 - proportionD)
observation_E = number_1 * proportionE + number_2 * (1 - proportionE)
observation_F = number_1 * proportionF + number_2 * (1 - proportionF)
...

Basically, you infer values (a parameter of size 2), not distributions. The posterior of these two parameters will have an uncertainty associated with it; that will be your distribution.
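
To make that concrete, here is a minimal Stan sketch of the idea (the single residual noise term sigma is my own simplifying assumption, not something fixed by the problem):

data {
  int<lower=0> N;
  vector[N] y;
  vector<lower=0, upper=1>[N] lambda; // known proportions
}
parameters {
  vector[2] value;     // the two underlying values to infer
  real<lower=0> sigma; // residual noise around the combination
}
model {
  // each observation is a known-weight linear combination of the two values
  y ~ normal(lambda * value[1] + (1 - lambda) * value[2], sigma);
}

For the normal toy example specifically there is also a closed form: a weighted sum of two independent normals is itself normal, with mean lambda * mu1 + (1 - lambda) * mu2 and standard deviation sqrt(lambda^2 * sigma1^2 + (1 - lambda)^2 * sigma2^2), so in principle all four parameters can be written directly into a normal likelihood.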

Hi Stemangiola,

Thanks again for the reply. I understand what you are saying now: infer the values, not the distributions. These values should then follow the underlying distribution(s).

Maybe this is obvious, but unfortunately I’m still not entirely sure how I would go about inferring the values.

Thanks again!