Estimating parameters of two normal distributions, mixed at varying proportions

#1

Hi Stan Community,

I’m a new Stan user, first time poster, so apologies if this is a very basic/stupid question.

I’m trying to specify a model to recover the parameters of some simulated data: I’m simulating data from 2 normal distributions, which are mixed together in some proportion “lambda” (0 < lambda < 1) that varies by sample (the parameters of the normals themselves do not change).

I’m simulating the data with the following R code:

library(rstan)
set.seed(12345)

# the number of samples
N <- 2000

# create a simulated vector of proportions between 0 and 1
lambda <- (1:N) / (N + 1)

# "real" values of the parameters we will try to recover.
mu1_real <- 0
mu2_real <- 10
sigma1_real <- 2
sigma2_real <- 3

# create simulated data, mixing values drawn from these two normal distributions
y <- (lambda * rnorm(N, mu1_real, sigma1_real)) + ((1-lambda) * rnorm(N, mu2_real, sigma2_real))

stan.fit <- stan(file="estimate_normals.stan", 
                 data = list(lambda=lambda, 
                             y=y,
                             N=N
                 ), 
                 chains=3, 
                 iter=1000, 
                 init=0);
summary(stan.fit)$summary
a <- extract(stan.fit)

Currently, the model I’m building in Stan to recover the parameters of the two normal distributions looks like this:

data 
{
  int<lower=0> N; // number of cases
  vector[N] y; // outcome
  real<lower=0, upper=1> lambda[N]; // the proportions (0 < lambda < 1)
}
parameters 
{
  real mu[2]; // estimated means
  real<lower=0> sigma[2]; // estimated noise
}
model 
{
  // create mixture model.
  for(i in 1:N)
  {
    target += log_mix(lambda[i], normal_lpdf(y[i] | mu[1], sigma[1]), normal_lpdf(y[i] | mu[2], sigma[2]));
    
    // the following commented line produces equivalent results...
    //target += log_sum_exp(log(lambda[i]) + normal_lpdf(y[i] | mu[1], sigma[1]), log(1 - lambda[i]) + normal_lpdf(y[i] | mu[2], sigma[2]));
  }
}

I think what I’m trying to do here is a little unusual: in most of the examples in the manual, the proportion lambda is a parameter, whereas in my case it is data.

The recovered parameters end up being quite a bit off from the simulated values, i.e.:

mu[1]        2.318119
mu[2]        7.621173
sigma[1]     2.054273
sigma[2]     2.500897

Does anyone have a quick fix for where I’ve gone wrong?

PS: I know that I could combine these 2 normal distributions into one normal and estimate these parameters that way, but I’m aiming to fit this type of mixture model with more complex distributions (which I don’t think could be easily combined) and I’m using the normal distribution as a toy example to start with…

Thanks so much!

#2

Lambda should be of size 1 (e.g. 0.5, meaning 50% of observations come from one distribution and 50% come from the other), unless I am missing something.

#3

This is not a mixture of two normals but just a linear combination. The idea of a mixture is that each observation comes from one of the two distributions, not from both.

# simulate a true mixture: each observation comes from exactly one component
mixture <- rbinom(N, 1, lambda)  # per-observation component indicator
y <- rep(NA, N)
for (i in 1:N){
  if (mixture[i] == 0){
    y[i] <- rnorm(1, 0, 2)   # draw from the first component, N(0, 2)
  } else {
    y[i] <- rnorm(1, 10, 3)  # draw from the second component, N(10, 3)
  }
}

@stemangiola is right that it’s going to be difficult to estimate a model like that when you have a different lambda for each observation. You have N lambdas + 2 mus + 2 sds = N + 4 parameters for N observations.

#4

Hi stemangiola and stijn,

Thanks for the helpful replies. Sooo it seems I’m quite off base in thinking this is a mixture model, when it’s actually a linear combination. Thanks for the insights!!

Do you know of any approach that would allow me to estimate the underlying distributions in such a linear combination? Or any examples out there?

Am I right in thinking that this would perhaps need to be specified as a new “custom distribution” altogether? I think that if lambda is quite variable (which in this example it is), it should be possible to somehow estimate these parameters.

Sorry if this is such a basic question; I’m very new to this!

#5

Yes, I was thinking @steph_ren was confusing convolution with mixing.

If you have a linear combination of, let’s say, 2 groups:

observation_A = number_1 * proportionA + number_2 * (1 - proportionA)
observation_B = number_1 * proportionB + number_2 * (1 - proportionB)
observation_C = number_1 * proportionC + number_2 * (1 - proportionC)
observation_D = number_1 * proportionD + number_2 * (1 - proportionD)
observation_E = number_1 * proportionE + number_2 * (1 - proportionE)
observation_F = number_1 * proportionF + number_2 * (1 - proportionF)
...

Basically, you infer the two values (a parameter of size 2), not distributions. The posterior of those two parameters will have an uncertainty associated with it; that will be your distribution.
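
To make that concrete, here is a minimal sketch of the kind of Stan model I mean (untested, my own illustration: it takes the per-observation lambda vector from your original post as data, calls the two inferred values “number”, and adds a residual noise term sigma that is not in your original setup):

data {
  int<lower=0> N;
  vector[N] y;
  vector<lower=0, upper=1>[N] lambda; // known proportions (data, not parameters)
}
parameters {
  vector[2] number;    // the two underlying values to infer
  real<lower=0> sigma; // residual noise around the combination
}
model {
  // each observation is a weighted combination of the two inferred values,
  // observed with noise
  y ~ normal(lambda * number[1] + (1 - lambda) * number[2], sigma);
}

The posteriors of number[1] and number[2] then carry the uncertainty I describe above. (Note that in your simulation the noise itself depends on lambda: a weighted sum of two independent normals has sd sqrt(lambda^2 * sigma1^2 + (1 - lambda)^2 * sigma2^2), so a constant sigma is only an approximation.)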

#7

Hi Stemangiola,

Thanks again for the reply. I understand what you are saying now: infer the values, not the distributions. These values should then follow the underlying distribution(s).

Maybe this is obvious, but unfortunately I’m still not entirely sure how I would go about inferring the values?

Thanks again!