Stan overestimating covariance in multivariate normal mixture model

I am new to Stan, and would appreciate any insights about why the simulation and model below do not work as expected. I have two bivariate normal random variables. I have some observations of just the first variable and some observations of the sum of the first and second variable. I would like to infer the covariance structure of the second variable and have endeavored to do so using a mixture model where the mixing ratio is known, i.e. for any observation I know whether it is just the 1st variable or if it is the sum of both variables.

I simulated data with known correlations between each pair of variables. Stan is getting the correlation for the first (directly observed) variable exactly right at 0.4. However, for the correlation of the second (mixed in) variable it is overestimating the correlation at 0.27 whereas I simulated 0.20. I am unsure if perhaps I made a beginner’s mistake in my model or perhaps have a fundamental misunderstanding of what correlation I should be getting for the simulated data.

Here is the R code to simulate the data:

library(MASS) # needed for mvrnorm

# Parameters for multivariate normal data to simualte.
N = 20000 # number of samples
mu = c(0, 0, 0, 0) # means
s = c(1, 1, 1, 1) # standard deviations
rho = matrix(c(
               1  , 0.4, 0  , 0  ,
               0.4, 1  , 0  , 0  ,
               0  , 0  , 1  , 0.2,
               0  , 0  , 0.2, 1   ), 4) # correlation matrix
Sigma = diag(s) %*% rho %*% diag(s) # variance-covariance matrix

# Simulate from multivariate normal distribution
x = mvrnorm(N, mu = mu, Sigma = Sigma)

# Reasemble simulated data into y matrix of baseline and mixed (condition) data
# Mix only the second half of the observations
y = x[,1:2] + rbind(matrix(0, N/2, 2), x[(N/2+1):N,3:4])
cond = c(rep(0,N/2),rep(1,N/2))

# The J variable for stan is the number of columns in Y
J = dim(y)[2] # equals 2

# Dump data out to disk
stan_rdump(c('N', 'J', 'y', 'cond'), file="")

And the stan code:

data {
  int<lower=1> N;                // number of observations
  int<lower=1> J;                // dimension of observations
  vector[J] y[N];                // observations
  int<lower=0, upper=1> cond[N]; // binary condition

parameters {
  // Error scales
  cholesky_factor_corr[J] L_corr;
  cholesky_factor_corr[J] L_corr_cond;
  vector<lower=0>[J] sigma;
  vector<lower=0>[J] sigma_cond;

  // Means
  vector[J] beta;
  vector[J] beta_cond;

model {
  // Cholesky factor variance-covariance matrices.
  matrix[J, J] L_Sigma;
  matrix[J, J] L_Sigma_cond;
  L_Sigma = diag_pre_multiply(sigma, L_corr);
  L_Sigma_cond = diag_pre_multiply(sigma_cond, L_corr_cond);

  // Priors
  sigma ~ cauchy(0, 5);
  sigma_cond ~ cauchy(0, 5);
  L_corr ~ lkj_corr_cholesky(1);
  L_corr_cond ~ lkj_corr_cholesky(1);
  beta ~ normal(0,10);
  beta_cond ~ normal(0,10);

  // Mixed sampling distribution/likelihood of the observations.
  for (i in 1:N) {
    if (cond[i])
      target += log_sum_exp(log(0.5) + multi_normal_cholesky_lpdf(y[i] | beta, L_Sigma),
                            log(0.5) + multi_normal_cholesky_lpdf(y[i] | beta_cond, L_Sigma_cond));
      target +=  multi_normal_cholesky_lpdf(y[i] | beta, L_Sigma);

generated quantities {
  matrix [J, J] Omega; // Correlation matrix
  matrix [J, J] Omega_cond;
  matrix [J, J] Sigma; // Variance Covariance matrix
  matrix [J, J] Sigma_cond;
  Omega = multiply_lower_tri_self_transpose(L_corr);
  Omega_cond = multiply_lower_tri_self_transpose(L_corr_cond);
  Sigma = quad_form_diag(Omega, sigma);
  Sigma_cond = quad_form_diag(Omega_cond, sigma_cond);


Just a thought : have you looked at the shape of the posterior distribution of the correlation of interest? If the posterior is skewed toward highest values or close to be multimodal, I could easily imagine the mean being at 0.27 while the mode actually being around 0.20!

All the best!

I’m probably missing something, but isn’t

target += log_sum_exp(log(0.5) + multi_normal_cholesky_lpdf(y[i] | beta, L_Sigma),
                            log(0.5) + multi_normal_cholesky_lpdf(y[i] | beta_cond, L_Sigma_cond));

the log density of a mixture of normal distributions, whereas you want the density of a sum of normal distribution? (it called when cond[i] = 1, i.e when y[i] has been sampled from the sum).

1 Like

@ldeschamps, that is a good thought but Omega_cond[1,2] really is distributed around 0.27. See included histogram.


@LeoGrin, could you elaborate a little bit more about the difference between a mixture of normal distributions with equal proportions vs a sum of normal distributions? It sounds like I have made a rookie mistake in specifying my model’s likelihood. What kind of stan code are you thinking of for a sum of normals likelihood?

Edit: @LeoGrin is right. My observation is the sum of two gaussian random variables. The mixture model assumes the observation is one variable or the other with a certain mixing probability – but never the sum. Obviously, then, my model is going to give me unexpected results. Thank you for helping me reason through this!

I am still having some trouble understanding how to model a sum of Gaussians, for which I have started a separate thread.