I am new to Stan and would appreciate any insights into why the simulation and model below do not work as expected. I have two bivariate normal random variables. Some of my observations are of just the first variable, and the rest are of the sum of the first and second variables. I would like to infer the covariance structure of the second variable, and I have tried to do so with a mixture model where the mixing ratio is known, i.e. for every observation I know whether it is just the first variable or the sum of both.
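In symbols, the setup I have in mind is roughly the following (the names x, z, mu_1, mu_2, Sigma_1, Sigma_2 are just my shorthand):
x_i ~ N(mu_1, Sigma_1) and z_i ~ N(mu_2, Sigma_2), independently,
y_i = x_i         if cond_i = 0,
y_i = x_i + z_i   if cond_i = 1.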
I simulated data with known correlations between each pair of variables. Stan recovers the correlation for the first (directly observed) variable exactly, at 0.4. For the correlation of the second (mixed-in) variable, however, it estimates 0.27 whereas I simulated 0.20. I am not sure whether I have made a beginner's mistake in the model or whether I have a fundamental misunderstanding of what correlation I should expect to recover from the simulated data.
Here is the R code to simulate the data:
library(MASS) # needed for mvrnorm
library(rstan)
# Parameters for the multivariate normal data to simulate.
N = 20000 # number of samples
mu = c(0, 0, 0, 0) # means
s = c(1, 1, 1, 1) # standard deviations
rho = matrix(c(
  1  , 0.4, 0  , 0  ,
  0.4, 1  , 0  , 0  ,
  0  , 0  , 1  , 0.2,
  0  , 0  , 0.2, 1  ), 4) # correlation matrix
Sigma = diag(s) %*% rho %*% diag(s) # variance-covariance matrix
# Simulate from multivariate normal distribution
x = mvrnorm(N, mu = mu, Sigma = Sigma)
# Reassemble the simulated data into a y matrix of baseline and mixed (condition) data
# Mix only the second half of the observations
y = x[,1:2] + rbind(matrix(0, N/2, 2), x[(N/2+1):N,3:4])
cond = c(rep(0,N/2),rep(1,N/2))
# The J variable for stan is the number of columns in Y
J = dim(y)[2] # equals 2
# Dump data out to disk
stan_rdump(c('N', 'J', 'y', 'cond'), file="mixed.data.R")
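As a quick sanity check (just a sketch using the objects defined above), the empirical correlations in each half of the simulated data can be inspected for comparison with what Stan reports:
# Empirical correlation among the baseline (unmixed) observations
cor(y[cond == 0, ])
# Empirical correlation among the mixed observations (first + second variable)
cor(y[cond == 1, ])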
And the Stan code:
data {
  int<lower=1> N;                // number of observations
  int<lower=1> J;                // dimension of observations
  vector[J] y[N];                // observations
  int<lower=0, upper=1> cond[N]; // binary condition
}
parameters {
  // Error scales
  cholesky_factor_corr[J] L_corr;
  cholesky_factor_corr[J] L_corr_cond;
  vector<lower=0>[J] sigma;
  vector<lower=0>[J] sigma_cond;
  // Means
  vector[J] beta;
  vector[J] beta_cond;
}
model {
  // Cholesky factors of the variance-covariance matrices.
  matrix[J, J] L_Sigma;
  matrix[J, J] L_Sigma_cond;
  L_Sigma = diag_pre_multiply(sigma, L_corr);
  L_Sigma_cond = diag_pre_multiply(sigma_cond, L_corr_cond);
  // Priors
  sigma ~ cauchy(0, 5);
  sigma_cond ~ cauchy(0, 5);
  L_corr ~ lkj_corr_cholesky(1);
  L_corr_cond ~ lkj_corr_cholesky(1);
  beta ~ normal(0, 10);
  beta_cond ~ normal(0, 10);
  // Mixed sampling distribution/likelihood of the observations.
  for (i in 1:N) {
    if (cond[i])
      target += log_sum_exp(log(0.5) + multi_normal_cholesky_lpdf(y[i] | beta, L_Sigma),
                            log(0.5) + multi_normal_cholesky_lpdf(y[i] | beta_cond, L_Sigma_cond));
    else
      target += multi_normal_cholesky_lpdf(y[i] | beta, L_Sigma);
  }
}
generated quantities {
  matrix[J, J] Omega;      // Correlation matrix
  matrix[J, J] Omega_cond;
  matrix[J, J] Sigma;      // Variance-covariance matrix
  matrix[J, J] Sigma_cond;
  Omega = multiply_lower_tri_self_transpose(L_corr);
  Omega_cond = multiply_lower_tri_self_transpose(L_corr_cond);
  Sigma = quad_form_diag(Omega, sigma);
  Sigma_cond = quad_form_diag(Omega_cond, sigma_cond);
}
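For reference, fitting proceeds roughly like this with rstan (mixed.stan is just a placeholder name for the model file above):
library(rstan)
standata <- read_rdump("mixed.data.R")  # data dumped by the simulation script
fit <- stan(file = "mixed.stan", data = standata, chains = 4, iter = 2000)
print(fit, pars = c("Omega", "Omega_cond"))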