Hi all
Apologies if this is more of a generic modelling question than a Stan question, but I couldn’t find much information elsewhere.
I want to estimate the correlations among several variables, effectively analogous to what is described here: http://www.sumsar.net/blog/2013/08/bayesian-estimation-of-correlation/. I’m certainly aware that there are easier ways to do this, but this is a much-simplified example of what I eventually want to do.
To illustrate my question, consider the following model that I have fit using RStan 2.9 (I know, I’m slow to adopt), where I know the mean and SD:
load(url('https://dl.dropboxusercontent.com/s/j05l2gvgwy6chil/STANdata.Rdata'))
cormodel <- "
data {
  int<lower=0> N;
  matrix[18, N] y;
  vector<lower=0>[N] sdd;
  vector[N] muu;
}
parameters {
  cholesky_factor_corr[N] LOmega;
}
model {
  LOmega ~ lkj_corr_cholesky(2);
  for (i in 1:18)
    y[i] ~ multi_normal_cholesky(muu, diag_pre_multiply(sdd, LOmega));
}
generated quantities {
  matrix[N, N] Omega;
  Omega <- multiply_lower_tri_self_transpose(LOmega);
}
"
fit_model <- stan(model_code = cormodel, data = STANdata, pars = 'Omega', chains = 3, iter = 1000, warmup = 500, thin = 4, refresh = 100, cores = 3, open_progress = TRUE)
My problem is that this model consistently underestimates the correlations in the data. For example, median(extract(fit_model, 'Omega[5,6]')[[1]]) will often be about 0.85 when cor(STANdata$y)[5,6] = 0.96.

The reason I say this might be a modelling rather than a Stan question is that I get very good (effectively spot-on) estimates of the correlation when I reduce the data to a bivariate matrix, as in the blog post above. But as I add variables, all of the correlation coefficients shrink and tend towards zero (the problem is of course not limited to the one correlation mentioned above). I have tried playing with the priors etc., but that doesn’t seem to make a difference. I suspect there is a rather obvious explanation that I’m missing.
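To be concrete about the check I’m doing: the comparison against the empirical correlation can be sketched outside Stan like this (a minimal, hypothetical example using simulated data rather than my actual STANdata, with 18 observations to mirror the matrix above):

```python
import numpy as np

# Hypothetical stand-in for my data: 18 draws from a bivariate normal
# with a known true correlation of 0.96, matching the value I see in
# cor(STANdata$y)[5,6].
rng = np.random.default_rng(42)
true_corr = 0.96
cov = np.array([[1.0, true_corr],
                [true_corr, 1.0]])
y = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=18)

# This is the empirical quantity I compare the posterior median of
# Omega[5,6] against (here it is the single off-diagonal entry).
sample_corr = np.corrcoef(y, rowvar=False)[0, 1]
print(round(sample_corr, 2))
```

With only 18 observations the sample correlation wanders a bit around the true value, but nowhere near the drop from 0.96 to 0.85 that I see in the posterior median.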
Thanks in advance for any insights you can offer!
[edit: code markdown]