I am anticipating that my question exposes some basic ignorance about how mcmc works, but here we go:
In an attempt to deal with missing data I am trying to simultaneously estimate a regression model as well as the co-variance of the variables in the regression model and some auxiliary variables.
To be specific: Assume matrix X has the columns y,e,m,el, the last entry being the product of variables e and l. Column y has missing values.
Now I want to estimate the effect of e on y^i and to deal with missing data I also estimate the co-variance matrix of X^i, where y^i and X^i are vector/matrix with imputed values.
A simple stan model looks like this:
data {
int N;
int K;
vector[N] y;
vector[N] m;
vector[N] e;
vector[N] el;
int nmiss;
int missidx[nmiss];
}
parameters {
real intercept;
real b;
real<lower=0> sigma;
corr_matrix[K] Omega;
vector[K] mu;
vector<lower=0>[K] tau;
vector[nmiss] yimp;
}
transformed parameters {
cov_matrix[K] Sigma = quad_form_diag(Omega,tau);
}
model {
// omitting priors to keep this shorter
vector[N] yl = y;
yl[missidx] = yimp;
target += normal_lpdf(yl | intercept + e*b,sigma);
for (n in 1:N) {
vector[K] vec = to_vector({yl[n],e[n],el[n],m[n]});
target += multi_normal_lpdf(vec | mu, Sigma);
}
}
When I try to estimate this model, I can’t simultaneously estimate the regression weight for e and the covariance matrix. There is some “interference” (for the lack of a better term) which makes that the regression weight for e and the variance of e in the variance-covariance matrix for X are underestimated.
I can “hack around” this problem by multiplying multi_normal_lpdf(vec | mu, Sigma)
with a large number. This works intuitively because it gives precedence to the estimation of the co-variation between all relevant variables and thus estimates the effect of e on y based on imputed data with the right covariance.
However, I assume others must have thought deeper about this and would be glad about any pointer or comment.
Related post with an implementation in brms here.