Missing data in hierarchical linear model

Hi all,

I’m building a linear hierarchical model with missing data (see below). I modeled missing data for the dependent variable through indexing and combining the observed and missing components of the respective vector (following section 11.3 in the manual).

Did I implement this correctly? Would I model missing values for independent variables in the same way? Currently the model is not very fast, and more problematically, does not mix very well - any ideas on how to improve on this?

Any advice would be much appreciated!

int N; //number of level 1-observations
int M; //number of level 2-observations
int K; //number of fixed level-1 predictors
int R; //number of random level-1 predictors 
int U; //number of level 2 predictors
matrix[N,K] X_f; //level-1 variables, fixed effects
matrix[N,R] X_r; //level-1 variables, random effects; including column of ones for random intercept
matrix[M,U] X2; //level-2 variables, including column of ones for grand intercept 
int g[N]; //grouping variable
int<lower = 0> nmis_y; // number of missing y
int<lower = 1, upper = N> ii_obs[N-nmis_y]; // index, observed y
int<lower = 1, upper = N> ii_mis[nmis_y]; // index, missing y
real y_obs[N-nmis_y]; // observed y's
matrix[U,R] gamma; // level 2 predictors
corr_matrix[R] Omega; // correlation matrix
vector<lower=0>[R] sigma_l2; // variances, level 2
real<lower=0> sigma_l1; // variance, level 1
vector[K] beta_f; //level-1 predictors, fixed
vector[R] beta_r[M]; //level-1 predictors, random
real y_mis[nmis_y]; // missing y values
transformed parameters{
matrix[R,R] vcov_l2; // var-covariance matrix, level 2
real y[N]; // merged vector of observed and missing y
vector[N] xb_l1; // xbeta, level 1
vector[R] xb_l2[M]; // xbetas, level 2
// compute vcov values
vcov_l2 <- quad_form_diag(Omega, sigma_l2); 
// compute xbetas
for(n in 1:N) //level 1
  xb_l1[n] <- X_f[n]*beta_f+X_r[n]*beta_r[g[n]];
for (m in 1:M) //level 2
  xb_l2[m] <- (X2[m] * gamma)';
// merge observed and missing y
y[ii_obs] = y_obs;
y[ii_mis] = y_mis;
sigma_l1 ~ cauchy(0,1); 
sigma_l2 ~ cauchy(0,1); 
Omega ~ lkj_corr(R); 
to_vector(gamma) ~ normal(0, 10); 
beta_f ~ normal(0,10); 
beta_r ~ multi_normal(xb_l2, vcov_l2); 
y ~ normal(xb_l1, sigma_l1);

Indent your code—it’s much easier to read!

In terms of content, the approach you’re taking will work to inferring y[ii_mis], but it’s much more efficient given that the y[ii_mis] are conditionally independent here to just generate them in the generated quantities block using the RNG.

But the approach you take will generalize to missing predictors.

Bob, thank you!

Actually, would the code also be more efficient if I do not store the y values? Can this be done by commenting it out {} within the generated quantities block?

I can’t quite figure out how to implement your recommendation.

If I move y[ii_mis] = y_mis; to the generated quantities block I get the following error:

PARSER EXPECTED: <one of the following:
a variable declaration, beginning with type
(int, real, vector, row_vector, matrix, unit_vector,
simplex, ordered, positive_ordered,
corr_matrix, cov_matrix,
cholesky_corr, cholesky_cov
or a
or ‘}’ to close variable declarations and definitions>

If I also move the variable declaration, stan warns me that the model can’t be estimated because “y does not exist”.

Any ideas?

If you don’t store them, why generate them?

If you attach your whole program, we can help debug syntax. Otherwise, it’s too hard to guess what might be going on. This looks like a syntax error rather than a conceptual one.

Hi @bbecker , by any chance, do you remember what did you do with this? I’m working on a similar problem.