Posterior predictive check for linear multilevel model: am I doing it right?

Dear all

I am new to Stan, and as a learning exercise I wanted to estimate a Bayesian multilevel linear model for the sleepstudy dataset (in the lme4 package). The model have subject-specific intercepts and slopes, assumed to have a normal distribution across subjects, with (possibly) non-zero correlation. The model seems to run correctly, and I get estimates that are sufficiently close to what I obtain with lme4.

I wanted to simulate the model to do a posterior predictive check (estimating p \left( y^{\text{rep}} \mid y \right)). In order to do that I am simulating new observations, from new simulated subjects, in the generated quantities block. Is that the best way to proceed? It seems to work correctly (see plot; the blue bands represents the 95% percentile intervals of the distribution of simulated y^{\text{rep}})
Moreover, the only way I found to generate the random intercepts and slopes is with a loop. Is that the only way, or there is a more efficient way to generate directly a matrix of normal random variates with given variance-covariance matrix?

Thanks!

data {
  int<lower=1> N;                  // number of observations
  real RT[N];                      // reaction time (transformed in seconds)
  int<lower=0,upper=9> Days[N];    // predictor
  int<lower=1> J;                  // number of subjects
  int<lower=1,upper=J> Subject[N]; // subject id
}

parameters {
  vector[2] beta;               // fixed-effects parameters
  real<lower=0> sigma_e;        // residual std
  vector<lower=0>[2] sigma_u;   // random effects standard deviations
  cholesky_factor_corr[2] L_u;  // L_u is the Choleski factor of the correlation matrix
  matrix[2,J] z_u;              // random effect matrix
}

transformed parameters {
  matrix[2,J] u;
  u = diag_pre_multiply(sigma_u, L_u) * z_u; // use Cholesky to set correlation
}

model {
  real mu; // conditional mean of the dependent variable

  //priors
  L_u ~ lkj_corr_cholesky(2);   // LKJ prior for the correlation matrix
  to_vector(z_u) ~ normal(0,1); // before Cholesky, random effects are normal variates with SD=1
  sigma_u ~ cauchy(0, 0.5);     // SD of random effects (vectorized)
  sigma_e ~ cauchy(0, 0.5);     // prior for residual standard deviation
  beta[1] ~ normal(0.3, 1);     // prior for fixed-effect intercept
  beta[2] ~ normal(0.01, 0.5);  // prior for fixed-effect slope

  //likelihood
  for (i in 1:N){
    mu = beta[1] + u[1,Subject[i]] + (beta[2] + u[2,Subject[i]])*Days[i];
    RT[i] ~ normal(mu, sigma_e);
  }
}

generated quantities {
  // simulate model for posterior predictive check
  real y_rep[N];
  
  matrix[2,J] u_rep; // simulate 'random effects'
    for (j in 1:J) {
    u_rep[1:2,j] = multi_normal_rng(rep_vector(0,2), diag_matrix(sigma_u) * (L_u*L_u') * diag_matrix(sigma_u));
  }
  
  for (i in 1:N){
    y_rep[i] = normal_rng(beta[1] + u_rep[1,Subject[i]] + (beta[2] + u_rep[2,Subject[i]])*Days[i], sigma_e);
  }
}

ppc_sleepstudy.pdf (15.0 KB)

Yes, although you don’t need to do this J times like you currently are, all your J 'new ’ individuals come from the same distribution (there is nothing in the loop over J that is specific to each individual).

Loops are typically your best bet in the generated quantities block, in this instance you could use the fact you have the Cholesky and use multi_normal_cholesky_rng, i.e.

u_rep = multi_normal_cholesky_rng(rep_vector(0, 2), diag_matrix(sigma_u) * (L_u));

although I can’t remember off the top of my head if cholesky_factor_corr is upper or lower triangular, you may need the transpose in either case. Also, you can use diag_pre_multiply,

u_rep = multi_normal_cholesky_rng(rep_vector(0, 2), diag_pre_multiply(sigma_u,  L_u));

To avoid calling diag_matrix.

1 Like

I might write the generated quantities block like this:

generated quantities {
  vector[N] y_rep;

  {
    vector [2] u_rep;
    vector [N] temp_mu;
    
    u_rep = multi_normal_cholesky_rng(rep_vector(0, 2), diag_pre_multiply(sigma_u,  L_u));
    temp_mu = (beta[1] + u_rep[1]) + (beta[2] + u_rep[2]) * Days;

    // Could also do the following using normal_rng in a loop. 
    // Cholesky of diag matrix is just the diag matrix.
    // Use the multi_normal_cholesky_rng call so stan doesn't actually compute
    // the Cholesky.
    y_rep = multi_normal_cholesky_rng(temp_mu, diag_matrix(rep_vector(sigma_e, N)));

  }

}

(not guaranteed to compile and do the correct thing, but I think it should)

1 Like

Thanks a lot for your input! In my code I was generating a new dataset, with the same number of subjects, at each sampling iteration (and was stuck with the loop because I couldn’t find how to use multi_normal_cholesky_rng to fill a matrix - it works only with vectors right?). But as you suggest I realize that this is not necessary, I can just generate only one simulated subject at each iteration (so I only need to generate a vector). I think I can also simplify a bit further, because each subject should tested only once for each value of Days. This resolves my concerns I think.

Also, I tried your code, it does not compile because there are some issues which I don’t understand with the line where temp_mu is computed: No matches for: real * int[] and No matches for: real + ill formed

Checking your data block again, Days is an array of integers, not a vector, which make sense given the data. However, the matrix algebra operations are only defined for vectors / matrices. So you can either loop over the elements of Days and compute the element of temp_mu one at a time, or you can change the type / declaration of Days in the data block to vector<lower=0,upper=9>[N] Days;

I see, thanks!