Understanding multilevel modelling Stan code

Hello Everyone,

I am trying to understand the Stan code that the brms R package generates for a mixed-effects model.

Here is the Stan code:

// generated with brms 2.20.4
functions {
  /* compute correlated group-level effects
   * Args:
   *   z: matrix of unscaled group-level effects
   *   SD: vector of standard deviation parameters
   *   L: cholesky factor correlation matrix
   * Returns:
   *   matrix of scaled group-level effects
   */
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
}

data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int<lower=1> Kc;  // number of population-level effects after centering

  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}

transformed data {
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}

parameters {
  vector[Kc] b;  // regression coefficients
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
}

transformed parameters {
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1;
  vector[N_1] r_1_2;
  real lprior = 0;  // prior contributions to the log posterior

  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd_1, L_1);
  r_1_1 = r_1[, 1];
  r_1_2 = r_1[, 2];
  lprior += student_t_lpdf(Intercept | 3, 288.7, 59.3);
  lprior += student_t_lpdf(sigma | 3, 0, 59.3) - 1 * student_t_lccdf(0 | 3, 0, 59.3);
  lprior += student_t_lpdf(sd_1 | 3, 0, 59.3) - 2 * student_t_lccdf(0 | 3, 0, 59.3);
  lprior += lkj_corr_cholesky_lpdf(L_1 | 1);
}

model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    }
    target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}

generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}

I would really appreciate it if anyone could explain the code below and provide a reference if you have one.

/* compute correlated group-level effects
 * Args:
 *   z: matrix of unscaled group-level effects
 *   SD: vector of standard deviation parameters
 *   L: cholesky factor correlation matrix
 * Returns:
 *   matrix of scaled group-level effects
 */
matrix scale_r_cor(matrix z, vector SD, matrix L) {
  // r is stored in another dimension order than z
  return transpose(diag_pre_multiply(SD, L) * z);
}

Many thanks for your help and precious time.

Marimuthu

The following expression

transpose(diag_pre_multiply(SD, L) * z)

The top-level story is that we’re working with covariance matrices parameterized by the Cholesky factor of their correlation matrix and a vector of standard deviations. The expression above converts standard normal variates z into variates whose covariance has Cholesky factor diag_matrix(SD) * L = diag_pre_multiply(SD, L). When you diagonally pre-multiply a Cholesky factor for correlation by a diagonal matrix of scales, you get the Cholesky factor for covariance. When you multiply that factor by the standard normals z and transpose, you get a variable y = transpose(diag_pre_multiply(SD, L) * z) whose rows are distributed as multi_normal(0, (diag_matrix(SD) * L) * (diag_matrix(SD) * L)'). That is, y has the covariance defined by the Cholesky factor.
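
As a quick numerical check, here is a minimal R sketch (my own illustration, not part of the brms output; the names just mirror the Stan program) showing that scaling standard normals this way produces the intended covariance:

# Sketch: diag(SD) %*% L is the Cholesky factor of the covariance implied by
# standard deviations SD and the correlation matrix L %*% t(L).
set.seed(1)
M <- 2                                 # coefficients per group (M_1 in the Stan code)
N <- 1e5                               # number of simulated group levels (N_1)

SD <- c(2, 0.5)                        # group-level standard deviations (sd_1)
C  <- matrix(c(1, 0.6, 0.6, 1), 2, 2)  # a correlation matrix
L  <- t(chol(C))                       # its lower-triangular Cholesky factor (L_1)

z <- matrix(rnorm(M * N), nrow = M)    # standardized effects, as in z_1 ~ std_normal()
r <- t(diag(SD) %*% L %*% z)           # same operation as scale_r_cor(z, SD, L)

cov(r)                                 # close to diag(SD) %*% C %*% diag(SD)
diag(SD) %*% C %*% diag(SD)            # the covariance the model intends

Running this, the two matrices agree up to Monte Carlo error, which is all that scale_r_cor is doing for the group-level effects.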

For reference, I’d suggest the Stan User’s Guide chapter on regression, which goes over this parameterization in detail. It also covers the LKJ prior we use on correlation matrices, which concentrates toward a unit correlation matrix for shape values greater than 1; the shape of 1 used here is uniform over correlation matrices.
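
To make the shape parameter concrete, here is a tiny illustration of my own (just the unnormalized LKJ kernel, which is proportional to det(C)^(eta - 1)): the determinant of a correlation matrix is at most 1 and equals 1 only for the unit matrix, so eta > 1 favors near-unit matrices while eta = 1 is flat.

# Unnormalized LKJ density kernel for a correlation matrix C: det(C)^(eta - 1).
lkj_kernel <- function(C, eta) det(C)^(eta - 1)

C_unit   <- diag(2)                          # unit (identity) correlation matrix
C_strong <- matrix(c(1, 0.9, 0.9, 1), 2, 2)  # strongly correlated pair

lkj_kernel(C_unit, eta = 1)    # 1
lkj_kernel(C_strong, eta = 1)  # 1       -> eta = 1 (used in the code above) is flat
lkj_kernel(C_unit, eta = 4)    # 1
lkj_kernel(C_strong, eta = 4)  # ~0.0069 -> eta > 1 pulls toward the unit matrix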

In terms of the Cholesky factor form of generating correlated variables, Wikipedia covers that as well.
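
For completeness, the one-line derivation behind that construction (standard linear algebra, written in the same informal notation as above rather than quoting any particular source): if z has identity covariance and y = A * z with A = diag_matrix(SD) * L, then

cov(y) = A * cov(z) * A' = A * A' = diag_matrix(SD) * (L * L') * diag_matrix(SD)

and L * L' is exactly the correlation matrix, so y ends up with standard deviations SD and the correlations encoded by L.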

For more on the LKJ prior, see this paper of @bgoodri’s:

http://www.stat.columbia.edu/~gelman/research/unpublished/Visualization.pdf

Thank you, Bob. Your explanation has been very enlightening.

Marimuthu.