Use normal_id_glm_lpdf in brms for fixed + random effects model?

I’ve been trying to think about how you could rephrase models to use normal_id_glm_lpdf, since I’ve seen it’s quite a bit faster now (and we can use the GPU stuff with it).
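
Just to be clear about what I mean by “use normal_id_glm_lpdf”: my understanding is that it folds the linear predictor into the likelihood call, roughly like this (the length-N vector option for the intercept argument is me going from memory, so double-check that part):

target += normal_id_glm_lpdf(y | x, alpha, beta, sigma);
// should be the same log density as
// target += normal_lpdf(y | alpha + x * beta, sigma);
// with x an N-by-K matrix, beta a K-vector, and alpha either a real
// or (in newer Stan versions) a length-N vector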

For brms, the following formula produces the Stan code below:

fake_data = data.frame(blah = 1:10, goo = rep(c("A", "B"), 5), shoo = rep(1:2, 5))
make_stancode(blah ~ shoo + (shoo | goo), fake_data)
// generated with brms 2.10.3
functions {
}
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  // temporary intercept for centered predictors
  real Intercept;
  real<lower=0> sigma;  // residual SD
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  // cholesky factor of correlation matrix
  cholesky_factor_corr[M_1] L_1;
}
transformed parameters {
  // actual group-level effects
  matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1 = r_1[, 1];
  vector[N_1] r_1_2 = r_1[, 2];
}
model {
  // initialize linear predictor term
  vector[N] mu = Intercept + Xc * b;
  for (n in 1:N) {
    // add more terms to the linear predictor
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
  }
  // priors including all constants
  target += student_t_lpdf(Intercept | 3, 6, 10);
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10);
  target += student_t_lpdf(sd_1 | 3, 0, 10)
    - 2 * student_t_lccdf(0 | 3, 0, 10);
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
  // likelihood including all constants
  if (!prior_only) {
    target += normal_lpdf(Y | mu, sigma);
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}

Below I’ve taken this and shuffled some things around to get everything to use normal_id_glm_lpdf, and I was wondering if it looks valid. If anyone can confirm “this probably looks fine”, I’ll sit down this week and check that it recovers parameters the same as the current brms code.

Everywhere I’ve changed something I’ve put a DIFF comment with a note saying what’s different between the two.

// generated with brms 2.10.3
functions {
}
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  // DIFF
  matrix[N, Kc + 2] all_x_stuff; // put Xc, Z_1_1, and Z_1_2 in one matrix
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
  // DIFF: Dump everything into one matrix
  all_x_stuff[, 1:Kc] = Xc;
  all_x_stuff[, Kc + 1] = Z_1_1;
  all_x_stuff[, Kc + 2] = Z_1_2;
}
parameters {
  vector[Kc] b;  // population-level effects
  // temporary intercept for centered predictors
  real Intercept;
  real<lower=0> sigma;  // residual SD
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  // cholesky factor of correlation matrix
  cholesky_factor_corr[M_1] L_1;
}
transformed parameters {
  // actual group-level effects
  matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1 = r_1[, 1];
  vector[N_1] r_1_2 = r_1[, 2];
  // DIFF: Dump all the above into one big vector
  vector[Kc + N_1 + N_1] all_b_stuff;
  for (i in 1:Kc) {
    all_b_stuff[i] = b[i];
  }
  for (i in 1:N_1) {
    all_b_stuff[Kc + i] = r_1[i, 1];
  }
  for (i in 1:N_1) {
    all_b_stuff[Kc + N_1 + i] = r_1[i, 2];
  }

}
model {
  // DIFF: Don't do the loop here
  // priors including all constants
  target += student_t_lpdf(Intercept | 3, 6, 10);
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10);
  target += student_t_lpdf(sd_1 | 3, 0, 10)
    - 2 * student_t_lccdf(0 | 3, 0, 10);
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
  // likelihood including all constants
  if (!prior_only) {
    // DIFF: Just use normal_id_glm_lpdf instead of normal_lpdf
    target += normal_id_glm_lpdf(Y | all_x_stuff, Intercept, all_b_stuff, sigma);
  }
}
generated quantities {
  // actual population-level intercept
  // DIFF: subset by Kc part of all_b_stuff
  real b_Intercept = Intercept - dot_product(means_X, all_b_stuff[1:Kc]);
  // group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}

It feels like I’m only shuffling the data around into the form I want (which was my intent), but I’m not sure whether this is mathematically wonky.

Doing models like this tends to be pretty fast for large N and M. I think the r_1 part is wonky though: all_b_stuff has Kc + 2 * N_1 entries but all_x_stuff only has Kc + 2 columns, so the dimensions don’t really line up. I have another scheme where only the fixed effects go through normal_id_glm_lpdf and the random effects are still done the way brms does them right now.

For the halfway scheme, you can write the transformed parameters like this:

transformed parameters {
  vector[N] r_1_sorted;
  vector[N] r_2_sorted;
  // actual group-level effects
  matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1 = r_1[, 1];
  vector[N_1] r_1_2 = r_1[, 2];
  // Sort everything here
  for (n in 1:N) {
    r_1_sorted[n] = r_1_1[J_1[n]];
    r_2_sorted[n] = r_1_2[J_1[n]];
  }
}

Then in the model block you can use normal_id_glm_lpdf for the fixed effects and normal_lpdf for the random effects:

model {
  vector[N] mu = r_1_sorted .* Z_1_1 + r_2_sorted .* Z_1_2;
  // priors including all constants
  target += student_t_lpdf(Intercept | 3, 6, 10);
  target += student_t_lpdf(sigma | 3, 0, 10) - 1 * student_t_lccdf(0 | 3, 0, 10);
  target += student_t_lpdf(sd_1 | 3, 0, 10) - 2 * student_t_lccdf(0 | 3, 0, 10);
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1);
  // likelihood including all constants
  target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
  target += normal_lpdf(Y | mu, sigma);
}

I think that makes sense?

  • Operating System: Ubuntu 18.04
  • brms Version: 2.10.3

Apologies, looking at (1) again it’s definitely silly, but (2) should work.

I think there’s some way to do (1) cleverly with two calls to the glm function, though.

The random effects can go in the intercept term, I think. In some models we talk about these as intercept terms anyway, like group-level intercepts, etc.

So originally we had:

target += normal_lpdf(Y | mu, sigma);

But mu was fixed effects + random effects like here:

// initialize linear predictor term
vector[N] mu = Intercept + Xc * b;
for (n in 1:N) {
  // add more terms to the linear predictor
  mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
}

Now just do:

// initialize linear predictor term
vector[N] random_effects = rep_vector(Intercept, N); // Removed Xc * b
for (n in 1:N) {
  // add more terms to the linear predictor
  random_effects[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
}

And call:

target += normal_id_glm_lpdf(Y | Xc, random_effects, b, sigma);
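
If I’m reading the signatures right (again, the length-N vector option for the intercept argument is from memory, so double-check it), that call should give the same log density as the original brms likelihood:

// same likelihood as before, assuming the intercept argument of
// normal_id_glm_lpdf accepts a length-N vector
target += normal_lpdf(Y | random_effects + Xc * b, sigma);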

---

And I played around with this a while back (Edit: I didn’t play around with the glm thing, just to be clear; I’m adding a separator above to indicate this is a different topic now), but I think there’s a way we might be able to make the random effects calculation more efficient.

It’s kind of weird to talk about making it more efficient because it’s already an incredibly fast operation (just looping through and doing a little simple addition), but check here: Group orderings in regression

The motivation there is: maybe we have 100000 people, which means we have to compute 100000 random effects. But what if the difference between most of these people is just one thing? Then we could take the nth person’s value and just adjust it to get the (n + 1)th person’s value, and that’s cheaper. Maybe there’s a way we can reorder the calculations to do a lot less math?
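
For example, one way to exploit a group ordering (just a sketch, and the start_1 and end_1 index arrays are hypothetical data you’d have to construct after sorting the observations by group) would be to fill the random-effect contributions in contiguous blocks instead of indexing r_1_1[J_1[n]] per observation:

// assumes observations are sorted by group, so the rows for group g form
// the contiguous block start_1[g]:end_1[g]
vector[N] re;
for (g in 1:N_1) {
  re[start_1[g]:end_1[g]] = r_1_1[g] * Z_1_1[start_1[g]:end_1[g]]
                            + r_1_2[g] * Z_1_2[start_1[g]:end_1[g]];
}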

I am not sure which of the two versions is to be preferred. The first by @stevebronder (including the random effects in b) requires a lot of sparse matrix machinery because all_x_stuff will be super sparse.
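
For concreteness, here is roughly what a dimensionally consistent all_x_stuff would have to look like to match all_b_stuff (an untested sketch, just to show where the sparsity comes from):

// one column per group-level coefficient, so N x (Kc + 2 * N_1) and
// mostly zeros
matrix[N, Kc + 2 * N_1] all_x_full = rep_matrix(0, N, Kc + 2 * N_1);
all_x_full[, 1:Kc] = Xc;
for (n in 1:N) {
  all_x_full[n, Kc + J_1[n]] = Z_1_1[n];
  all_x_full[n, Kc + N_1 + J_1[n]] = Z_1_2[n];
}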

The proposal by @bbbales2 has the problem that we only get the speed-up for the fixed effects and not for the random effects, so I am not sure how much of a speed-up this ends up being in the end.