Difference between VarCorr and correlation implied by posterior_linpred

The results of VarCorr should AFAIK be the fitted covariance/correlation matrix for group-level effects. In a model where those effects are the only term, one would expect to get exactly the same covariance/correlation structure by computing the linear predictor for a large number of datapoints for each sample and computing the empirical correlation in each sample (i.e. the predictors should be exactly draws from the relevant multivariate normal, so taking many such draws should recover the correlation covariance). This however gives different numbers in practice.

So maybe I am missing something? Or do I have a bug?

Code to reproduce this:

library(brms)
library(tidyverse)
options(mc.cores = parallel::detectCores(), brms.backend = "cmdstanr")

set.seed(58232134)
N_ids <- 20
N_groups <- 3
test_data <- crossing(data.frame(id = paste0("id",1:N_ids), idmod = rnorm(N_ids)), 
                      data.frame(group = paste0("g", 1:N_groups), groupmod = rnorm(N_groups)),
                      data.frame(rep = c(1:3))) %>%
  mutate(response = rnorm(n(), groupmod + idmod))

fit <- brm(response ~ 0 + (0 + group | id), data = test_data)

VarCorr(fit)$id$cor[,"Estimate", ]

Gives:

          groupg1   groupg2   groupg3
groupg1 1.0000000 0.3642701 0.9348216
groupg2 0.3642701 1.0000000 0.4477652
groupg3 0.9348216 0.4477652 1.0000000

Now, computing via the linear predictor:

all_groups <- unique(test_data$group)
n_reps <- 10000
prediction_data <- crossing(id = paste0("new_id", 1:n_reps), group = all_groups)
linpred <- posterior_linpred(fit, newdata = prediction_data, allow_new_levels = TRUE)

n_samples <- dim(linpred)[1]
per_sample_cor <- array(NA_real_, c(n_samples, length(all_groups), length(all_groups)  ))
for(s in 1:n_samples) {
  predictors_wide <- array(NA_real_, c(n_reps, length(all_groups)))
  for(g in 1:length(all_groups)) {
    predictors_wide[ , g] <- linpred[s, prediction_data$group == all_groups[g]]
  }
  per_sample_cor[s, , ] <- cor(predictors_wide)
}

apply(per_sample_cor, MARGIN = c(2,3), FUN = mean)

which gives:

          [,1]      [,2]      [,3]
[1,] 1.0000000 0.4954658 0.8642596
[2,] 0.4954658 1.0000000 0.6006337
[3,] 0.8642596 0.6006337 1.0000000

Any ideas, what is causing the difference?

Martin, did you get to the bottom of this? It sounds very suspicious…

I unfortunately still have no idea…

Tagging @paul.buerkner for additional insights…

I will take a closer look next week :-)

1 Like