Estimating group level relationships from individual level data

Consider the following DAG:

b <- a -> c

The data for a are at the group level while for b,c the data are at the individual level. If I want to estimate the effect a group’s a has on average b, across several groups with varying a, my options are:

  • calculate mean b within groups and fit bf(b.mean ~ a)
  • calculate mean and sd of b withing groups and fit bf(b.mean|se(b.sd, sigma = TRUE) ~ a)
  • fit bf(b ~ a +(1|group))

These produce similar results.

Now let’s say I want to estimate the relationship that b and c have as descendants of a, across groups with varying a.

I expected that fitting bf(b ~ c + (1|group)) would be appropriate, but the estimate is very different from what I get from either bf(b.mean ~ c.mean) or from bf(b.mean|se(sd, sigma = TRUE) ~ me(c.mean, c.sd)).

I do not understand this. I know I am missing something fundamental, so if anyone can explain or point me toward something not too technical that would help me understand this, I would appreciate it.

I would also like to know if the the three options listed in the first case are reasonable or if I am wrong there too.

EDIT:
Ok, hear me out, how about this:

bf( b ~ 1 + c + (1|group) ) + 
bf( c ~ 1 + (1|group) ) + 
set_rescor(TRUE)
1 Like

Some dummy data to try things out:

set.seed(42)
d = data.frame(a = NA,
               b = NA,
               c = NA,
               g = rep(letters, 
                       times = sample(10:30, 
                                      26, 
                                      TRUE)))
for (i in letters) { 
d$a[d$g == i] = rnorm(1)
}
d$b = rnorm(nrow(d), 5 + 3*d$a, 2)
d$c = rnorm(nrow(d), 5 - 3*d$a, 2)
plot(d$b ~ d$a)
plot(d$c ~ d$a)
plot(d$c ~ d$b)


The first two models estimate the relationship of b or c with a, partially pooling deviations of groups from the expected b or c based on their a.

library(brms)

m.b_a = brm(data = d,
          formula = bf(b ~ a + (1|g)), 
          family = "gaussian", 
          backend = "cmdstanr",
          chains = 4, 
          iter = 4000, 
          cores = 4)
summary(m.b_a)

m.c_a = brm(data = d,
          formula = bf(c ~ a + (1|g)), 
          family = "gaussian", 
          backend = "cmdstanr",
          chains = 4, 
          iter = 4000, 
          cores = 4)
summary(m.c_a)

The third model estimates average b and c across groups. The correlated random intercepts capture the across groups correlation which is indeed ~ -1. The residual correlation is correlation of b and c within groups; they are unrelated.

m.bc = brm(data = d,
           formula = bf(c ~ 1 + (1|q|g))+
                     bf(b ~ 1 + (1|q|g))+
                     set_rescor(TRUE), 
           family = "gaussian", 
           backend = "cmdstanr",
           chains = 4, 
           iter = 4000, 
           cores = 4)
summary(m.bc)
Multilevel Hyperparameters:
~g (Number of levels: 26) 
                             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(c_Intercept)                  3.38      0.47     2.60     4.40 1.00     1518     2802
sd(b_Intercept)                  3.15      0.43     2.43     4.14 1.00     1529     2813
cor(c_Intercept,b_Intercept)    -0.99      0.01    -1.00    -0.96 1.00     2948     4005

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
c_Intercept     5.03      0.64     3.78     6.32 1.00      630     1312
b_Intercept     4.78      0.60     3.58     5.94 1.00      649     1449

Further Distributional Parameters:
        Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_c     2.11      0.07     1.98     2.25 1.00    11661     6128
sigma_b     2.10      0.07     1.97     2.24 1.00    12944     5789

Residual Correlations: 
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
rescor(c,b)     0.01      0.05    -0.08     0.11 1.00    11481     6098

The last model estimates the relationship of c with b across groups. So the -1 estimate has now moved from the multilevel part of the model to the coefficient for b on c. The residual within group correlation is now positive (which I do not understand).

m.c_b = brm(data = d,
           formula = bf(c ~ 1 + b + (1|q|g))+
                     bf(b ~ 1 + (1|q|g))+
                     set_rescor(TRUE), 
           family = "gaussian", 
           backend = "cmdstanr",
           chains = 4, 
           iter = 4000, 
           control = list(adapt_delta = 0.95),
           cores = 4)
summary(m.c_b)

Multilevel Hyperparameters:
~g (Number of levels: 26) 
                             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(c_Intercept)                  0.63      0.80     0.02     3.02 1.00      858      721
sd(b_Intercept)                  3.28      0.49     2.47     4.40 1.00     3783     5005
cor(c_Intercept,b_Intercept)    -0.07      0.70    -1.00     0.98 1.02      162      969

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
c_Intercept     9.81      1.43     5.84    11.75 1.01      213      593
b_Intercept     4.78      0.65     3.51     6.03 1.00     3933     4660
c_b            -1.00      0.30    -1.40    -0.16 1.01      203      598

Further Distributional Parameters:
        Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_c     3.04      0.34     2.21     3.67 1.01      245      739
sigma_b     2.10      0.07     1.96     2.24 1.00    14054     6047

Residual Correlations: 
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
rescor(c,b)     0.68      0.18     0.15     0.82 1.01      243      567

But overall, I think this model does what I originally thought that bf(b ~ c + (1|group)) would do.

Can I have a “yay” or “nay you fool, that’s not how any of this works”?