Fit a `brm()` model with an array of predictors

  • Operating System: macOS
  • brms Version: 2.14.4

In the new Gelman et al. (2020) text, Regression and Other Stories, we see the following simulation on pp. 159–160.

# define the sample size
N <- 100
# define the number of continuous variables
K <- 10
# make an array of K continuous variables
X <- array(runif(N * K, 0, 1), c(N, K)) 
# make an additional dummy predictor
z <- sample(c(0, 1), N, replace = TRUE) 

# set the data-generating values
a <- 1
b <- 1:K
theta <- 5
sigma <- 2

# simulate the outcome
y <- a + X %*% b + theta * z + rnorm(N, 0, sigma) 

# save the results in a data frame
fake <- data.frame(
  X = X, 
  y = y, 
  z = z)
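
Since runif(), sample(), and rnorm() all draw random numbers, you may want to set a seed before running the block above so the results are reproducible.

# optional: make the simulation reproducible (the seed value is arbitrary)
set.seed(1)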

They then fit the full model with rstanarm::stan_glm() like so.

fit1 <- rstanarm::stan_glm(y ~ X + z, data = fake)

The resulting summary shows a model fit with an intercept, a z slope, and 10 additional slopes for X1 through X10.
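
You can see where those coefficient names come from by peeking at the design matrix base R builds from the same formula; because X has no column names, its columns get labeled X1 through X10.

# the formula machinery expands the matrix into one column per predictor
colnames(model.matrix(y ~ X + z))

This returns:

 [1] "(Intercept)" "X1"          "X2"          "X3"          "X4"
 [6] "X5"          "X6"          "X7"          "X8"          "X9"
[11] "X10"         "z"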

I’m not sure I have the proper language for this, so I’ll apologize up front. But is it possible to use this array-oriented syntax in a formula argument within brm()? My initial naive attempt failed.

# this fails
fit2 <- brms::brm(y ~ X + z, data = fake)

Instead, I get the following error message: “Error: The following variables are missing in ‘data’: ‘X’”
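
My guess at what's going on: data.frame() silently splits a matrix into separate columns, so fake contains X.1 through X.10 but no variable named X. stan_glm() apparently still finds the X matrix in the global environment, whereas brm() requires every variable in the formula to be present in data.

# there is no column named "X"; data.frame() split the matrix apart
names(fake)

This returns:

 [1] "X.1"  "X.2"  "X.3"  "X.4"  "X.5"  "X.6"  "X.7"  "X.8"  "X.9"  "X.10"
[11] "y"    "z"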

Add X to the data as a matrix column and you should be good to go.

Huh. Thanks @paul.buerkner; it works! For those interested, here’s the updated code:

# first make the data frame without the X matrix
fake <- data.frame(
  y = y, 
  z = z)

# then add in the X matrix by simple assignment
fake$X <- X

You can then check the data structure.

str(fake)

This returns:

'data.frame':	100 obs. of  3 variables:
 $ y: num  34 44.2 39 29.5 40 ...
 $ z: num  0 1 1 0 1 1 1 0 0 0 ...
 $ X: num [1:100, 1:10] 0.705 0.56 0.992 0.213 0.254 ...

It turns out you can do this in one step with tibbles, like so:

fake <- tibble::tibble(
  X = X, 
  y = y, 
  z = z) 
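
Unlike data.frame(), a tibble stores X as a single matrix column, which you can confirm with names():

names(fake)

This returns:

[1] "X" "y" "z"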

Anyway, now the brms code yields the intended solution:

fit3 <- brms::brm(y ~ X + z, data = fake)

With the following summary:

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ X + z 
   Data: fake (Number of observations: 100) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     2.31      1.21    -0.09     4.69 1.00     7661     3260
X1            1.07      0.75    -0.41     2.52 1.00     6643     3322
X2            1.60      0.68     0.27     2.97 1.00     7937     3257
X3            1.46      0.88    -0.25     3.19 1.00     7336     3092
X4            3.22      0.71     1.83     4.62 1.00     6274     3012
X5            5.17      0.69     3.82     6.53 1.00     8376     2685
X6            7.24      0.70     5.85     8.58 1.00     7161     3138
X7            6.09      0.72     4.64     7.53 1.00     7521     3255
X8            7.88      0.78     6.33     9.38 1.00     8062     3187
X9            8.34      0.69     6.97     9.71 1.00     7447     3151
X10          10.75      0.73     9.31    12.20 1.00     6992     3321
z             4.67      0.42     3.85     5.49 1.00     8137     3220

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.96      0.15     1.70     2.28 1.00     5989     3049

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Very cool. I assume you can do this with a hierarchical specification too? E.g.,

brms::brm(y ~ X + (1 + X | group), data = data)
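
I haven't tested this myself, but since the matrix column is just another variable in the data, I'd expect the same trick to carry over to the group-level part of the formula. Here's an untested sketch, with a made-up grouping variable tacked onto the fake data from above:

# untested sketch: add a hypothetical grouping variable
fake$group <- sample(1:5, N, replace = TRUE)

# random intercepts and random X slopes by group
fit4 <- brms::brm(y ~ X + (1 + X | group), data = fake)

With only five groups and eleven correlated group-level terms, this particular model would be poorly identified, so treat it as a syntax check rather than a serious fit.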