- Operating System: macOS
- brms Version: 2.14.4
In the new Gelman et al. (2020) text, we see the following simulation on pp. 159–160.
# define the sample size
N <- 100
# define the number of continuous variables
K <- 10
# make an array of K continuous variables
X <- array(runif(N * K, 0, 1), c(N, K))
# make an additional dummy predictor
z <- sample(c(0, 1), N, replace = TRUE)
# set the data-generating values
a <- 1
b <- 1:K
theta <- 5
sigma <- 2
# simulate the outcome
y <- a + X %*% b + theta * z + rnorm(N, 0, sigma)
# save the results in a data frame
fake <- data.frame(
  X = X,
  y = y,
  z = z)
They then fit the full model with rstanarm::stan_glm() like so.
fit1 <- rstanarm::stan_glm(y ~ X + z, data = fake)
The resulting summary shows a model fit with an intercept, a z slope, and 10 additional slopes for X1 through X10.
I’m not sure I have the proper language for this, so I’ll apologize up front. But is it possible to use this array-oriented syntax in a formula argument within brm()? My initial naive attempt failed.
# this fails
fit2 <- brms::brm(y ~ X + z, data = fake)
Instead, I get the following error message: “Error: The following variables are missing in ‘data’: ‘X’”
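As far as I can tell, the error comes from how data.frame() handles matrices: it splits a matrix argument into one numeric column per matrix column, named X.1, X.2, and so on, so no single variable named X exists for the formula to find. A minimal self-contained illustration:

```r
# data.frame() splits a matrix into one column per matrix
# column, named X.1, X.2, ..., so the variable "X" is lost
m <- matrix(1:4, nrow = 2)
d <- data.frame(X = m, y = c(10, 20))
names(d)
#> [1] "X.1" "X.2" "y"
```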
Add X to the data as a matrix column and you should be good to go.
Huh. Thanks @paul.buerkner; it works! For those interested, here’s the updated code:
# first make the data frame without the X matrix
fake <- data.frame(
  y = y,
  z = z)
# then add in the X matrix by simple assignment
fake$X <- X
You can then check the data structure.
str(fake)
This returns:
'data.frame': 100 obs. of 3 variables:
$ y: num 34 44.2 39 29.5 40 ...
$ z: num 0 1 1 0 1 1 1 0 0 0 ...
$ X: num [1:100, 1:10] 0.705 0.56 0.992 0.213 0.254 ...
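As an aside, base R's I() should let you do this in a single data.frame() call: wrapping the matrix in I() is documented to protect it from being split into separate columns. A sketch using the X, y, and z objects simulated above (note the resulting column carries class "AsIs"; I haven't checked whether brm() minds that, so the two-step assignment above may be the safer route):

```r
# I() protects the matrix, so data.frame() keeps it as a
# single matrix column instead of splitting it into X.1 ... X.10
fake <- data.frame(
  X = I(X),
  y = y,
  z = z)
str(fake$X)
```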
It turns out you can do this in one step with tibbles, like so:
library(tibble)

fake <- tibble(
  X = X,
  y = y,
  z = z)
Anyway, now the brms code yields the intended solution:
fit3 <- brms::brm(y ~ X + z, data = fake)
With the following summary:
Family: gaussian
Links: mu = identity; sigma = identity
Formula: y ~ X + z
Data: fake (Number of observations: 100)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 2.31 1.21 -0.09 4.69 1.00 7661 3260
X1 1.07 0.75 -0.41 2.52 1.00 6643 3322
X2 1.60 0.68 0.27 2.97 1.00 7937 3257
X3 1.46 0.88 -0.25 3.19 1.00 7336 3092
X4 3.22 0.71 1.83 4.62 1.00 6274 3012
X5 5.17 0.69 3.82 6.53 1.00 8376 2685
X6 7.24 0.70 5.85 8.58 1.00 7161 3138
X7 6.09 0.72 4.64 7.53 1.00 7521 3255
X8 7.88 0.78 6.33 9.38 1.00 8062 3187
X9 8.34 0.69 6.97 9.71 1.00 7447 3151
X10 10.75 0.73 9.31 12.20 1.00 6992 3321
z 4.67 0.42 3.85 5.49 1.00 8137 3220
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.96 0.15 1.70 2.28 1.00 5989 3049
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Very cool. I assume you can do this with a hierarchical specification, too? E.g.,
brms::brm(y ~ X + (1 + X | group), data = data)
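For concreteness, here is an untested sketch of what that might look like with the fake data from above. The group variable is hypothetical (it is not part of the original simulation), and I haven't run this fit, so treat it as a starting point rather than a confirmed answer:

```r
# hypothetical grouping factor, added only for illustration
fake$group <- sample(LETTERS[1:5], N, replace = TRUE)

# untested sketch: the matrix column X used on both the
# population-level and group-level sides of the formula
fit4 <- brms::brm(y ~ X + z + (1 + X | group), data = fake)
```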