Okay, so I was directed to the explanation for this by @tjmahr and given some code for setting the prior from @Solomon. Thanks so much for taking a look and helping out!
As is clearly stated in the docs (but I somehow missed), setting a prior on class = Intercept
in brms does not set a prior on the actual intercept of the linear model, but rather on the temporary intercept of the centered design matrix. This is explained here under the heading “Parameterization of the population-level intercept.”
Using the code below results in the expected behaviour:
prior_check2 <- brm(
  y ~ 0 + Intercept + x,
  data = df,
  prior = c(
    prior(normal(8, 3), class = "b", coef = "Intercept"),
    prior(normal(0, 2), class = "b", coef = "x1"),
    prior(normal(0, 3), class = "sigma")
  ),
  sample_prior = "only"
)
plot(conditional_effects(prior_check2))
I’m doing my best to understand the nuances of why placing the prior on the temporary intercept produces the behaviour above, and I’ll update if/when that happens.
EDIT: This is a further explanation of why this happens. I’m hoping that it may be useful to others in the future. I think it was certainly worth the time I took to figure it out.
TL;DR - The prior you set via class = Intercept
in my example is actually a prior on the grand mean of y, just as it would be if you had sum-coded the binary factor as (-0.5, 0.5) (because, after centering, that is exactly what you are doing). This holds only for my example, in which the data were balanced. Unbalanced counts of the levels of a categorical predictor would lead to slightly different behaviour.
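To see that equivalence concretely, here is a small sketch (in Python rather than R, purely for illustration, with made-up data): centering a balanced 0/1 dummy column produces exactly the -0.5/0.5 sum coding.

```python
# Centering a balanced dummy-coded column is the same as sum-coding it.
x = [0, 0, 0, 1, 1, 1]           # balanced 0/1 dummy for a binary factor
mean_x = sum(x) / len(x)         # 0.5 when the two levels are balanced
xc = [xi - mean_x for xi in x]   # centered column
print(xc)  # [-0.5, -0.5, -0.5, 0.5, 0.5, 0.5]
```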
This is the Stan code generated by brms for the prior_check
model fit in the original post.
// generated with brms 2.18.0
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
}
transformed parameters {
  real lprior = 0;  // prior contributions to the log posterior
  lprior += normal_lpdf(b | 0, 2);
  lprior += normal_lpdf(Intercept | 8, 3);
  lprior += normal_lpdf(sigma | 0, 3)
    - 1 * normal_lccdf(0 | 0, 3);
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}
The important parts to remember here are in the transformed data block, which defines Xc
as a centered version of the design matrix X
with one fewer column than the original. The values are transformed inside the for loop, copied by itself below. Because K = 2 in this model, means_X
is a single number: the mean of the second column of the design matrix, which holds our factor x
. The mean of that column is 0.5
, because it contains equal numbers of 0s and 1s. The new design matrix Xc
is then a single column, obtained by taking the original column of 0s and 1s and subtracting 0.5, turning it into a column of -0.5s and 0.5s, respectively.
for (i in 2:K) {
  means_X[i - 1] = mean(X[, i]);
  Xc[, i - 1] = X[, i] - means_X[i - 1];
}
With K = 2 and balanced data, the loop reduces to:
for (i in 2:2) {
  means_X[1] = 0.5;
  Xc[, 1] = X[, 2] - 0.5;
}
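The same transformation can be sketched in Python (for illustration only; the variable names mirror the Stan code, and the data values are made up):

```python
import numpy as np

# Design matrix with an intercept column and a balanced 0/1 dummy column.
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [1.0, 1.0]])
K = X.shape[1]                   # K = 2, as in the model above
means_X = X[:, 1:].mean(axis=0)  # column means of X, skipping the intercept
Xc = X[:, 1:] - means_X          # centered design matrix, one fewer column
print(means_X)                   # 0.5 for the balanced dummy column
print(Xc.ravel())                # -0.5s and 0.5s
```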
The line that fits the model passes the response variable Y
, the centered design matrix Xc
, the temporary Intercept
, the population-level effects b
, and sigma
. This means that the population-level effect of x
on y
is estimated with the sum-coded version of my dummy-coded variable, and the Intercept estimated is the grand mean of y across the two levels.
target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
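Here is a hedged sketch of why the intercept of a model with a centered predictor is the grand mean of y. It uses ordinary least squares in Python as a stand-in for the Bayesian fit, on simulated data, so the numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.repeat([0.0, 1.0], 50)            # balanced binary predictor
y = 8 + 2 * x + rng.normal(0, 1, 100)    # made-up response
xc = x - x.mean()                        # centered predictor: -0.5 / 0.5
slope, intercept = np.polyfit(xc, y, 1)  # least-squares fit of y ~ 1 + xc
# With a centered predictor, the fitted intercept is exactly mean(y).
print(np.isclose(intercept, y.mean()))   # True
```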
In the generated quantities block, we get back b_Intercept, the average of y for the first level of the factor. This is computed by taking the temporary intercept, which is the grand mean, and subtracting the dot product of the column means and the coefficients:
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}
Because my data were balanced, the temporary intercept here is the grand mean of the two categories. We then subtract half of the estimated effect (means_X is 0.5, so dot_product(means_X, b) is 0.5 * b) to get back the b_Intercept
reported by the model, which is the average of y
when x = 0
, i.e., the first level of the factor.
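That recovery can be checked numerically with the same least-squares stand-in (Python, simulated data, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.repeat([0.0, 1.0], 50)            # balanced binary predictor
y = 8 + 2 * x + rng.normal(0, 1, 100)    # made-up response
means_X = np.array([x.mean()])           # 0.5 for balanced data
xc = x - means_X[0]
b, Intercept = np.polyfit(xc, y, 1)      # slope and temporary intercept
# Same formula as the generated quantities block:
b_Intercept = Intercept - np.dot(means_X, np.array([b]))
# b_Intercept recovers the mean of y at the first factor level (x = 0).
print(np.isclose(b_Intercept, y[x == 0].mean()))  # True
```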
This means that, for this example, I could have treated setting a prior on class = Intercept
as if I were setting a prior on the grand mean of the two levels. If I’m understanding this correctly, however, if my levels were unbalanced, then the temporary intercept would be closer to the mean of whichever level had more observations.
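A quick numerical check of the unbalanced case (same least-squares stand-in, simulated data, Python for illustration): the centered intercept is still the mean of y, which is now a weighted mean pulled toward the more common level.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([np.zeros(80), np.ones(20)])  # unbalanced: 80 vs 20
y = 8 + 2 * x + rng.normal(0, 1, 100)            # made-up response
xc = x - x.mean()                                # mean(x) is 0.2, not 0.5
slope, Intercept = np.polyfit(xc, y, 1)
print(np.isclose(Intercept, y.mean()))           # True: still mean(y)
# mean(y) sits closer to the group mean of the more common level (x = 0).
print(abs(Intercept - y[x == 0].mean()) < abs(Intercept - y[x == 1].mean()))
```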