Help understanding prior simulation in brms - dummy coding not giving expected result

Hello everyone,

I’ve run into a weird issue while designing teaching materials. Prior predictive simulation in brms is not demonstrating the behaviour that I expected. I am looking to better understand what brms is doing or to clear up a conceptual misunderstanding on my end.

The example concerns dummy coding of categorical variables and how it can create problems when setting priors. To keep things concrete, I will walk through a reproducible example. Below I create a variable ‘x’ that has two levels, 0 and 1, and then create y as a continuous variable drawn from a normal distribution whose mean is conditional on the level of x.


# Simulating data
set.seed(1)  # for reproducibility
x <- rep(c(0, 1), times = 50)
y <- rnorm(100, 10 + (x * 2), 1)

# Creating dataframe and changing x to factor
df <- data.frame(x, y)
df$x <- as.factor(df$x)

Because the level ‘0’ is represented by the Intercept, and the estimated effect of ‘1’ is the difference from the Intercept, I would expect any set of priors to imply greater uncertainty about the average of y when x = 1: by definition it carries both the uncertainty of the prior on the Intercept and the uncertainty of the prior on the difference between the levels. I checked that this was the case with the simple simulation below:

# Running a simple prior simulation
x0 <- rnorm(1e6, mean = 8, sd = 3)
x1 <- rnorm(1e6, mean = 8, sd = 3) + rnorm(1e6, mean = 0, sd = 2)

# Means are equal as expected
mean(x0) # 7.998458
mean(x1) # 7.995305

# x1 is more uncertain (larger SD), as expected
sd(x0) # 2.999817
sd(x1) # 3.605881

However, when I run a prior check using the brm() function, I don’t see that the uncertainty regarding the averages is different. I run the prior check using the code below.

prior_check <- brm(y ~ x,
                   data = df,
                   prior = c(
                     prior(normal(8,3), class = "Intercept"),
                     prior(normal(0,2), class = "b"),
                     prior(normal(0,3), class = "sigma")),
                   sample_prior = "only")

First I visualized the prior draws of the two group means (my plotting code and the resulting figure are omitted here).


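A minimal stand-in for that visualization (not my original code; it assumes the prior_check fit above and uses the bayesplot package):

```r
# Comparing the implied prior distributions of the two group means.
library(brms)
library(bayesplot)
library(posterior)

draws <- as_draws_df(prior_check)
mcmc_areas(data.frame(
  mu_x0 = draws$b_Intercept,               # mean of y when x = 0
  mu_x1 = draws$b_Intercept + draws$b_x1   # mean of y when x = 1
))
```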
The uncertainty regarding the means looks about the same, which I didn’t understand, so I looked at the draws myself, which led to the same result:

# Getting a dataframe of draws
prior_df <- as_draws_df(prior_check)

# Getting the means - similar as expected
mean(prior_df$b_Intercept) # 7.97645
mean(prior_df$b_Intercept+prior_df$b_x1) # 8.072909

# Getting the sds, similar contrary to expectations
sd(prior_df$b_Intercept) # 3.175866
sd(prior_df$b_Intercept+prior_df$b_x1) # 3.11842

All of the materials I’ve read, as well as many posts on this site and others, refer to this problem with setting priors on dummy-coded variables, yet I don’t seem to be encountering it in brms. Any insight into why this is happening in brms, and/or what I do not understand about setting priors, would be awesome!

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] brms_2.18.0 Rcpp_1.0.10

loaded via a namespace (and not attached):
  [1] nlme_3.1-152         matrixStats_0.60.1   xts_0.12.1          
  [4] threejs_0.3.3        rstan_2.21.2         tensorA_0.36.2      
  [7] tools_4.1.1          backports_1.2.1      utf8_1.2.2          
 [10] R6_2.5.1             DT_0.19              mgcv_1.8-36         
 [13] projpred_2.0.2       DBI_1.1.1            colorspace_2.0-2    
 [16] withr_2.5.0          tidyselect_1.2.0     gridExtra_2.3       
 [19] prettyunits_1.1.1    processx_3.5.2       Brobdingnag_1.2-6   
 [22] emmeans_1.6.3        curl_4.3.2           compiler_4.1.1      
 [25] cli_3.2.0            shinyjs_2.0.0        sandwich_3.0-1      
 [28] labeling_0.4.2       colourpicker_1.1.0   posterior_1.2.1     
 [31] scales_1.2.1         dygraphs_1.1.1.6     checkmate_2.0.0     
 [34] mvtnorm_1.1-2        ggridges_0.5.3       callr_3.7.0         
 [37] stringr_1.5.0        digest_0.6.27        StanHeaders_2.21.0-7
 [40] minqa_1.2.4          base64enc_0.1-3      pkgconfig_2.0.3     
 [43] htmltools_0.5.2      lme4_1.1-29          fastmap_1.1.0       
 [46] htmlwidgets_1.5.3    rlang_1.0.6          rstudioapi_0.13     
 [49] shiny_1.6.0          farver_2.1.0         generics_0.1.0      
 [52] zoo_1.8-9            jsonlite_1.7.2       crosstalk_1.1.1     
 [55] gtools_3.9.2         dplyr_1.0.8          distributional_0.2.2
 [58] inline_0.3.19        magrittr_2.0.3       loo_2.4.1           
 [61] bayesplot_1.8.1      Matrix_1.5-3         munsell_0.5.0       
 [64] fansi_0.5.0          abind_1.4-5          lifecycle_1.0.3     
 [67] stringi_1.6.2        multcomp_1.4-17      MASS_7.3-54         
 [70] pkgbuild_1.2.0       plyr_1.8.6           grid_4.1.1          
 [73] parallel_4.1.1       promises_1.2.0.1     crayon_1.4.1        
 [76] miniUI_0.1.1.1       lattice_0.20-44      splines_4.1.1       
 [79] ps_1.6.0             pillar_1.8.1         igraph_1.2.6        
 [82] boot_1.3-28          markdown_1.1         estimability_1.3    
 [85] shinystan_2.5.0      reshape2_1.4.4       codetools_0.2-18    
 [88] stats4_4.1.1         rstantools_2.1.1     glue_1.6.2          
 [91] V8_3.4.2             RcppParallel_5.1.4   nloptr_1.2.2.2      
 [94] vctrs_0.5.2          httpuv_1.6.2         gtable_0.3.0        
 [97] assertthat_0.2.1     ggplot2_3.4.1        mime_0.11           
[100] xtable_1.8-4         coda_0.19-4          later_1.3.0         
[103] survival_3.2-11      rsconnect_0.8.24     tibble_3.1.4        
[106] shinythemes_1.2.0    gamm4_0.2-6          TH.data_1.0-10      
[109] ellipsis_0.3.2       bridgesampling_1.1-2

UPDATE: I was directed to the explanation for this by @tjmahr and given code for setting the prior by @Solomon. Thanks so much for taking a look and helping out!

As is clearly stated in the docs (though I somehow missed it), setting a prior on class = Intercept in brms does not set a prior on the actual intercept of the linear model, but rather on the temporary intercept of the centered design matrix. This is explained in the brms documentation under the heading “Parameterization of the population-level intercept.”

Using the code below results in the expected behaviour:

prior_check2 <- brm(y ~ 0 + Intercept + x,
                    data = df,
                    prior = c(
                      prior(normal(8,3), class = "b", coef = "Intercept"),
                      prior(normal(0,2), class = "b", coef = "x1"),
                      prior(normal(0,3), class = "sigma")),
                    sample_prior = "only")

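A quick check of the fix (assuming the prior_check2 fit above has been run): the prior draws should now show the asymmetry from my hand simulation, with sd ≈ 3 for the x = 0 mean and sd ≈ sqrt(3^2 + 2^2) ≈ 3.61 for the x = 1 mean.

```r
# Inspecting the prior draws from prior_check2 (requires the fit above).
library(posterior)
prior_df2 <- as_draws_df(prior_check2)

sd(prior_df2$b_Intercept)                   # should be ~ 3
sd(prior_df2$b_Intercept + prior_df2$b_x1)  # should be ~ sqrt(13) ~ 3.61
```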

I’m doing my best to understand the nuances of why placing the prior on the temporary intercept results in the behaviour above, and I’ll update if/when that happens.

EDIT: This is a further explanation of why this happens. I’m hoping that it may be useful to others in the future. I think it was certainly worth the time I took to figure it out.

TL;DR - The temporary intercept you place a prior on when you use class = Intercept is, in my example, actually a prior on the grand mean of y, exactly as if you had sum-coded the binary factor as (-0.5, 0.5), because internally that is what brms does. This holds for my example because the data were balanced; unbalanced counts of the levels of a categorical predictor lead to slightly different behaviour.

This is the Stan code generated by brms for the prior_check model fit in the original post.

// generated with brms 2.18.0
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
}
transformed parameters {
  real lprior = 0;  // prior contributions to the log posterior
  lprior += normal_lpdf(b | 0, 2);
  lprior += normal_lpdf(Intercept | 8, 3);
  lprior += normal_lpdf(sigma | 0, 3)
    - 1 * normal_lccdf(0 | 0, 3);
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

The important part is the transformed data block, where Xc is defined as a centered version of the design matrix X with one less column than the original. The values are then transformed inside the for loop, copied by itself below. As K = 2 in this model, means_X is a single number: the mean of the second column of the design matrix, which holds our factor x. The mean of that column is 0.5, because it contains an equal number of 0s and 1s. The new design matrix Xc is therefore a single column where the original 0s and 1s have had 0.5 subtracted, turning them into -0.5s and 0.5s, respectively.

for (i in 2:K) {
  means_X[i - 1] = mean(X[, i]);
  Xc[, i - 1] = X[, i] - means_X[i - 1];
}
// with K = 2 and balanced data, this reduces to:
for (i in 2:2) {
  means_X[1] = 0.5;
  Xc[, 1] = X[, 2] - 0.5;
}

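The same centering step can be reproduced in base R with model.matrix (a self-contained sketch, rebuilding the balanced x from the top of the post):

```r
# Rebuilding the dummy-coded design matrix that brms centers.
x <- factor(rep(c(0, 1), times = 50))
X <- model.matrix(~ x)    # columns: (Intercept), x1 -- dummy coding
means_X <- mean(X[, 2])   # 0.5, because the two levels are balanced
Xc <- X[, 2] - means_X    # -0.5s and 0.5s: effectively sum coding
sort(unique(Xc))          # -0.5  0.5
```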
The line that fits the model passes the response variable Y, the centered design matrix Xc, the temporary Intercept, the population-level effects b, and sigma. This means the population-level effect of x on y is estimated with the sum-coded version of my dummy-coded variable, and the Intercept being estimated is the grand mean of the two levels.

target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);

In the generated quantities block, we get back b_Intercept, which is the average of y for the first level of the factor. This is done by taking the temporary intercept, which is the grand mean, and subtracting dot_product(means_X, b), which in this model is just 0.5 times the effect of x:

generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

Because my data was balanced, the temporary intercept here was the grand mean between the two categories. We then subtract half of the estimated effect to get back to the b_Intercept reported by the model, which is the average of y when x=0, i.e., the first level of the factor.
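As a sanity check (plain base R, not brms output), simulating the same priors under this centered, sum-coded parameterization reproduces the roughly equal standard deviations seen in the prior draws earlier:

```r
# Priors as brms actually applies them: the prior sits on the grand mean.
set.seed(42)
n <- 1e6
grand_mean <- rnorm(n, mean = 8, sd = 3)  # prior on the temporary Intercept
effect     <- rnorm(n, mean = 0, sd = 2)  # prior on b_x1

mu0 <- grand_mean - 0.5 * effect  # implied mean of y when x = 0
mu1 <- grand_mean + 0.5 * effect  # implied mean of y when x = 1

# Both sds are sqrt(3^2 + 0.5^2 * 2^2) = sqrt(10) ~ 3.16,
# matching the ~3.18 and ~3.12 from the prior draws above.
sd(mu0)
sd(mu1)
```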

This means that, for this example, I could have treated setting a prior on class = Intercept as setting a prior on the grand mean of the two levels. If I’m understanding this correctly, however, with unbalanced levels the temporary intercept would instead sit closer to the mean of whichever level had more observations.
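To illustrate the unbalanced case (a hypothetical 80/20 split, not part of the original example), the centering constant moves away from 0.5:

```r
# Hypothetical unbalanced version of the factor: 80 zeros, 20 ones.
x_unbal <- factor(rep(c(0, 1), times = c(80, 20)))
X <- model.matrix(~ x_unbal)  # dummy-coded design matrix
mean(X[, 2])                  # 0.2: the centering constant means_X
# The temporary intercept is then mu0 + 0.2 * b = 0.8 * mu0 + 0.2 * mu1,
# i.e. pulled toward the level with more observations (here, level 0).
```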