Index approach for categorical predictors in a multi-level model, featuring continuous variables, in brms

Hello,
I am trying to use the index approach for categorical predictors in a model, that should include factors and continuous variables, with a nested design.

Here is my code for generating the data and model using the indicator approach:

# Generation of data for example:
SuperFamily = sample(c("a","b","c"),10, replace=TRUE)
Family = sample(c("l","m","n","o","p"),10, replace=TRUE)
y = sample(c(1,0),10, replace=TRUE)
x11 = rnorm(10,1,0.3)
x22 = rnorm(10,1,0.2)
factor1 = sample(c("h","i","j","k"),10, replace=TRUE)
factor2 = sample(c("e","f","g"),10, replace=TRUE)

# Min-max normalization of continuous predictors:
x1 = (x11-min(x11))/(max(x11)-min(x11))
x2 = (x22-min(x22))/(max(x22)-min(x22))

# Combine the vectors in a dataframe:
dstan <- as.data.frame(cbind(SuperFamily, Family, y, x1, x2, factor1, factor2))

# Classification of variables:
dstan$SuperFamily = as.factor(dstan$SuperFamily)
dstan$Family = as.factor(dstan$Family)
dstan$y = as.integer(dstan$y)
dstan$x1 = as.numeric(dstan$x1)
dstan$x2 = as.numeric(dstan$x2)
dstan$factor1 = as.factor(dstan$factor1)
dstan$factor2 = as.factor(dstan$factor2)

str(dstan)

library(brms)

model1 <- brm(
  y ~ x1 + x2 + factor1 + factor2 + (1 | SuperFamily/Family),
  data = dstan,
  family = bernoulli(link = "logit"),
  prior = c(
    prior(normal(0.5, 0.25), "b"),
    prior(cauchy(0, 1), class = "sd")
  ),
  chains = 4, iter = 450, warmup = 150, cores = getOption("mc.cores", 4L), control = list(max_treedepth = 12, adapt_delta = 0.99), seed=1
)

I have tried to transform the above model for index approach in two ways, according to what I have found here, here and here:

Way 1: Changing global intercept to 0:

model1.2 <- brm(
  y ~ 0 + x1 + x2 + factor1 + factor2 + (1 | SuperFamily/Family),
  data = dstan,
  family = bernoulli(link = "logit"),
  prior = c(
    prior(normal(0.5, 0.25), "b"),
    prior(cauchy(0, 1), class = "sd")
  ),
  chains = 4, iter = 450, warmup = 150, cores = getOption("mc.cores", 4L), control = list(max_treedepth = 12, adapt_delta = 0.99), seed=1
)

However, using “0” as a global intercept might not be an appropriate solution for a model with continuous variables and multiple categorical predictors. Some of the problems I’ve found were:

  1. The posterior estimates don’t include a global intercept (e.g., for x=0, I would expect the intercept to equal something else than 0, as 0 is not the mean of y);
  2. Not all the levels of “factor2” appear among the posterior estimated parameters.

Way 2: Using bf() function with different intercepts for each variable:

model1.3 <- brm(
  formula = bf(
    y~ a*x1 + b*x2 + c + d + e ,
    a ~ 1 + x1,
    b ~ 1 + x2,
    c ~ 0 + factor1,
    d ~ 0 + factor2,
    e ~ 0 + (1|SuperFamily/Family),
    nl = TRUE),
  data = dstan,
  family = bernoulli(link = "logit"),
  prior = prior(cauchy(0, 1), nlpar = "e", class="sd") +
    prior(normal(0.5, 0.25), nlpar = "a") +
    prior(normal(0.5, 0.25), nlpar = "b") +
    prior(normal(0.5, 0.25), nlpar = "c") +
    prior(normal(0.5, 0.25), nlpar = "d"),
  chains = 4, iter = 450, warmup = 150, cores = getOption("mc.cores", 4L), control = list(max_treedepth = 12, adapt_delta = 0.99), seed=1
)

Additionally, I’m not sure that I’m using the right syntax in bf():

  • Matching an intercept for each continuous variable, instead of having a global intercept, feels particularly strange.

  • Also, is the formula transformed correctly from the original model (model1)?

I would be happy to hear your opinion on my solutions and will be thankful if you could help me correctly convert my model (i.e., “model1”) to index approach in brms.

Many thanks in advance!

I am sorry, but I don’t understand yet what you are trying to achieve. What do you mean by “indexing approach” exactly?

1 Like

Thanks for your reply!

I would like the posterior to present the following estimates:
Global intercept,
b_x1,
b_x2,
b_factor1_level1,
b_factor1_level2 etc. - for ALL the levels of this factor,
b_factor2_level1 etc. - for ALL the levels of this factor.

The bottom line is that I don’t want the global intercept to “borrow” levels from factors (as in the indicator approach), but rather each factor level to have a parameter estimate of its own.

I hope it’s clearer now.

Thanks again!

I am not sure if what you are trying to do is reasonable, but perhaps the least verbose way to achieve cell-mean coding for the two factors would be

bf(
    y ~ eta + a + b,
    eta ~ 1 + x1 + x2 + (1|SuperFamily/Family),
    a ~ 0 + factor1,
    b ~ 0 + factor2,
    nl = TRUE)
1 Like

Thank you very much, I truly appreciate your help and this elegant solution!

Here is how my code looks now, according to your suggestion:

new_model <- brm(
  formula = bf(
    y ~ eta + a + b,
    eta ~ 1 + x1 + x2 + (1|SuperFamily/Family),
    a ~ 0 + factor1,
    b ~ 0 + factor2,
    nl = TRUE),
  data = dstan,
  family = bernoulli(link = "logit"),
  prior = prior(cauchy(0, 1), nlpar = "eta", class="sd") +     
    prior(normal(0.5, 0.25), nlpar = "eta", coef = "Intercept") +
    prior(normal(0.5, 0.25), nlpar = "eta", coef = "x1") +
    prior(normal(0.5, 0.25), nlpar = "eta", coef = "x2") +
    prior(normal(0.5, 0.25), nlpar = "a") +
    prior(normal(0.5, 0.25), nlpar = "b"),
  chains = 4, iter = 450, warmup = 150, cores = getOption("mc.cores", 4L), control = list(max_treedepth = 12, adapt_delta = 0.99), seed=1
)

P.S. By “not reasonable”, did you mean the syntax of my original code (model1) or my attempts to transform it (model1.2 etc.)? Or was is the whole idea of using the index approach instead of indicator approach?
In any case, I’d be happy to hear why do you think so.

Many thanks again!

Model2 was not reasonable as it had 2 intercepts. The whole approach is not equivalent to the original case as you have 1 more coefficient because two factors are cell mean coded in contrast to 1 intercept + 2 dummy coded factors. The latter is just different not necessarily unreasonable.

1 Like

Understood, thanks for clarifying!

P. P. S.
Dear @paul.buerkner ,
If the categorical variables (i.e., factor 1 and factor2) also have a nested structure, how should I implement this in the code you have suggested?

If I understand correctly, this is a case of independent random intercept and random slopes (reference).

Is it correct to code the random slopes as below?

bf(
    y ~ eta + a + b,
    eta ~ 1 + x1 + x2 + (1|SuperFamily/Family),
    a ~ 0 + factor1 + (0 + factor1|SuperFamily/Family),
    b ~ 0 + factor2 + (0 + factor2|SuperFamily/Family),
    nl = TRUE)

Thanks in advance!