Shrinkage of contrasts in brms

Say we have the following data: Gaussian outcome and a single nominal categorical predictor with 4 levels.

I’m interested in fitting the model in a way that shrinks all 6 pairwise mean differences to 0. Is there a way to accomplish this in brms? For example, if we used cell means coding, I’d want to set regularizing priors on:

  • beta1 - beta2
  • beta1 - beta3
  • beta1 - beta4
  • beta2 - beta3
  • beta2 - beta4
  • beta3 - beta4

This post on CV shows this can be accomplished using methods related to “fusion” of categorical levels.

Thanks in advance for any suggestions. I haven’t been able to find any literature, blogs, etc. of this in Stan or brms.

I think you can accomplish something like that by using a different encoding for the levels.

  • X1 = 1 for level 1, -1 for level 2, 0 otherwise
  • X2 = 1 for level 1, -1 for level 3, 0 otherwise

In the regression model y \sim \alpha_1 X1 + \alpha_2 X2 .... You can then put regularizing priors on the \alpha's. The problem is that you cannot include all the contrasts because they are linear combinations of each other. That also means that not all contrasts are equally regularized.

Another approach would be to use the levels as a random factor: y ~ 1 + 1 | level and fix the standard deviation to the required regularization. I am not fluent in brms to be sure that you can fix the standard deviation but I think you can. That way the deviation from the grand mean for each level would be regularized and I think that implies that all contrasts would be equally regularized as well.

1 Like

Hi Kyle -
Is the aim to control for multiple comparisons? If so, if you model it with a heirarchical structure, as Stijn pointed out, you get automatic shrinkage and thus, avoiding false positives.

Thanks for the suggestions, @stijn and @dilsherdhillon . I’ll add a bit more context. I’m running this model for hundreds of proteins and want to heavily regularize the contrasts. The idea of “fusion” as discussed in the link in my original post is that some groups may be essentially no different, so the model-estimated means should be very close or equal for those groups. This is what I’m trying to accomplish, and one way of doing it is to put regularizing priors on all contrasts. I think this can be accomplished in Stan (Gelman post) but I was wondering if it’s possible in brms.

Normally I would fit a varying intercepts model like y ~ 1 + (1 | level). The problem is that the levels have very different sample sizes, so the groups are unequally regularized. This is usually a beneficial feature of a multilevel model, but in this case, I’m interested in all pairwise contrasts, so this can actually result in some contrasts not being regularized. Simulated example below:

  • y | X \in {GroupA, GroupB, GroupC} \sim N(3.5, 1)
  • y | X \in {GroupD, GroupE} \sim N(6, 1)

Sample sizes are N=50 for Groups A, B, C, E and N=5 for Group D


sim_data <- data.frame(y = c(rnorm(50*3, 3.5, 1),
                             rnorm(50+5, 6, 1)),
                       X = rep(paste0("Group", LETTERS[1:5]),
                               c(50, 50, 50, 5, 50))) %>% 
  mutate(X = as.factor(X))

Difference in sample means of Group E and Group D is 0.605.

With a HalfNormal(0, 1) prior on the scale parameter.

sim_fit_multilevel <- sim_data %>% 
  brm(y ~ 1 + (1 | X),
      prior = set_prior("normal(0, 1)", class = "sd"),
      seed = 123,
      chains = 1,
      iter = 3000)

Estimate the difference in model-estimated means:

sim_data %>% 
  distinct(X) %>% 
  add_fitted_draws(model = sim_fit_multilevel) %>% 
  ungroup %>% 
  select(X, .draw, .value) %>% 
  spread(key=X, value=.value) %>% 
  median_qi(contrastDE = GroupE - GroupD)

Estimated difference in means of Group E and Group D is 0.702 (-0.172, 1.54), which is larger than the difference in sample means.

With a tighter HalfCauchy(0, 0.1) prior on the scale parameter.

sim_fit2_multilevel <- sim_data %>% 
  brm(y ~ 1 + (1 | X),
      prior = set_prior("cauchy(0, 0.1)", class = "sd"),
      seed = 123,
      control = list(adapt_delta = 0.9),
      chains = 1,
      iter = 3000)

Estimate the difference in model-estimated means:

sim_data %>% 
  distinct(X) %>% 
  add_fitted_draws(model = sim_fit2_multilevel) %>% 
  ungroup %>% 
  select(X, .draw, .value) %>% 
  spread(key=X, value=.value) %>% 
  median_qi(contrastDE = GroupE - GroupD)

Estimated difference in means of Group E and Group D is 0.715 (-0.143, 1.57), again larger than the difference in sample means.

Hi Kyle -
The estimated means from a multilevel model will always be slightly different from the data you’ve simulated - that’s what partial pooling aims to accomplish.

  1. The priors you have chosen are less regularizing then you might think. Especially the Cauchy prior is very counterintuitive. The sd between the group means in the data is
sd(c(rep(3.5, 3), rep(6, 2))
+ )
[1] 1.369306

The half-normal prior has a mean of

mean(abs(rnorm(1000, 0, 1)))
[1] 0.8150837

The half-cauchy prior has a mean of

mean(abs(rcauchy(1000, 0, 0.1)))
[1] 1.041561

(Actually, the mean of the cauchy is not well defined but the simulated draws give an indication of what I mean).

  1. I am less sure about this. However, I don’t think you can necessarily look at Group E and Group D on itself. The random effects model assumes that all groups are interchangeable. The rough intuition is the random effects structure will force the group means in a normal distribution and pulling them towards the grand mean. The simulated data however is bimodal (3.5 or 6). I suspect that the random effects structure will pull the groups A, B, and C closer to the groups D, E but forces some distance between A, B, and C (and between D and E) because of the implied normal distribution.

If your real data has a structure closer to the simulated data and you still want to get better regularization, you could think whether you can put more structure on the data. For instance, do the groups A, B, C belong to one family, and the groups D and E to another family. You can then work with a three-level hierarchy (family -> group -> observation).

I hope that helps.

Your suggestion regarding a three-level hierarchy is interesting. The trouble is that we don’t have any prior information on which groups should cluster together in a “family”, especially because this model is applied to hundreds of proteins and the clusters will differ by protein.

This is why I thought the “fusion” models would work well. This paper describes one way of accomplishing this is to put spike and slab priors on all contrasts. According to my understanding, you can’t use spike and slab priors in Stan, so I would like to try a product-normal and/or horseshoe prior on each contrast. I’m sure this is reasonably straightforward in Stan, but most of my experience is using brms, which was the motivation for my original question.

This is way out of my comfort zone. It would probably be good if you make a new topic where you summarise this thread. The way I see it is that you first have to make a decision whether the distribution of contrasts are expected to follow some kind of normalish distribution or if they are more like a lot of zeros and some contrasts are different from zero.

In the latter case, a horseshoe prior or similar prior is more appropriate but I think is also computationally more demanding and requires good (sized) data. If you go with the horseshoe prior, ideally you would like to go with a horseshoe like prior on the random effects. I don’t know whether that feasible theoretically and I am relatively sure it’s not going to be in brms. The other option is to define the contrasts as seperate variables but then you can’t put them all in the model (see my first response). You can then put a horseshoe prior on all the contrast variables. This should be possible in brms. The drawback of this approach is that the contrasts that are not included will have an implicit prior that is different from the prior on the contrasts that are included. Which might be fine. Say you have three groups and you use the following priors.

\beta_1 - \beta_2 \sim N(0, 1) \\ \beta_2 - \beta_3 \sim N(0, 1)

then the implicit prior for the third contrast is

\beta_1 - \beta_3 = \beta_1 - \beta_2 + \beta_2 - \beta_3 \sim N(0, \sqrt{2})

Again, this is all my opinion from very shaky first principles. You should make a new topic and you might get smarter eyeballs on your problem.