Problem in multilevel (hierarchical) multinomial logistic regression (with brms)

I have posted a teaser of the issue at my blog, copied below.

The data

The predicted variable is a categorical response, named resp with levels ‘1’, ‘2’, ‘3’, and ‘4’ (nominal labels, not numerical values). The predictor is a categorical/nominal variable named group, with levels ‘A’ through ‘K’.

Notice these aspects of the data:

  • The proportion of response ‘2’ equals the proportion of response ‘3’ within each group and across all groups. Specifically, in every group, p(‘2’) = p(‘3’) = 0.25.

  • The proportions of responses ‘1’ and ‘4’ are symmetric reflections of each other, with p(‘1’|A) = p(‘4’|K), p(‘1’|B) = p(‘4’|J), and so on.

Multilevel (hierarchical) multinomial logistic regression in brms

Because of the multiple groups, this is a natural setting for a hierarchical model that shares information across groups and provides shrinkage. Because of the symmetry in the data, the hierarchical model should symmetrically shrink the response ‘1’ proportions closer to 0.25, and do the same for the response ‘4’ proportions.

I call brm() with the usual hierarchical formula = resp | trials(n) ~ 1 + (1 | group).
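For concreteness, a minimal sketch of that call (assuming the count matrix resp and trial totals n are in a data frame nomnom, as constructed in the code later in this thread; with the default family settings the first response category is the reference):

library(brms)
# Default multinomial family: category '1' is the reference, so its linear
# predictor is fixed at 0 while '2'..'4' get their own hierarchical intercepts.
fit_default <- brm(
  formula = resp | trials(n) ~ 1 + (1 | group),
  family  = multinomial(link = logit),
  data    = nomnom
)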

We can get the posterior predictions and make a plot:

Notice these aspects of the posterior predictions:

  • Contrary to the data, the proportion of response ‘2’ is not the same across groups, and the proportion of response ‘3’ is not the same across groups.

  • Contrary to the data, within every group, p(‘1’) = p(‘2’) = p(‘3’).

  • Contrary to the data, the proportion of response ‘1’ does not symmetrically mirror response ‘4’.

In the full document linked below, I explain why this happens and I propose a solution. Has this problem been pointed out before? Has this solution been proposed before?

To see the full description of the issue, download the HTML file from the link below, then open the downloaded file in your browser:

https://drive.google.com/uc?export=download&id=1z_hGTzkkIlMJ0Tk2ONCH96bZh10l0gMr


My impression is that this is an example of a more general issue - that in the standard formulation of multinomial logistic regression the prior is not invariant to reordering of the categories. You get a “correct” answer with the fixed effects and a flat prior because the flat prior is symmetric - I would expect that using any other prior would lead to a similar (though possibly less pronounced) problem. Note that the hierarchical structure can be thought of as part of the prior here. This is IMHO relatively well known.

Adding a dummy category is a nice (and simple!) hack, but there are other ways to force the invariance. One would be to parametrize the categories with a sum-to-zero vector instead of a reference category with all-zero coefficients - this still has just K - 1 free dimensions, but we can now choose parametrizations that are invariant to reordering (e.g. like the sum-to-zero vector transform described in the Stan Constraint Transforms chapter). This would however be much harder to get working with brms (presumably one could do it with a suitably hacky custom family).
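A minimal sketch of the sum-to-zero idea in plain R (illustrative only - this Helmert basis is not the isometric one Stan’s sum_to_zero_vector uses, but it shows how K - 1 free parameters yield K coefficients that sum to zero):

# Map K-1 free parameters to K coefficients summing to zero, then softmax:
sum_to_zero_softmax <- function(beta_free) {
  K <- length(beta_free) + 1
  B <- contr.helmert(K)              # K x (K-1); each column sums to zero
  eta <- as.vector(B %*% beta_free)  # K coefficients with sum(eta) == 0
  exp(eta) / sum(exp(eta))           # softmax -> K probabilities
}
sum_to_zero_softmax(c(0.5, -0.2, 0.1))  # probabilities for K = 4 categories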

There also appears to be some literature on “symmetric multinomial logistic regression”, which appears to be closely related to Aitchison geometry and the centered/additive log-ratio transforms - the centered log-ratio is in turn how new versions of Stan implement the simplex transform. Once again, this is a way to obtain K probabilities from K - 1 linear predictors that is invariant to reordering.
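As a quick illustration of that connection (a sketch; clr is the centered log-ratio):

clr     <- function(p) log(p) - mean(log(p))  # K values that sum to zero
inv_clr <- function(z) exp(z) / sum(exp(z))   # softmax inverts it
p <- c(0.40, 0.25, 0.25, 0.10)
z <- clr(p)
sum(z)                    # ~ 0
all.equal(inv_clr(z), p)  # TRUE: the probabilities are recovered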

The glmnet package goes one step further: it estimates all K coefficient vectors and relies on regularization to avoid the non-identifiability (see the glmnet paper). If I understand it correctly, this is not unlike random intercepts (which are also identified only thanks to regularization).
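A toy example of that parametrization (simulated data, purely illustrative):

library(glmnet)
set.seed(47)
x <- matrix(rnorm(200 * 3), nrow = 200)  # toy predictors
y <- factor(sample(as.character(1:4), 200, replace = TRUE))
fit <- glmnet(x, y, family = "multinomial")
# coef() returns one coefficient vector per category - all K are kept,
# identified only through the penalty:
str(coef(fit, s = 0.1), max.level = 1)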


You can set multinomial(refcat = NA) to get a model for each category separately (though I probably wouldn’t run this with completely flat priors):

library(tidyverse)
library(brms)
library(tidybayes)
library(viridisLite)
bar_fill_colors <- viridis(n = 5)[c(1,3,4,2,5)]


# Matrix of response counts (after the transpose: rows = groups A..K,
# columns = response categories '1'..'4'; row sums are 24, so
# p('2') = p('3') = 6/24 = 0.25 in every group):
resp <- t( matrix(
  c( 11:1 ,
     rep(6,11),
     rep(6,11),
     1:11 ) , nrow = 4, byrow = TRUE,
  dimnames = list( resp = as.character(1:4) ,
                   group = LETTERS[1:11] ) ) )
# Assemble into data frame for brms:
nomnom <- tibble( 
  group = rownames( resp ) ,
  resp = resp ,
  n = rowSums( resp ) )

brm_nomnom_hier <-
  brm( data = nomnom ,
       formula = resp | trials(n) ~ 1 + (1 | group) ,
       family = multinomial(link = logit, refcat = NA) ,
       warmup = 200, iter = 200 + 2500*2, thin=2, chains = 4, cores = 4, 
       backend = "cmdstanr",
       seed = 47 )

newdata_nomnom <- 
  nomnom |> select( group ) |> mutate( n = 1 ) 

posterior_epred(brm_nomnom_hier, newdata = newdata_nomnom) |> 
  posterior::rvar()
#> rvar<10000>[11,4] mean ± sd:
#>       1             2             3             4            
#>  [1,] 0.38 ± 0.086  0.25 ± 0.055  0.25 ± 0.056  0.13 ± 0.058 
#>  [2,] 0.35 ± 0.081  0.25 ± 0.055  0.25 ± 0.054  0.15 ± 0.059 
#>  [3,] 0.32 ± 0.078  0.25 ± 0.052  0.25 ± 0.053  0.18 ± 0.061 
#>  [4,] 0.30 ± 0.074  0.25 ± 0.050  0.25 ± 0.051  0.20 ± 0.061 
#>  [5,] 0.27 ± 0.071  0.25 ± 0.050  0.25 ± 0.050  0.22 ± 0.064 
#>  [6,] 0.25 ± 0.067  0.25 ± 0.050  0.25 ± 0.050  0.25 ± 0.067 
#>  [7,] 0.22 ± 0.064  0.25 ± 0.051  0.25 ± 0.051  0.27 ± 0.070 
#>  [8,] 0.20 ± 0.062  0.25 ± 0.052  0.25 ± 0.051  0.30 ± 0.074 
#>  [9,] 0.18 ± 0.060  0.25 ± 0.053  0.25 ± 0.052  0.32 ± 0.076 
#> [10,] 0.15 ± 0.060  0.25 ± 0.055  0.25 ± 0.053  0.35 ± 0.082 
#> [11,] 0.13 ± 0.058  0.25 ± 0.057  0.25 ± 0.056  0.37 ± 0.087

brm_nomnom_hier_epred <- 
  epred_draws( object = brm_nomnom_hier , 
               newdata = newdata_nomnom , seed = 47 )
# plot:
brm_nomnom_hier_epred |> 
  group_by( group, .category ) |>
  summarize( pp = mean( .epred ) ) |>
  rename( resp = .category ) |>
  mutate( model = "Hierarchical" ) |>
  ggplot( mapping = aes( x = group , 
                         fill = resp ,
                         weight = pp ) ) +
  geom_bar( position = "fill" , color = "white", width = 0.70  ) +
  scale_fill_manual( values = bar_fill_colors ) +
  facet_wrap( ~ model ) +
  labs( x = "Group" , y = "Proportion of Responses" ) 


Very interesting

For posterity, see also this comment from @g.frischkorn regarding his approach in the {bmm} package: @gidon-frischkorn.bsky.social on Bluesky


Thanks, @mattansb, for the detailed reply involving multinomial(refcat = NA). Very helpful.

All: Follow-up questions…

When refcat = NA, the underlying model is unidentifiable, and I’m surprised that a modestly constraining prior gives useful results. I’m wondering: is it generally true that a modestly constraining prior works, or is this just a lucky example? If this approach usually provides useful results, why not make it the default approach?

Is my proposed solution, using a dummy reference category of all 1’s (which yields an identifiable model, allows flat priors, and treats all real-data categories equivalently), viable for a wide range of applications, or do people foresee it introducing problems or distortions?

In any case, it seems the default in brms, for these sorts of applications, should be cautioned against, even if it’s a default inherited from maximum likelihood estimation?

I’m not sure - I personally tend to use relatively informed priors, so I haven’t come up against any difficulties with such models (not that I use them often…).
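For example, something along these lines (an illustrative sketch - the normal(0, 1.5) scale is just an assumption; the dpar names follow the mu1…mu4 convention visible in the generated Stan code below):

priors <- c(
  set_prior("normal(0, 1.5)", class = "Intercept", dpar = "mu1"),
  set_prior("normal(0, 1.5)", class = "Intercept", dpar = "mu2"),
  set_prior("normal(0, 1.5)", class = "Intercept", dpar = "mu3"),
  set_prior("normal(0, 1.5)", class = "Intercept", dpar = "mu4")
)
# then pass prior = priors to the brm() call above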

My guess would be because of how multinomial models are parameterized in other packages/software. But maybe Paul has another answer.

Introducing a “dummy” class is the same as setting refcat = NA - you can see with make_stancode() that the only difference is:

// for dummy ref level
for (n in 1:N) {
  mu[n] = transpose([0, mu1[n], mu2[n], mu3[n], mu4[n]]);
}
// for refcat = NA
for (n in 1:N) {
  mu[n] = transpose([mu1[n], mu2[n], mu3[n], mu4[n]]);
}

(Everything else in the code is exactly the same - the priors, the likelihood…)

The only difference is that you’re adding a little regularization into the model by setting a count of N = 1 for the dummy reference class (how would this work with non-aggregated / per-observation data? By adding rows to the data?).
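For reference, a sketch of how the two variants can be set up and compared (reusing nomnom from the code above; note the trial totals must be bumped to cover the dummy count):

# refcat = NA version:
make_stancode(resp | trials(n) ~ 1 + (1 | group),
              data = nomnom,
              family = multinomial(refcat = NA))
# dummy-category version: a fifth response column with count 1 in every group
nomnom_dummy <- nomnom |>
  mutate(resp = cbind(dummy = 1, resp), n = n + 1)
make_stancode(resp | trials(n) ~ 1 + (1 | group),
              data = nomnom_dummy,
              family = multinomial(refcat = "dummy"))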


What would be the correct application of the dummy reference solution in the case of unequal trials? Or is that irrelevant?

Hi John,

You mention that the response proportions are exactly symmetric: p(‘2’) = p(‘3’) = 0.25 in every group, with the proportions of ‘1’ and ‘4’ mirroring each other.

I’m curious: was this result just a coincidence for this particular data collection, or is the pattern inherent in your data-generating mechanism? If it is inherent, then I’m not sure there is any randomness in your data, which would mean there is nothing to model. So maybe the better question is: what is your goal in modeling here?

Thanks again to @mattansb for the detailed replies. Thanks also to the other responders in this thread.

I’ve now worked through a variety of related examples (not posted), and have come to this conclusion:

When the predicted variable is categorical (named resp in a matrix of counts) and the predictor is categorical (named group), the most robust[1] approach in brms is

  • formula = resp | trials(n) ~ 0 + group for the non-hierarchical version (suppressing the intercept with ~ 0 + makes it easier to specify the prior; this was not featured in any of the examples above),

  • formula = resp | trials(n) ~ 1 + (1 | group) for the hierarchical version,

both with family = multinomial(link = logit, refcat = NA) (yes, that’s NA for refcat) and a (moderately) informed prior. Check that the prior is really doing what you intend.
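For example, the non-hierarchical version might look like this (a sketch only; the normal(0, 2) prior scale is an illustrative assumption, not a recommendation):

brm_nomnom_fixed <- brm(
  data = nomnom,
  formula = resp | trials(n) ~ 0 + group,
  family = multinomial(link = logit, refcat = NA),
  prior = c(set_prior("normal(0, 2)", class = "b", dpar = "mu1"),
            set_prior("normal(0, 2)", class = "b", dpar = "mu2"),
            set_prior("normal(0, 2)", class = "b", dpar = "mu3"),
            set_prior("normal(0, 2)", class = "b", dpar = "mu4"))
)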

The approach I proposed, using a dummy reference category, also seems to work pretty well! But it introduces subtle, albeit symmetric, effects that have not yet been thoroughly characterized (@mattansb referred to it as a form of regularization).

Thanks again to all!


  1. By “robust” I mean an approach that does not introduce asymmetries across response categories, is mathematically clear, and has reasonable execution time.


@eyanchenko The data were specifically contrived to be symmetric so the asymmetry introduced by the model would be obvious.

@amynang The dummy reference category is the same (uniform, with count 1 in every group) regardless of the real data, so unequal trial totals don’t change how it is applied.