Priors for multinomial logistic regression in brms

Hi everyone,

I’m trying to run a multinomial logistic regression model in brms and am struggling to set priors. I only have 3000 data points (actually a lot of data in my field, but not for statistical modelling), so I’m worried that if I use flat priors, the estimates are going to be skewed to make the differences between categories look smaller than they actually are. As I’m still trying to get the basics down, my model at this point is very simple:

mdl <- brm(A ~ B + (1 + B | participant), data = dataset, family = categorical(), cores = 4, backend = "cmdstanr")

with A being the categorical outcome variable with 12 levels,

B being a categorical predictor variable with 2 levels,

and participant also being a categorical variable.

The random effect is intended to give both random slopes and intercepts; the backend specification makes the model run a little bit faster (sidenote: I feel like my model is still running pretty slowly considering the size of my data set and the simplicity of the model, but that’s an issue for another post perhaps.)

As I said, I would like to add somewhat informative priors to the model. There is no previous research I can lean on for priors (I’m working on a rather specific topic within a small field) and my expectations are rather vague (of the kind ‘this category of A likely appears more often under condition 1 of B than under condition 2 of B’). Since prior specification needs actual numbers and a distribution, I’m not really sure how to even approach that with a categorical outcome variable. I also want to eventually build the model up to include more predictor variables (mostly categorical) and random effects (all categorical, one ordered).
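For concreteness, this is as far as I’ve gotten with the brms prior interface. The normal(0, 3) scale is a pure placeholder, and the "mulevel2" name just stands in for whatever brms calls the linear predictor of one of A’s levels:

```r
library(brms)

# Ask brms which priors the model expects. With family = categorical(),
# there is one linear predictor per non-reference level of A, with
# dpar names derived from the factor levels (e.g. "mulevel2").
get_prior(A ~ B + (1 + B | participant),
          data = dataset, family = categorical())

# Weakly informative priors on the log-odds scale, set per linear
# predictor; repeat for each dpar reported by get_prior().
priors <- c(
  set_prior("normal(0, 3)", class = "b", dpar = "mulevel2"),
  set_prior("normal(0, 3)", class = "Intercept", dpar = "mulevel2")
)
```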

Any advice or pointers towards other studies dealing with prior specification for categorical outcome variables would be much appreciated!

Operating system: macOS Ventura 13.6.1
R Version: 4.2.1 (note: I’m aware that this is not the most recent R version, but brms was not working with 4.4.1, so I reverted to the last version I successfully used brms with.)
RStudio Version: 2024.04.2+764
brms Version: 2.21.0
cmdstanr Version: 0.8.1 (CmdStan: 2.35.0)


What do you mean by “skewed” here? If you have truly flat priors (versus, say, a gamma(0.001, 0.001) or normal(0, 10000)), then everything is driven purely by the likelihood.

I’m not sure what kind of information you have, but it sounds like it’s about the magnitude of coefficients for some of your covariates. That is, one level of B is expected to have a larger coefficient than another for a particular outcome. That’s usually easy to specify directly, but I don’t know how to do it in brms.

There’s a short discussion in the User’s Guide, but it’s mainly aimed at identification. The traditional approach is to fix one category’s linear predictor to 0 before the softmax. You can also identify the model by overparameterizing and adding a prior, which makes it easier to apply symmetric priors.
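To make the reference-category identification concrete, here’s a toy sketch (the eta values are invented):

```r
# Softmax over linear predictors with the reference category pinned at 0:
# only K - 1 of the K predictors are free, which identifies the likelihood.
softmax <- function(eta) exp(eta) / sum(exp(eta))

eta <- c(0, 1.2, -0.5)   # first entry is the fixed reference category
probs <- softmax(eta)
sum(probs)               # probabilities always sum to 1
```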

Usually, people use shrinkage priors for this kind of model to pull the coefficients toward zero.
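For instance (just a sketch; the scales and the horseshoe hyperparameter would need thought, and "mulevel2" is a placeholder for one of the dpar names in your model):

```r
library(brms)

# Two common shrinkage choices in brms syntax: a tight normal prior,
# or the (regularized) horseshoe on the population-level coefficients.
normal_shrink    <- set_prior("normal(0, 1)", class = "b", dpar = "mulevel2")
horseshoe_shrink <- set_prior("horseshoe(1)", class = "b", dpar = "mulevel2")
```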


Hi, thanks for your answer!

My understanding (mainly based on Nicenboim, B., & Vasishth, S. (2016). Statistical methods for linguistic research: Foundational Ideas—Part II. Language and Linguistics Compass, 10(11), 591–613. https://doi.org/10.1111/lnc3.12207) is that since posteriors are means between priors and likelihood, using flat priors when there is not a ton of data pulls all of the posteriors towards the same mean and thus may make effects appear smaller than they actually are (p. 8). Does that not apply here?

Yes, that sounds right! I would like to try weakly informative priors in regard to magnitude and sign of the effect. If you have a non-brms-specific source where someone uses these kinds of priors for a categorical outcome, I would also appreciate that. My background is in frequentist modelling, so my knowledge regarding priors is lacking in general. And if I have some idea of what the numbers should look like, I may be able to translate that into brms :)

Based on the results I get from the model (with flat priors), I believe brms already does this anyway. The terminology is different than I’m used to, but you mean that a baseline is set that all other coefficients are then calculated in relation to, right? That’s already what my results look like.

As for the overparameterizing, I’ll keep it in mind if I run into problems with non-identifiability, but I don’t think that helps me with figuring priors at this point (unless I misunderstood what you meant).

I’ll look into shrinkage priors, thank you!

This is a reasonable way to think about penalized (regularized) maximum likelihood estimators like ridge or lasso, or empirical Bayes (which, contrary to its name, is a point estimation technique for the prior).

In Bayesian stats, it’s more helpful to think in terms of the prior density, likelihood, and posterior density, where the posterior is proportional to the product of the prior and likelihood.

It’s actually the other way around. A prior is going to pull the posterior density toward the prior density; an (improper) flat prior exerts no pull at all. It’s also important to keep in mind that priors come with parameterizations and scales. A prior that’s uniform on the probability scale is not uniform on the log odds (it’s logistic). I would suggest actually trying this and looking at the difference in the posterior: try increasingly strong priors, starting from improper uniform priors, and see what they do to the posterior (this is known as a “prior sensitivity analysis”).
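Concretely, a sensitivity analysis could be sketched like this in brms (the scales are arbitrary, and whether class = "b" without a dpar propagates to every linear predictor of the categorical family is worth checking with get_prior()):

```r
library(brms)

# Refit the same model under increasingly strong priors on the
# coefficients and compare the posterior summaries across fits.
scales <- c(10, 5, 2.5, 1)
fits <- lapply(scales, function(s) {
  pr <- set_prior(sprintf("normal(0, %g)", s), class = "b")
  brm(A ~ B + (1 + B | participant), data = dataset,
      family = categorical(), prior = pr,
      cores = 4, backend = "cmdstanr")
})

# Compare population-level estimates across prior scales
lapply(fits, fixef)
```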

Statisticians like Aitchison’s approach using the isometric log ratio transform. But otherwise, the softmax operation is just the multi-logit link you see in traditional frequentist approaches to multinomial logit.

Identifiability of the posterior (not the likelihood) is intrinsically tied up with priors. For instance, the ridge is a prior that can help identify a model that’s not identifiable in the likelihood. For example, if the likelihood for scalar observations y_n is p(y_n \mid \mu_1, \mu_2, \sigma) = \text{normal}(y_n \mid \mu_1 + \mu_2, \sigma), then \mu_1 and \mu_2 are not individually identified, only their sum \mu_1 + \mu_2 is. An improper uniform prior over \mu_1 and \mu_2 will produce an improper posterior, whereas a normal prior will identify the posterior in the sense of making it a proper density.


A prior sensitivity analysis does seem like the way to go. I attempted one before but ran into issues with the older R version I was using and the compatibility of the required packages. However, your reply inspired me to try again and it seems that the newest R version and newest brms version are now getting along, so I am hopeful that I will be able to run a prior sensitivity analysis. Thank you again!