How to find theoretical boundaries for beta parameters in a multinomial model?

Hello everyone,

I hope my question has not been asked and answered in any previous posts (I haven’t found it, so hopefully this won’t be redundant).

My question is about the estimation/interpretation of the beta parameters of multinomial logistic regressions, in particular because I found (see below) that they do not change as much as I expected between two contrasting fake trends I made up. But first, a bit of context to explain the data and the problem.

Briefly, I am investigating comments made by observers on insect pictures collected within the framework of the citizen science project Spipoll (English description here).

I would like to use noninformative, weak priors following the definition in Gelman and Hill (2007, p. 355): “for a prior distribution to be noninformative, its range of uncertainty should be clearly wider than the range of reasonable values of the parameters.” This approach was also suggested to me by email by Solomon Kurz and in a paper by Lemoine (2019).

However, I have trouble finding the theoretical boundaries of the beta parameters in my categorical model, so I don’t know what the “range of reasonable values of the parameters” could be.
To find out, I modified my dataset into two contrasting versions that include more or less extreme trends (stronger than in the real data), fitted a model on each of these two fake datasets, and looked at how strong the beta parameters can get.

The response variable of the model is the category of comments (seven levels: Recherche, Connaissance, Cycle, Contexte, Descriptif, Perspective, Curiosite), with the Research category (Recherche) set as the reference level.
The only fixed effect is the year each comment was posted (from 2010 to 2017, coded as a continuous variable from 1 to 8).
[Note that the final model will also include a random effect for the identification code of the participant, since some participants contributed multiple comments (i.e. a dependence in the data that needs to be accounted for).]
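(In brms syntax, that final model should look something like the sketch below, where participant_id and full_data are placeholders for the actual identifier column and dataset:)

# Planned final model: varying intercept per participant (placeholder names).
fit_final <- brm(classe ~ year18 + (1 | participant_id),
                 family = categorical, data = full_data)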

In fake data #1: the Research category makes up 4% of the comments in 2010 and gradually rises to 95% in 2017. See the 1st figure (the Recherche category is on top, in blue-green).

In fake data #2: the Research category also starts at 4% and ends at 95%; however, the Knowledge category (“Connaissance”, in orange) starts at 93% and ends at <1%. Thus, I expect the effect of year on “Connaissance” to be the strongest here: the Connaissance category is emptied year after year to the sole benefit of Recherche comments. See the 2nd figure.
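
For illustration, here is a minimal sketch of how such a fake dataset could be simulated (this is not the exact procedure I used; the number of comments per year, the endpoint proportions and the interpolation scheme are assumptions):

# Hypothetical simulation of fake data #2: Recherche grows from ~4% to ~95%,
# Connaissance shrinks from ~93% to <1%, the remaining categories share what is left.
set.seed(1)
categories <- c("Recherche", "Connaissance", "Cycle", "Contexte",
                "Descriptif", "Perspective", "Curiosite")
n_per_year <- 500  # assumed number of comments per year

fake2_sim <- do.call(rbind, lapply(1:8, function(year18) {
  p_rech <- 0.04 + (0.95 - 0.04) * (year18 - 1) / 7
  p_conn <- 0.93 + (0.005 - 0.93) * (year18 - 1) / 7
  p_rest <- rep((1 - p_rech - p_conn) / 5, 5)  # equal share for the other 5 categories
  data.frame(year18 = year18,
             classe = sample(categories, n_per_year, replace = TRUE,
                             prob = c(p_rech, p_conn, p_rest)))
}))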

However, here are the estimates obtained by the two models fitted to these two datasets:
Results with fake #1 dataset:

                           Estimate  Est.Error        Q2.5       Q97.5
muConnaissance_Intercept  0.7351239 0.25745051   0.2319605   1.2389316
muCycle_Intercept         0.2217930 0.26759794  -0.3095479   0.7412367
muContexte_Intercept      1.1219188 0.19293080   0.7495230   1.5029422
muDescriptif_Intercept    1.4324363 0.22900320   0.9826459   1.8892241
muPerspective_Intercept   1.5779264 0.20379257   1.1778803   1.9745000
muCuriosite_Intercept     1.3196273 0.19046166   0.9520178   1.6968824
muConnaissance_year18    -0.6842431 0.06460922  -0.8138866  -0.5602087
muCycle_year18           -0.5630291 0.06321881  -0.6877448  -0.4404182
muContexte_year18        -0.5594056 0.04349774  -0.6474972  -0.4767574
muDescriptif_year18      -0.7824418 0.05981112  -0.9013440  -0.6674225
muPerspective_year18     -0.7227373 0.05059039  -0.8229390  -0.6235051
muCuriosite_year18       -0.6070640 0.04416343  -0.6951696  -0.5224079

Results with fake #2 dataset:

                           Estimate  Est.Error        Q2.5        Q97.5
muConnaissance_Intercept  3.0045764 0.14385464   2.7221784   3.28827696
muCycle_Intercept        -3.1001033 0.73311313  -4.6282497  -1.75276506
muContexte_Intercept     -2.3180013 0.52993671  -3.4016725  -1.32090800
muDescriptif_Intercept   -3.6003164 0.81856338  -5.3035236  -2.11048010
muPerspective_Intercept  -2.2530296 0.58229061  -3.4710633  -1.15387281
muCuriosite_Intercept    -2.3287323 0.51415007  -3.3665910  -1.36098108
muConnaissance_year18    -0.6978025 0.03133402  -0.7591399  -0.63567196
muCycle_year18           -0.2595350 0.14854906  -0.5579470   0.02516362
muContexte_year18        -0.2836348 0.10936605  -0.5028316  -0.07380743
muDescriptif_year18      -0.1699583 0.15681366  -0.4779084   0.14062106
muPerspective_year18     -0.3451949 0.12414934  -0.5854095  -0.10307424
muCuriosite_year18       -0.2578751 0.10380112  -0.4635130  -0.05404364

As you can see, both estimates for the effect of year on the Connaissance category (see muConnaissance_year18) are very similar.

So, finally :-), my question is:
why are the two estimates so similar despite the strikingly different trends?

Thanks for reading this far; I hope it is clear enough. I’m looking forward to hearing any thoughts on this.
Best,
Nicolas D.


See below for the code of the models:

library(brms)

fit_fake1 <- brm(classe ~ year18, family = categorical, iter = 8000, chains = 3,
                 data = fake1, prior = kurz_vague_priors)

fit_fake2 <- brm(classe ~ year18, family = categorical, iter = 8000, chains = 3,
                 data = fake2, prior = kurz_vague_priors)

with

kurz_vague_priors <- c(
    prior(normal(0, 20), class = b, dpar = muConnaissance),
    prior(normal(0, 20), class = b, dpar = muContexte),
    prior(normal(0, 20), class = b, dpar = muCuriosite),
    prior(normal(0, 20), class = b, dpar = muCycle),
    prior(normal(0, 20), class = b, dpar = muDescriptif),
    prior(normal(0, 20), class = b, dpar = muPerspective)
)

---- References used above:
Gelman, A., Hill, J., 2007. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511790942
Lemoine, N.P., 2019. Moving beyond noninformative priors: why and how to choose weakly informative priors in Bayesian analyses. Oikos 128, 912–928. https://doi.org/10.1111/oik.05985

The categorical distribution in Stan uses the softmax link: P(Y = y \mid X = x) = \frac{\exp(\alpha_y + \beta_y \times x)}{\sum_{i = 1}^{Y} \exp(\alpha_i + \beta_i \times x)}. The first category is treated as the reference, with \alpha_1 and \beta_1 constrained to 0. Consequently, the \beta_y terms can be interpreted in the same way as in a logistic regression model with only two groups. You can exponentiate the \beta_y terms to get odds ratios relative to the reference category.
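For instance, here is a small numerical illustration of that interpretation (the coefficients below are made up, not taken from your models):

# Softmax link for a 3-category example with made-up coefficients;
# category 1 is the reference, so alpha_1 = beta_1 = 0.
alpha <- c(0, 0.5, -1.0)
beta  <- c(0, -0.7, 0.3)

softmax <- function(z) exp(z) / sum(exp(z))

x  <- 2
p0 <- softmax(alpha + beta * x)        # P(Y = y | X = x) for each category
p1 <- softmax(alpha + beta * (x + 1))  # same, one unit of x later

# Odds of category 2 vs the reference change by a factor of exp(beta[2]):
(p1[2] / p1[1]) / (p0[2] / p0[1])      # equals exp(-0.7) ~ 0.50
exp(beta[2])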

As for your particular observation about the similar \beta terms, I suggest you look at the predictions from the two models and see how they compare to your observed data. I copied your parameter point estimates and generated predicted-data plots, and they look very similar! The main difference is that Connaissance is predicted to be much smaller in 2010 for the first fake dataset. That tells me that the proportional change in odds year-on-year for Connaissance vs. Recherche is relatively constant in both cases; they just have very different baseline starting odds.

And, of course, if you can share your fake data, that might help as well.

library(dplyr)
library(tidyr)
library(ggplot2)

group_names <- c('Recherche', 'Connaissance', 'Cycle', 'Contexte', 'Descriptif', 'Perspective', 'Curiosite')

fake1 <- data.frame(
  group = group_names,
  intercept = c(0, 0.735, 0.221, 1.122, 1.432, 1.577, 1.32),
  slope = c(0, -0.684, -0.563, -0.559, -0.782, -0.722, -0.608)
) %>%
  mutate(dataset = 1)

fake2 <- data.frame(
  group = group_names,
  intercept = c(0, 3.00, -3.10, -2.32, -3.60, -2.25, -2.33),
  slope = c(0, -0.698, -0.260, -0.284, -0.170, -0.345, -0.258)
) %>%
  mutate(dataset = 2)


# Expand to one row per group-year combination and compute the linear predictor
fake_df <- bind_rows(fake1, fake2) %>%
  mutate(group = ordered(group, levels = group_names)) %>%
  mutate(year = list(2010:2017)) %>%
  unnest(year) %>%
  mutate(year18 = year - 2009) %>%
  mutate(logit = intercept + slope * year18)

# Replicating bar plots
fake_df %>%
  group_by(dataset, year) %>%
  mutate(p = exp(logit) / sum(exp(logit))) %>%
  ggplot(aes(year, p, group = group, fill = group)) +
    geom_bar(stat = 'identity') +
    scale_fill_brewer(palette = 'Set1') +
    theme_bw() +
    theme(panel.grid = element_blank()) +
    facet_wrap(~ dataset)

# Year-on-year odds ratio, Recherche vs Connaissance
# should be flat by design, but may not be in the data!
fake_df %>%
  filter(group %in% group_names[1:2]) %>%
  group_by(dataset, year) %>%
  mutate(p = exp(logit) / sum(exp(logit))) %>%
  ungroup() %>%
  mutate(odds = p / (1 - p)) %>%
  arrange(dataset, group, year) %>%
  group_by(dataset, group) %>%
  mutate(odds_change = odds / lag(odds)) %>%
  ggplot(aes(year, odds_change, group = group, colour = group)) +
    geom_line() +
    scale_colour_brewer(palette = 'Set1') +
    theme_bw() +
    theme(panel.grid = element_blank()) +
    facet_wrap(~ dataset)

Just be careful. @andrewgelman likes to make up his own definitions for concepts like “identifiability” and “uninformative”. As Wikipedia says in its section on “uninformative priors”,

“The term “uninformative prior” is somewhat of a misnomer.”

@andrewgelman has written about this extensively in the context of diffuse (inverse) gamma priors for (covariance) precision parameters in hierarchical models.

By “beta parameters of multinomial logistic regressions”, do you mean estimating the coefficients or sampling from their posteriors? And what do you mean by “theoretical boundaries of beta parameters”? And what do you mean by “strong beta parameters”?

I’m afraid I don’t know brms, so I don’t understand that part of what’s going on here. Hopefully someone who knows the brms API can help more. I didn’t understand the model and data well enough from the brms output to say much; it looks like you’re getting wildly different estimates for the two models’ coefficients, with a few that are similar. What you can do is look at the residuals of the fit without those predictors and see how similar they are; that’s what’s going to determine the values of the coefficients.

Thank you @simonbrauer for your answer.

First, I now attach my two fake datasets (should have done that right from the start, sorry :-) )
fake1.txt (33.8 KB)
fake2.txt (36.5 KB)

The link from the \beta_y to odds ratios was not obvious to me in the case of a multinomial model, so thanks for the clarification.
I am not too familiar with odds ratios, but I gather that they cannot be below 0 and have no upper bound in theory (the odds of getting drunk after drinking 1 L of whisky compared with 1 L of tap water are probably in the 10,000s or more?).
But in a multinomial context such as my dataset, I wonder how high an odds ratio, and the \beta_y in particular, could get in theory. What sort of fake dataset would lead to odds ratios close to 0, or higher than, say, 5? Is that possible?
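To make sure I understand the mapping, here is how I picture the correspondence between \beta_y and the odds ratio (purely illustrative values, not model output):

# beta_y is the log of the odds ratio vs the reference category (Recherche),
# so exponentiating gives the odds ratio.
beta <- c(-5, -1.61, -0.7, 0, 0.7, 1.61, 5)
data.frame(beta = beta, odds_ratio = round(exp(beta), 3))
# e.g. beta = -5   -> OR ~ 0.007
#      beta = 1.61 -> OR ~ 5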

Also, it was actually a great idea to plot the model predictions and compare them to my fake datasets. Indeed, the predictions are quite good (except for that 1st year in the fake1 data).

Thanks @Bob_Carpenter for your answers.

I mean the estimated coefficients for the effects of year18 on the different categories (the Estimate of year18, i.e. not the intercepts, in the tables of my original post).

Sorry if that wasn’t clear enough. @simonbrauer partially answered it by recalling that exponentiating the \beta_y gives odds ratios. So I meant: what is the theoretical range of values for the \beta_y in a multinomial model? (For example, if the \beta_y were themselves odds ratios, they could not be < 0; but that is not the case and, in theory, there seems to be no limit on the \beta_y.)
Now, in practice (and for defining my priors), I wonder what values they could reach given a dataset similar to mine (a multinomial context, with a response variable that is a 7-level categorical variable).
By “strong”, I mean values of \beta_y that would translate into odds ratios close to 0 (e.g. \beta_y = -5) or around 5 (i.e. \beta_y = 1.61).
I wonder what type of trend in a dataset similar to mine could lead to such \beta_y: what fake data should I simulate?
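As a rough check, here is a back-of-the-envelope sketch (assuming a perfectly log-linear trend from the first to the last coded year) of the slope implied by a category's start and end shares relative to Recherche:

# Slope implied by start/end proportions of a category vs the reference,
# assuming the log odds change linearly over year18 = 1..8.
implied_slope <- function(p_cat_start, p_ref_start, p_cat_end, p_ref_end, n_years = 8) {
  (log(p_cat_end / p_ref_end) - log(p_cat_start / p_ref_start)) / (n_years - 1)
}

# Fake data #2: Connaissance 93% -> ~1% while Recherche 4% -> 95%
implied_slope(0.93, 0.04, 0.01, 0.95)  # about -1.1
# Even this extreme swap only implies a slope of about -1.1 per year;
# beta_y = -5 would require the odds vs Recherche to shrink ~150-fold every single year.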

In terms of priors, then, using N(0, 2) might be a little too narrow, because \beta_y values of -5 (meaning an OR of about 0.007) would be very “unlikely” under it. Perhaps using N(0, 5) would be better.
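A quick prior-predictive check of that intuition (just drawing from the two candidate priors and looking at what they imply; nothing here is specific to the model):

# Draws from the two candidate priors and the implied odds-ratio scale.
set.seed(123)
beta_narrow <- rnorm(1e5, 0, 2)
beta_wide   <- rnorm(1e5, 0, 5)

# How plausible is |beta| > 5 (i.e. OR < ~0.007 or OR > ~148) under each prior?
mean(abs(beta_narrow) > 5)   # ~0.01 under N(0, 2)
mean(abs(beta_wide) > 5)     # ~0.32 under N(0, 5)

# Implied central 95% intervals on the odds-ratio scale
quantile(exp(beta_narrow), c(0.025, 0.975))
quantile(exp(beta_wide), c(0.025, 0.975))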