Priors and the reference level in multinomial models

Please also provide the following information in addition to your question:

  • Operating System: Win 10
  • brms Version: 2.1

Hi there,

I’m fitting a multinomial model (based on prior helpful advice). We have a set number of measurements (157) on subjects, with each measurement placing the subject in one of 10 states. Right now I’m using aggregated count data, much like the multinomial example from Paul here. I may eventually model each timepoint with a subject-level effect, but I’m starting somewhat simpler.

I’ve attached a snippet of the data for reference, in case there is some important difference between it and the dummy data.

Let’s make some dummy data as per Paul’s post, but set up to have different reference levels:

N <- 15
dat <- data.frame(
  y1 = rbinom(N, 50, 0.9), y2 = rbinom(N, 150, 0.5), 
  y3 = rbinom(N, 150, 0.2), x = rnorm(N)
)

dat$size <- with(dat, y1 + y2 + y3)  # total trials per row
# same counts bound in different column orders, so each response matrix
# has a different first (reference) category
dat$ya <- with(dat, cbind(y1, y2, y3))
dat$yb <- with(dat, cbind(y2, y1, y3))
dat$yc <- with(dat, cbind(y3, y2, y1))

I noticed, when sampling from the prior, that my priors didn’t seem to impact the mean/median of the prior predictive samples the way I thought they would.

I thought this was because my priors weren’t applying to the reference category, so I set up the data with different reference levels in order to fit different prior predictive models:

prior <- prior(normal(0, 2), "Intercept")

fit1 <- brm(bf(ya | trials(size)  ~ 1), data = dat, 
           family = multinomial(), prior = prior, sample_prior = "only")

fit2 <- brm(bf(yb | trials(size)  ~ 1), data = dat, 
            family = multinomial(), prior = prior, sample_prior = "only")

fit3 <- brm(bf(yc | trials(size)  ~ 1), data = dat, 
            family = multinomial(), prior = prior, sample_prior = "only")

Here are some results:

# requires dplyr, tidybayes, ggplot2, and patchwork
library(dplyr)
library(tidybayes)
library(ggplot2)
library(patchwork)

a <- dat %>%
  add_fitted_draws(fit1) %>%
  ggplot(aes(x = .value, y = .category)) +
  stat_pointintervalh(.width = c(.66, .95))
b <- dat %>%
  add_fitted_draws(fit2) %>%
  ggplot(aes(x = .value, y = .category)) +
  stat_pointintervalh(.width = c(.66, .95))
c <- dat %>%
  add_fitted_draws(fit3) %>%
  ggplot(aes(x = .value, y = .category)) +
  stat_pointintervalh(.width = c(.66, .95))

a / b / c + plot_annotation(tag_levels = "1", tag_prefix = "fit ")

The reference level is at the bottom each time, and always has a lower median and a narrower distribution. In my mind this makes sense, since the other categories are defined relative to the reference, so they would have greater variance?
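
To sanity-check that intuition outside of brms, here is a rough simulation using the standard softmax parameterization with the reference linear predictor fixed at 0 and the normal(0, 2) prior from above (just a sketch, not brms’s exact internals):

set.seed(1)
# reference category's linear predictor is fixed at 0; the other two
# categories get normal(0, 2) intercept draws
eta2 <- rnorm(4000, 0, 2)
eta3 <- rnorm(4000, 0, 2)
denom <- 1 + exp(eta2) + exp(eta3)
p <- cbind(ref = 1 / denom, cat2 = exp(eta2) / denom, cat3 = exp(eta3) / denom)
# the ref column shows a somewhat lower median and a much shorter upper tail
apply(p, 2, quantile, probs = c(.025, .5, .975))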

However, in my actual data the distribution for the reference is often narrower, with a lower median, than I would predict a priori. For instance, I have 157 timepoints at each of which a subject is assigned to one and only one of the 10 states, so a priori I might expect a median of around 15-16 per state, with values above 100 being less likely but certainly not impossible (someone could be in the same state all 157 times). But my priors for the test data I attached look like this (with the only prior being normal(0, 2) on "Intercept"):

That is not a reasonable prior on the reference category, and all of the categories are centered too low. I tried both increasing and decreasing the sd of the normal prior; while that changes the spread, they are all still centered rather low.

Questions
1. How can I specify a set of priors that will center the prior predictive intervals higher?
2. How can I increase the width of the distribution on the reference level, given that the current one is far too narrow?
3. Since the reference category behaves so differently, is it typical for investigators to run the model with different reference categories and then average over the models?

Thanks so much

Hugo

EDIT: forgot to attach test_subset.csv (396 Bytes)

Wait, obviously if some levels can take more extreme (higher) values, that implies lower counts in the remaining levels, so wider intervals would probably drive the prior medians to the left/lower. Duh.

Also, I found family = multinomial(refcat = NA), which gives equal priors across all levels, at the cost of being slower to sample and more prone to non-identifiability.
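
For example, refitting the first prior-only model without a reference category looks something like this (a sketch reusing the dummy data and prior from above):

fit1_na <- brm(bf(ya | trials(size) ~ 1), data = dat,
               family = multinomial(refcat = NA),
               prior = prior, sample_prior = "only")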

In this context I am interpreting the prior on each intercept as (super simplified) governing P(Y == a) as opposed to P(Y == not-a). So putting really narrow priors around 0 means the priors across levels look pretty similar and are centered around trials/categories (more or less).

[Rplot02: prior predictive intervals under a narrow prior, all levels similar and centered near trials/categories]

Whereas wider intervals look like the ones in my post:
[Rplot01: prior predictive intervals under the wider prior]

It is not possible to set a prior such that the median sits where I thought I wanted it (~16) AND the spread is as wide as I wanted (tails up to ~100), because that is nonsensical: the counts have to sum to the number of trials, so with the same prior on every category the mean for each is pinned at roughly trials/categories, and a long right tail necessarily pushes the median below that.

I would appreciate input re: my understanding, but it seems my question was unidentified!

Hi Hugo, I am sorry, I am a little overwhelmed by the amount of detail here and I am not sure I can be very helpful right now. A general point is that perhaps you shouldn’t be too worried about the prior if it doesn’t influence the posterior too much, which you can check by setting priors of different scales and comparing the posterior inference. Setting priors in general, and specifically for categorical/multinomial models, is a non-trivial topic and I am not sure I can give you any good advice out of the box.
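
A minimal sketch of what such a scale-sensitivity check could look like, using the dummy data from earlier (the scales 1 and 5 are arbitrary examples, not from the reply):

fit_sd1 <- brm(bf(ya | trials(size) ~ 1), data = dat, family = multinomial(),
               prior = prior(normal(0, 1), "Intercept"))
fit_sd5 <- brm(bf(ya | trials(size) ~ 1), data = dat, family = multinomial(),
               prior = prior(normal(0, 5), "Intercept"))
fixef(fit_sd1)  # compare the intercept estimates across the two prior scales
fixef(fit_sd5)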

Thanks @paul.buerkner . I think I figured out most of the reasons I was confused.

My understanding is that if I have a multinomial model with X discrete outcomes and a fixed number of trials (N), the fact that the prior predictive draws have to add up to the number of trials means that a larger variance/spread leads to a lower median. In other words, if my prior is very tight around zero then there’s a fairly even spread between the outcomes and the expected count for each outcome is more or less N/X. Whereas, if the prior has a larger variance so that some very high values are possible, the predictive intervals will have a lower median with a long tail towards the right/higher values.
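
A quick simulation of that intuition (the softmax() helper and the 157-trials / 10-category numbers are my own choices to mirror my data, not anything from brms):

softmax <- function(x) exp(x) / sum(exp(x))

sim_counts <- function(sd, n_draws = 4000, n_cat = 10, n_trials = 157) {
  eta <- matrix(rnorm(n_draws * n_cat, 0, sd), ncol = n_cat)
  p <- t(apply(eta, 1, softmax))  # draws x categories probabilities
  p[, 1] * n_trials               # implied expected count for one category
}

quantile(sim_counts(0.1), c(.5, .975))  # tight prior: median near 157/10, narrow
quantile(sim_counts(2),   c(.5, .975))  # wide prior: lower median, long right tail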

I have a simpler follow-up:
If I fit a multinomial model with refcat = NA, what is the appropriate transformation to get from the intercept values in the posterior draws to the expected probabilities or expected counts?

Here’s an example based on your earlier example:

N <- 15
dat <- data.frame(
  y1 = rbinom(N, 50, 0.9), y2 = rbinom(N, 150, 0.5), 
  y3 = rbinom(N, 150, 0.2), x = rnorm(N)
)
dat$size <- with(dat, y1 + y2 + y3)
dat$y <- with(dat, cbind(y1, y2, y3))
prior <-  prior(normal(0, 1), "Intercept") 
fit <- brm(bf(y | trials(size)  ~ 1), data = dat, 
           family = multinomial(refcat = NA), prior = prior)

If we call tidy_draws():

tidy_draws(fit) %>% select(-contains("__"))
# A tibble: 4,000 x 6
   .chain .iteration .draw b_muy1_Intercept b_muy2_Intercept b_muy3_Intercept
    <int>      <int> <int>            <dbl>            <dbl>            <dbl>
 1      1          1     1          0.567              1.08            0.0644
 2      1          2     2          0.538              1.10            0.0724
 3      1          3     3          0.189              0.697          -0.284 
 4      1          4     4          0.231              0.800          -0.251 
 5      1          5     5         -0.240              0.301          -0.618 
 6      1          6     6         -0.206              0.292          -0.637 
 7      1          7     7         -0.0975             0.383          -0.614 
 8      1          8     8         -0.0840             0.471          -0.590 
 9      1          9     9         -0.00945            0.526          -0.465 
10      1         10    10          0.176              0.695          -0.266 
# … with 3,990 more rows

I want to double-check that I understand how to get from those intercept values to the ones provided by fitted():

fitted(fit)
, , P(Y = y1)

      Estimate Est.Error     Q2.5    Q97.5
 [1,] 42.07248  1.337674 39.50588 44.75072
 [2,] 48.68386  1.547880 45.71394 51.78297
 [3,] 43.27455  1.375893 40.63462 46.02931
 [4,] 45.37817  1.442777 42.60991 48.26684
 [5,] 39.36782  1.251680 36.96621 41.87388
 [6,] 43.27455  1.375893 40.63462 46.02931
 [7,] 42.67351  1.356783 40.07025 45.39001
 [8,] 48.68386  1.547880 45.71394 51.78297
 [9,] 47.18128  1.500106 44.30302 50.18473
[10,] 48.98438  1.557434 45.99613 52.10262
[11,] 45.07765  1.433222 42.32772 47.94720
[12,] 44.77713  1.423667 42.04554 47.62755
[13,] 44.77713  1.423667 42.04554 47.62755
[14,] 47.78231  1.519215 44.86739 50.82403
[15,] 49.58542  1.576544 46.56050 52.74192

Based on this, I think the answer is something like:

draws_temp <- tidy_draws(fit) %>%
  select(-contains("__")) %>%
  group_by(.chain, .iteration, .draw) %>%
  # softmax within each draw: exponentiate every intercept and divide by
  # the sum of the exponentiated intercepts
  mutate(ExpIntercepts = exp(b_muy1_Intercept) + exp(b_muy2_Intercept) + exp(b_muy3_Intercept)) %>%
  mutate(b_muy1_Intercept = exp(b_muy1_Intercept) / ExpIntercepts,
         b_muy2_Intercept = exp(b_muy2_Intercept) / ExpIntercepts,
         b_muy3_Intercept = exp(b_muy3_Intercept) / ExpIntercepts)

This gives P(Y=y1), P(Y=y2), etc. for each draw, which can then be multiplied by the number of trials and summarised to reproduce the fitted values:

dat %>% select(size) %>%
  rowwise() %>%
  # for each row, scale every posterior draw's probability by that row's
  # trial count, then average over draws to get the expected counts
  mutate("P(Y=y1)" = mean(size * draws_temp$b_muy1_Intercept),
         "P(Y=y2)" = mean(size * draws_temp$b_muy2_Intercept),
         "P(Y=y3)" = mean(size * draws_temp$b_muy3_Intercept))

# A tibble: 15 x 4
    size `P(Y=y1)` `P(Y=y2)` `P(Y=y3)`
   <int>     <dbl>     <dbl>     <dbl>
 1   140      42.1      71.3      26.6
 2   162      48.7      82.5      30.8
 3   144      43.3      73.3      27.4
 4   151      45.4      76.9      28.7
 5   131      39.4      66.7      24.9
 6   144      43.3      73.3      27.4
 7   142      42.7      72.3      27.0
 8   162      48.7      82.5      30.8
 9   157      47.2      80.0      29.9
10   163      49.0      83.0      31.0
11   150      45.1      76.4      28.5
12   149      44.8      75.9      28.3
13   149      44.8      75.9      28.3
14   159      47.8      81.0      30.2
15   165      49.6      84.0      31.4
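
For what it’s worth, the same transformation can be written a bit more compactly (a sketch; softmax() is my own helper, not a brms function):

softmax <- function(x) exp(x) / sum(exp(x))

probs <- tidy_draws(fit) %>%
  select(starts_with("b_mu")) %>%
  apply(1, softmax) %>%  # one softmax per posterior draw
  t()                    # draws x categories matrix of P(Y = k)

# expected counts for the first observation, to compare against the
# Estimate column of fitted(fit) for row 1
colMeans(probs) * dat$size[1]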

Thanks, and thanks for brms, it’s amazing.

Yes, this is exactly how it is done (in intercept-only models).