Multinomial to Poisson transformation

My question is primarily a math question about the Multinomial to Poisson transformation,
referring to a post about how to use the multinomial distribution in brms.

See: https://github.com/paul-buerkner/brms/issues/338#issuecomment-369760462
The math is based on:
https://stats.stackexchange.com/questions/105282/logistic-multinomial-regression-as-two-multiple-poisson-regressions

\log \frac{\mu_{ij}}{\mu_{iJ}} = \alpha_j + X_i \beta_j

is transformed to:

\log\, \mu_{ij} = \eta + \theta_i + \alpha_j^* + X_i\beta_j^*
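
To spell out the step between the two displays (my own working, following the linked Cross Validated answer): write the reference category the same way and take the log-ratio, so that \eta and \theta_i cancel:

\log\, \mu_{iJ} = \eta + \theta_i + \alpha_J^* + X_i\beta_J^*

\log \frac{\mu_{ij}}{\mu_{iJ}} = (\alpha_j^* - \alpha_J^*) + X_i (\beta_j^* - \beta_J^*) = \alpha_j + X_i \beta_j

so \alpha_j = \alpha_j^* - \alpha_J^* and \beta_j = \beta_j^* - \beta_J^*.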

and the brms fit by Paul is:

fit2 <- brm(y ~ key + key:x + (1 | obs), df_long, family = poisson())

Since \alpha_j = \alpha^*_j - \alpha^*_J,
it follows that \alpha_J = \alpha^*_J - \alpha^*_J = 0 (the reference category).

\eta is the intercept in fit2, and (1 | obs) corresponds to \theta_i.
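
For context, a minimal sketch of how a df_long of 300 rows might be simulated (the simulation settings and true coefficient values are my assumptions, not Paul's original code; only the column names y, key, x, obs come from the model formula above):

set.seed(1)
N <- 100                      # observations, each with 3 category counts -> 300 rows
x <- rnorm(N)
# assumed true coefficients on the multinomial-logit scale (category y1 = reference)
eta2 <- 1.2 - 1.1 * x
eta3 <- 1.7 - 1.6 * x
p <- cbind(1, exp(eta2), exp(eta3))
p <- p / rowSums(p)
size <- 20                    # same number of trials for every observation
Y <- t(sapply(seq_len(N), function(i) rmultinom(1, size, p[i, ])))
df_long <- data.frame(
  obs = factor(rep(seq_len(N), times = 3)),
  key = rep(c("y1", "y2", "y3"), each = N),
  x   = rep(x, times = 3),
  y   = c(Y)                  # columns of Y stacked: y1 counts, then y2, then y3
)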

Is this model identifiable? Why do we need \theta_i?

The brms fit:

 Family: poisson 
  Links: mu = log 
Formula: y ~ key + key:x + (1 | obs) 
   Data: df_long (Number of observations: 300) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
         total post-warmup samples = 4000
    ICs: LOO = NA; WAIC = NA; R2 = NA
 
Group-Level Effects: 
~obs (Number of levels: 100) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     0.02      0.02     0.00     0.06       2184 1.00

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept     1.57      0.06     1.44     1.69       2509 1.00
keyy2         1.18      0.07     1.04     1.33       2655 1.00
keyy3         1.66      0.07     1.52     1.79       2653 1.00
keyy1:x       0.77      0.08     0.62     0.93       2886 1.00
keyy2:x      -0.36      0.06    -0.46    -0.25       4000 1.00
keyy3:x      -0.85      0.05    -0.95    -0.75       4000 1.00

Considering glm:
fit3 <- glm(y ~ key + key:x, data=df_long, family = poisson)
summary(fit3)

Call:
glm(formula = y ~ key + key:x, family = poisson, data = df_long)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.11253  -0.65366   0.01812   0.60466   2.06867  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.56862    0.06455  24.302  < 2e-16 ***
keyy2        1.17994    0.07380  15.988  < 2e-16 ***
keyy3        1.65664    0.07044  23.519  < 2e-16 ***
keyy1:x      0.77511    0.07801   9.936  < 2e-16 ***
keyy2:x     -0.35613    0.05575  -6.387 1.69e-10 ***
keyy3:x     -0.85128    0.05155 -16.515  < 2e-16 ***

and the multinom:

multinom(formula = Y ~ x, data = df)

Coefficients:
  (Intercept)         x
2    1.179938 -1.131244
3    1.656643 -1.626395

Std. Errors:
  (Intercept)          x
2  0.07380438 0.09588863
3  0.07043854 0.09350489

These fits match; the slope arithmetic:

> -0.35613 - 0.77511
[1] -1.13124
> -0.85128 - 0.77511
[1] -1.62639
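
In other words, each multinomial slope is the difference between the corresponding Poisson interaction and the reference-category interaction; a quick check against fit3 (coefficient names as in the glm summary above):

cf <- coef(fit3)
# multinomial slopes: subtract the reference category's interaction
cf["keyy2:x"] - cf["keyy1:x"]   # about -1.131
cf["keyy3:x"] - cf["keyy1:x"]   # about -1.626
# multinomial intercepts are the key main effects directly
cf[c("keyy2", "keyy3")]         # about 1.180 and 1.657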

There seems to be no need for \theta_i, so why does brms need the (1 | obs) term in the formula?
This is what I don't understand. Is it a Stan thing?

As you see above, the model is definitely identifiable. Don't forget we set a prior on \theta_i. In the above model sd(obs) is super small. What happens if you simulate data so that it is larger?

Thanks Paul, Stan seems to be quite tolerant either way.
The idea of \theta_i is to fit the normalizing constant for each observation across the categories,
and later in the linked post it is said to cancel out.
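
To make that cancellation explicit (again my own working): conditioning the Poisson counts on the row total gives multinomial probabilities in which \eta and \theta_i drop out:

\pi_{ij} = \frac{\mu_{ij}}{\sum_k \mu_{ik}} = \frac{\exp(\eta + \theta_i + \alpha_j^* + X_i\beta_j^*)}{\sum_k \exp(\eta + \theta_i + \alpha_k^* + X_i\beta_k^*)} = \frac{\exp(\alpha_j^* + X_i\beta_j^*)}{\sum_k \exp(\alpha_k^* + X_i\beta_k^*)}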

If I fit the model with \theta_i and without it and compare the coefficients, the
difference is about 1e-10. So do I need \theta_i? If not, I would save a lot
of parameters in Stan.

Added later: this special-case example may be pathological.
However, \theta_i could at least be reduced to N - 1 dimensions if it cannot be dropped entirely.

You mean for your particular example or in general?

For your particular example you may be right that dropping theta is OK,
as the results of the two models are very similar. However, in order to know that, we had to fit the model including theta anyway.

In general, I have my doubts that you can always drop that. I haven’t investigated this in detail but I would think that as soon as the number of trials differs across observations, thetas may have a considerably non-zero SD and thus the results may differ. So unless you can show me a thorough mathematical analysis or simulation study, I would be hesitant to conclude theta could just be dropped (in general).
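
A hedged sketch of the kind of check this suggests (not from the original thread; it reuses N, x, and p from the data sketch above and only varies the number of trials per observation):

size_i <- sample(c(5, 50, 500), N, replace = TRUE)   # row totals now differ a lot
Y2 <- t(sapply(seq_len(N), function(i) rmultinom(1, size_i[i], p[i, ])))
df_long2 <- data.frame(
  obs = factor(rep(seq_len(N), times = 3)),
  key = rep(c("y1", "y2", "y3"), each = N),
  x   = rep(x, times = 3),
  y   = c(Y2)
)
fit_with    <- brm(y ~ key + key:x + (1 | obs), df_long2, family = poisson())
fit_without <- brm(y ~ key + key:x,             df_long2, family = poisson())
# expect sd(Intercept) for obs to be far from zero now, and the key:x
# estimates of fit_without to drift away from those of fit_with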

Please also note that equality of point estimates (say, the mean) doesn't imply equality of posterior distributions. For instance, you will also need the SE (and CIs etc.) to be comparable (which is of course not necessarily sufficient to establish equality of the whole posterior distribution).
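
For a quick side-by-side comparison that goes beyond the point estimates, one could look at the full population-level summaries of the two hypothetical fits from the sketch above:

fixef(fit_with)     # estimate, est. error, and 95% intervals
fixef(fit_without)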
