# Why is it called categorical_logit not categorical_log?

If I am reading the documentation right, categorical_logit(beta) means categorical(softmax(beta)), i.e. categorical(probability_vector), where beta[i] is proportional to log(probability_vector[i]), because softmax(beta) = exp(beta)/sum(exp(beta)). Alternatively, one could of course have used beta[i] proportional to logit(probability_vector[i]), but if I understand this correctly, that is not what happens?

If I looked at that correctly, then why do we call this categorical_logit and not categorical_log? Is that just one of those historical things (as we can see this has many names anyway) or did I get something wrong?
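To spell out what the question is getting at, here is a quick numerical sketch (Python used just for the arithmetic; the softmax definition is the one quoted above): the betas equal log(prob) up to a shared additive constant, but they are not logit(prob).

```python
import math

def softmax(beta):
    # categorical_logit(beta) is documented as categorical(softmax(beta))
    z = sum(math.exp(b) for b in beta)
    return [math.exp(b) / z for b in beta]

beta = [1.5, 0.0, -0.4, -1.1]
p = softmax(beta)

# beta[i] - log(p[i]) is the same constant for every i
# (namely log of the softmax normalizer)
diffs = [b - math.log(pi) for b, pi in zip(beta, p)]
print(diffs)

# ...but beta[i] is NOT logit(p[i]) = log(p[i] / (1 - p[i]))
logits = [math.log(pi / (1 - pi)) for pi in p]
print(logits[0], beta[0])  # clearly different
```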


Venturing a non-expert guess: the `categorical()` distribution wants inputs that are probabilities summing to 1. The `beta` values are on a log-odds (a.k.a. logit) scale, and the softmax function maps the logit values to probability values that sum to 1. So you can think of the name `categorical_logit()` as implying that you start with logit values and end with a `categorical()` distribution, via a softmax transform. You are correct that the softmax transform operates on the log scale of the logit values, but I don’t think that makes it any more sensible to call it `categorical_log()`, as that name omits the idea that the `beta` values are log-odds.
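One footnote to this guess (my own illustration, not from the Stan docs): in the two-category case, with the second coefficient pinned at zero, softmax reduces exactly to the inverse logit, which may be where the "logit" in the name comes from.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(beta):
    z = sum(math.exp(b) for b in beta)
    return [math.exp(b) / z for b in beta]

# With K = 2 and the reference category's beta fixed at 0,
# softmax([beta, 0])[0] = exp(beta) / (exp(beta) + 1) = inv_logit(beta)
beta = 0.7
p = softmax([beta, 0.0])
print(p[0], inv_logit(beta))  # identical
```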

The weird thing is that they are not logits. I.e. inv_logit(beta) (even after standardizing the betas) has nothing to do with the probabilities. Simple example:

```r
library(rstan)

scode <- "
data {
  int records;
  int categories;
  int y[records];
}
parameters {
  vector[categories] beta;
}
model {
  sum(beta) ~ std_normal();
  y ~ categorical_logit(beta);
}
generated quantities {
  simplex[categories] probs;
  probs = softmax(beta);
}
"

sfit <- stan(model_code = scode,
             data = list(records = 200,
                         categories = 4,
                         y = c(rep(1, 140), rep(2, 30), rep(3, 20), rep(4, 10))))

round(summary(sfit)$summary, 4)
```

The output you get is:

```
              mean se_mean     sd      2.5%       25%       50%       75%     97.5%    n_eff   Rhat
beta[1]     1.5510  0.0080 0.2893    0.9819    1.3567    1.5628    1.7488    2.1055 1321.214 1.0043
beta[2]     0.0005  0.0080 0.3107   -0.6215   -0.2033    0.0070    0.2148    0.5958 1504.437 1.0029
beta[3]    -0.4104  0.0080 0.3198   -1.0540   -0.6187   -0.4014   -0.1863    0.2004 1607.080 1.0029
beta[4]    -1.1380  0.0082 0.3554   -1.8445   -1.3774   -1.1317   -0.8943   -0.4482 1887.195 1.0013
probs[1]    0.6994  0.0005 0.0326    0.6334    0.6777    0.7000    0.7218    0.7611 4814.550 1.0006
probs[2]    0.1504  0.0004 0.0255    0.1051    0.1320    0.1495    0.1665    0.2038 4844.375 1.0004
probs[3]    0.1005  0.0003 0.0210    0.0623    0.0859    0.0990    0.1139    0.1461 4124.470 0.9997
probs[4]    0.0498  0.0003 0.0152    0.0242    0.0390    0.0480    0.0589    0.0839 3434.646 1.0001
lp__     -184.8894  0.0371 1.4627 -188.5742 -185.5919 -184.5452 -183.8324 -183.0986 1550.569 0.9997
```

I.e. the betas are proportional to the logs of the probabilities, not the logits (as the softmax function implies).
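To make that concrete, here is a quick check using the posterior means copied from the summary above (Python used just for the arithmetic): log(probs) minus beta is, up to Monte Carlo error, a shared constant, while inv_logit(beta) does not recover the probabilities.

```python
import math

# posterior means copied from the summary output above
beta  = [1.5510, 0.0005, -0.4104, -1.1380]
probs = [0.6994, 0.1504, 0.1005, 0.0498]

# log(probs[i]) - beta[i] is (nearly) the same constant for every i
diffs = [math.log(p) - b for p, b in zip(probs, beta)]
print(diffs)

# ...whereas inv_logit(beta[1]) is nowhere near probs[1]
inv_logit = lambda x: 1.0 / (1.0 + math.exp(-x))
print(inv_logit(beta[0]), probs[0])
```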


Hm. Good sleuthing. I’m flummoxed!

Interesting question.

That’s quite possibly the case here. I don’t remember who named the function or we could ask them, although someone could check the github history and maybe find some discussion about it in a PR or issue.

Edit: I think @Erik_Strumbelj’s answer is actually the explanation

The naming seems to be consistent with how generalized linear models are typically named (distribution family + a link function, the inverse of which is applied to the linear terms).

Am I missing something here? If prob[i] = exp(beta[i])/sum(exp(beta)), then beta[i] is not proportional to log(prob[i]), at least not as a function of beta[i]. Granted, for any given set of betas you could treat sum(exp(beta)) as a constant, in which case the betas would be log(prob) plus some constant, which is what you see in your empirical example. But if you moved a beta[i], log(prob[i]) would not move proportionately.
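A quick sketch of that last point (Python for the arithmetic only): since log(prob[i]) = beta[i] - log_sum_exp(beta), nudging one beta also moves the normalizer, so log(prob[i]) does not move one-for-one with beta[i].

```python
import math

def log_softmax(beta):
    # log(prob[i]) = beta[i] - log(sum(exp(beta)))
    lse = math.log(sum(math.exp(b) for b in beta))
    return [b - lse for b in beta]

beta = [1.5, 0.0, -0.4, -1.1]
lp0 = log_softmax(beta)

# bump beta[0] by 1: log(prob[0]) moves by less than 1,
# because the log-sum-exp normalizer moves with it
beta[0] += 1.0
lp1 = log_softmax(beta)
print(lp1[0] - lp0[0])  # strictly between 0 and 1
```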


In a sense categorical_softmax would seem like the best description.

Yes, in the sense that maybe more people would then immediately understand what distribution/likelihood it is. Especially people with more of a machine learning background.

But changing it to that in Stan would go against the current convention of using the link function rather than its inverse. For example, bernoulli_logit and poisson_log_glm would then have to become bernoulli_inv_logit and poisson_exp_glm, which I imagine would cause a lot of confusion, especially among statisticians used to the current convention from other software.
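To illustrate the convention being described (a sketch of the naming pattern, not Stan's actual implementation): bernoulli_logit(alpha) stands for bernoulli(inv_logit(alpha)), so the name carries the link (logit) while the function applies its inverse.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# bernoulli_logit(alpha) means bernoulli(inv_logit(alpha)):
# the suffix names the link, the function applies the inverse link
alpha = 0.7
theta = inv_logit(alpha)

# log-pmf of y = 1 computed both ways
lp_direct = math.log(theta)                      # via the probability
lp_logit  = alpha - math.log1p(math.exp(alpha))  # directly on the logit scale
print(lp_direct, lp_logit)  # equal
```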


However, is the inverse link function being used here really the logit? I know `categorical_inv_softmax` is not exactly pretty…
