Why is it called categorical_logit not categorical_log?

If I am reading the documentation right, categorical_logit(beta) means categorical(softmax(beta)), i.e. categorical(probability_vector), where beta[i] is proportional to log(probability_vector[i]), because softmax(beta) = exp(beta) / sum(exp(beta)). Alternatively, one could of course have used beta[i] proportional to logit(probability_vector[i]), but if I understand this correctly, that is not what happens?
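A quick sanity check of that reading in plain R (a minimal sketch; softmax is written out by hand, and qlogis() is base R's logit):

softmax <- function(x) exp(x) / sum(exp(x))
p <- c(0.7, 0.15, 0.1, 0.05)
softmax(log(p) + 3)   # recovers p exactly: shifted log probabilities invert back
softmax(qlogis(p))    # does not recover p: logits are the wrong scale for softmax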

If I looked at that correctly, then why do we call this categorical_logit and not categorical_log? Is that just one of those historical things (as we can see, this distribution has many names anyway) or did I get something wrong?


Venturing a non-expert guess: the categorical() distribution wants inputs that are probabilities that sum to 1. The betas are values on a log-odds (a.k.a. logit) scale, and the softmax function ensures a mapping of the logit values to probability values such that the latter sum to 1. So you can think of the name categorical_logit() as implying starting with logit values and ending with a categorical() distribution, via a softmax transform. You are correct that the softmax transform operates on the log scale of the logit values, but I don't think it therefore makes any more sense to call it categorical_log(), as that then omits the idea that the beta values are log-odds.
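As a small illustration of that mapping (plain R, hand-rolled softmax, made-up input values):

softmax <- function(x) exp(x) / sum(exp(x))
b <- c(2, -1, 0.5)    # any real-valued vector
softmax(b)            # roughly 0.79 0.04 0.18, all positive...
sum(softmax(b))       # ...and summing to 1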

The weird thing is that they are not logits: inv_logit(beta) (even after standardizing the betas) has nothing to do with the probabilities. Simple example:

library(rstan)

scode <- "
data {
  int records;
  int categories;
  int y[records];
}
parameters {
  vector[categories] beta;
}
model {
  // beta is only identified up to an additive constant,
  // so softly pin down its sum
  sum(beta) ~ std_normal();
  y ~ categorical_logit(beta);
}
generated quantities {
  simplex[categories] probs;
  probs = softmax(beta);
}
"
sfit <- stan(model_code = scode,
             data = list(records = 200,
                         categories = 4,
                         y = c(rep(1, 140), rep(2, 30), rep(3, 20), rep(4, 10))))

round(summary(sfit)$summary, 4)

The output you get is:

              mean se_mean     sd      2.5%       25%       50%       75%     97.5%    n_eff   Rhat
beta[1]     1.5510  0.0080 0.2893    0.9819    1.3567    1.5628    1.7488    2.1055 1321.214 1.0043
beta[2]     0.0005  0.0080 0.3107   -0.6215   -0.2033    0.0070    0.2148    0.5958 1504.437 1.0029
beta[3]    -0.4104  0.0080 0.3198   -1.0540   -0.6187   -0.4014   -0.1863    0.2004 1607.080 1.0029
beta[4]    -1.1380  0.0082 0.3554   -1.8445   -1.3774   -1.1317   -0.8943   -0.4482 1887.195 1.0013
probs[1]    0.6994  0.0005 0.0326    0.6334    0.6777    0.7000    0.7218    0.7611 4814.550 1.0006
probs[2]    0.1504  0.0004 0.0255    0.1051    0.1320    0.1495    0.1665    0.2038 4844.375 1.0004
probs[3]    0.1005  0.0003 0.0210    0.0623    0.0859    0.0990    0.1139    0.1461 4124.470 0.9997
probs[4]    0.0498  0.0003 0.0152    0.0242    0.0390    0.0480    0.0589    0.0839 3434.646 1.0001
lp__     -184.8894  0.0371 1.4627 -188.5742 -185.5919 -184.5452 -183.8324 -183.0986 1550.569 0.9997

I.e. the betas are proportional to the logs of the probabilities, not the logits (as the softmax function implies).
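You can check this directly against the posterior means above (plogis() is base R's inverse logit):

beta  <- c(1.5510, 0.0005, -0.4104, -1.1380)   # posterior means from the table
probs <- c(0.6994, 0.1504, 0.1005, 0.0498)
beta - log(probs)   # roughly constant at ~1.9, which is log(sum(exp(beta)))
plogis(beta)        # roughly 0.83 0.50 0.40 0.24, nothing like probs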


Hm. Good sleuthing. I'm flummoxed!

Interesting question.

That's quite possibly the case here. I don't remember who named the function, or else we could ask them; someone could check the GitHub history and maybe find some discussion about it in a PR or issue.

Edit: I think @Erik_Strumbelj's answer is actually the explanation.

The naming seems to be consistent with how generalized linear models are typically named (distribution family + a link function, the inverse of which is applied to the linear terms).

Am I missing something here? If prob[i] = exp(beta[i]) / sum(exp(beta)), then beta[i] is not proportional to log(prob[i]), at least not as a function of beta[i]. OK, for any given set of betas you could treat sum(exp(beta)) as a constant, in which case the betas would be log(prob) plus some constant, which is what you have in your empirical example. But if you moved a beta[i], log(prob[i]) would not move proportionately, because the "constant" log(sum(exp(beta))) moves too.
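A tiny numeric illustration of that last point (plain R, made-up betas):

softmax <- function(x) exp(x) / sum(exp(x))
beta <- c(1, 0, -1)
log(softmax(beta))                # -0.41 -1.41 -2.41, i.e. beta - log(sum(exp(beta)))
log(softmax(beta + c(2, 0, 0)))   # -0.07 -3.07 -4.07: beta[1] moved by 2, but log(prob[1]) moved by only ~0.34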


In a sense categorical_softmax would seem like the best description.

Yes, in the sense that maybe more people would then immediately understand what distribution/likelihood it is, especially people with more of a machine learning background.

But changing it to that in Stan would go against the current convention of naming functions after the link function, not its inverse. For example, bernoulli_logit and poisson_log_glm would then have to become bernoulli_inv_logit and poisson_exp_glm, which would, I imagine, cause a lot of confusion, especially among statisticians who are used to the current convention from other software.


However, is the link function whose inverse is being used here really the logit? I know categorical_inv_softmax is not exactly pretty…


The link function is logit.
The inverse link function is exp(L) / (1 + exp(L)). When you have multiple logits, it's exp(L_k) / (1 + sum_{j=1}^{K-1} exp(L_j)) (assuming one category is held at zero as a reference category). This is the same thing as exp(L_k) / sum_{j=1}^{K} exp(L_j) without a reference category. In other words: softmax is the inverse link function of multiple logits. The functions with _logit are expecting logit values, and inverse-transforming them into something the likelihood understands. Or: "the categorical distribution, but with logit input parameterization".
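A small sketch of that equivalence in plain R (hypothetical logit values, with the reference category pinned at zero):

softmax <- function(x) exp(x) / sum(exp(x))
L <- c(0.8, -0.5, 1.2)         # K - 1 = 3 free logits
exp(L) / (1 + sum(exp(L)))     # inverse link with a reference category: 0.31 0.08 0.46
softmax(c(L, 0))               # same three values, plus 0.14 for the reference category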
