If I am reading the documentation right, categorical_logit(beta) means categorical(softmax(beta)), i.e. categorical(probability_vector), where beta[i] is proportional to log(probability_vector[i]), because softmax(beta) = exp(beta)/sum(exp(beta)). Alternatively, one could of course have used beta[i] proportional to logit(probability_vector[i]), but if I understand this correctly, that is not what happens?
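A quick numeric check of that reading, as a base R sketch rather than anything taken from the Stan sources. The softmax() helper and the beta values are made up for illustration; qlogis() is base R's logit. It suggests that beta matches log(probability_vector) up to one shared additive constant, and does not match the element-wise logit.

```r
# Hypothetical helper: numerically stable softmax (not a Stan or rstan function).
softmax <- function(x) {
  z <- exp(x - max(x))
  z / sum(z)
}

beta <- c(1.5, 0.2, -0.4, -1.3)  # arbitrary illustrative values
p <- softmax(beta)

# log(p) and beta differ by the same constant in every component.
offset <- log(p) - beta
print(offset)

# Adding a constant to every beta leaves the probabilities unchanged.
p_shifted <- softmax(beta + 10)
print(rbind(p, p_shifted))

# The element-wise logit of p is NOT beta plus a shared constant.
logit_gap <- qlogis(p) - beta
print(logit_gap)
```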

If I looked at that correctly, then why do we call this categorical_logit and not categorical_log? Is that just one of those historical things (as we can see this has many names anyway) or did I get something wrong?

Venturing a non-expert guess: the categorical() distribution wants inputs that are probabilities that sum to 1. The betas are values on a log-odds (a.k.a. logit) scale, and the softmax function ensures a mapping of the logit values to probability values such that the latter sum to 1. So you can think of the name categorical_logit() as implying starting with logit values and ending with a categorical() distribution, via a softmax transform. You are correct that the softmax transform operates on the log scale of the logit values, but I don't think that because of that it makes any more sense to call it categorical_log(), as that then omits the idea that the beta values are log-odds.

The weird thing is that they are not logits. I.e. inv_logit(beta) (even after standardizing them) has nothing to do with the probabilities. Simple example:

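library(rstan)

scode <- "
data {
  int records;
  int categories;
  int y[records];  // observed category for each record, coded 1..categories
}
parameters {
  vector[categories] beta;
}
model {
  // softmax is invariant to adding a constant to beta,
  // so this prior softly identifies the sum
  sum(beta) ~ std_normal();
  y ~ categorical_logit(beta);
}
generated quantities {
  simplex[categories] probs;
  probs = softmax(beta);  // the probabilities categorical_logit actually uses
}
"
sfit <- stan(model_code = scode,
             data = list(records = 200,
                         categories = 4,
                         y = c(rep(1, 140), rep(2, 30), rep(3, 20), rep(4, 10))))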
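
The same point can be checked without running the sampler. This is only a sketch in base R: it assumes the posterior for probs ends up close to the sample proportions (0.70, 0.15, 0.10, 0.05) and builds sum-to-zero betas that reproduce them. The softmax() helper is hypothetical, and plogis() is base R's inverse logit.

```r
# Hypothetical helper: numerically stable softmax (not a Stan or rstan function).
softmax <- function(x) {
  z <- exp(x - max(x))
  z / sum(z)
}

# Sample proportions from the data above (assumed to approximate the posterior).
p <- c(140, 30, 20, 10) / 200

# Betas that reproduce p under softmax, centred so that sum(beta) is 0.
beta <- log(p) - mean(log(p))

via_softmax <- softmax(beta)                 # recovers p
via_inv_logit <- plogis(beta)                # element-wise inverse logit
via_inv_logit_std <- via_inv_logit / sum(via_inv_logit)  # rescaled to sum to 1

print(round(rbind(p, via_softmax, via_inv_logit, via_inv_logit_std), 3))
```

Under these assumptions only the softmax row matches the proportions; the inverse-logit rows do not, with or without rescaling.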

That's quite possibly the case here. I don't remember who named the function or we could ask them, although someone could check the GitHub history and maybe find some discussion about it in a PR or issue.

Edit: I think @Erik_Strumbelj's answer is actually the explanation

The naming seems to be consistent with how generalized linear models are typically named (distribution family + a link function, the inverse of which is applied to the linear terms).

Am I missing something here? If prob[i] = exp(beta[i])/sum(exp(beta)), then beta[i] is not proportional to log(prob[i]), at least not as a function of beta[i]. OK, for any given set of betas you could treat sum(exp(beta)) as a constant, in which case the betas would be log(prob) + some constant, which is what you have in your empirical example. But if you moved a beta[i], log(prob[i]) would not move proportionately.
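
To put a number on that last sentence, here is a small base R sketch with a hypothetical softmax() helper and arbitrary beta values. Differentiating log(prob[i]) = beta[i] - log(sum(exp(beta))) with respect to beta[i] gives 1 - prob[i], so a one-unit move in beta[i] moves log(prob[i]) by less than one unit, and by less the more probable category i already is. The finite-difference slope below is compared against that expression.

```r
# Hypothetical helper: numerically stable softmax (not a Stan or rstan function).
softmax <- function(x) {
  z <- exp(x - max(x))
  z / sum(z)
}

beta <- c(1.5, 0.2, -0.4, -1.3)  # arbitrary illustrative values
p <- softmax(beta)

# Nudge only beta[1] and see how much log(prob[1]) moves.
eps <- 1e-6
beta_up <- beta
beta_up[1] <- beta_up[1] + eps
slope <- (log(softmax(beta_up)[1]) - log(p[1])) / eps

# Analytic derivative of log(prob[1]) with respect to beta[1].
analytic <- 1 - p[1]

print(c(finite_difference = slope, analytic = analytic))
```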

Yes, in the sense that maybe more people would then immediately understand what distribution/likelihood it is. Especially people with more of a machine learning background.

But changing it to that in Stan would go against the current convention of using the link function, not its inverse. For example, bernoulli_logit and poisson_log_glm would then have to become bernoulli_inv_logit and poisson_exp_glm, which would, I imagine, cause a lot of confusion, especially among statisticians who are used to the current convention from other software.

The link function is logit.
The inverse link function is exp(L)/(1 + exp(L)). When you have multiple logits, it's exp(L_k)/(1 + sum_{j=1}^{K-1} exp(L_j)) (assuming one category is held to be zero, a reference category). This is the same thing as exp(L_k)/sum_{j=1}^{K} exp(L_j) without a reference category. In other words, softmax is the inverse link function of multiple logits. The functions with _logit are expecting /logit values/ and inverse-transforming them into something the likelihood understands. Or: "The categorical distribution, but with logit input parameterization".
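
The equivalence described above can be checked numerically. This is a base R sketch, not Stan code; softmax() and inv_logit_ref() are hypothetical helpers written for this post, and the last category is arbitrarily chosen as the reference. It also shows the sense in which the betas are logits: differences against the reference are log-odds relative to that category.

```r
# Hypothetical helper: numerically stable softmax (not a Stan or rstan function).
softmax <- function(x) {
  z <- exp(x - max(x))
  z / sum(z)
}

# Hypothetical helper: inverse link for K-1 logits measured against a
# reference category whose logit is fixed at 0 (placed last here).
inv_logit_ref <- function(L) {
  c(exp(L), 1) / (1 + sum(exp(L)))
}

beta <- c(1.5, 0.2, -0.4, -1.3)  # arbitrary illustrative values
K <- length(beta)

# Logits relative to the last category.
L <- beta[-K] - beta[K]

p_ref <- inv_logit_ref(L)   # reference-category form
p_soft <- softmax(beta)     # form without a reference category

# Differences against the reference are log-odds versus that category.
log_odds_vs_ref <- log(p_soft[-K] / p_soft[K])

# With two categories, softmax reduces to the ordinary inverse logit.
two_cat <- softmax(c(0.7, 0))[1]

print(rbind(p_ref, p_soft))
print(rbind(L, log_odds_vs_ref))
print(c(two_cat = two_cat, plogis = plogis(0.7)))
```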