My first guess is that your model is overparametrized: a categorical distribution over C categories is completely determined by C - 1 parameters (because the probabilities need to sum to 1). However, if I read the model correctly, you predict C parameters. Usually, when working on the logit scale (i.e. before applying `softmax`), one fixes one of the vector elements to 0. Such overparametrization both prevents you from interpreting the coefficients in a useful way and creates weird interdependencies between the parameters that are hard for the sampler to work with.
I discussed a similar issue recently in Two questions: ①Rejecting initial value but still sampling. ②regarding divergent transitions, but feel free to ask for clarifications here if it is hard to understand.
A few additional minor suggestions:
- You can use the `cholesky_factor_cov` type so that the sampler works directly with the decomposition (this avoids having to decompose the matrix in the `transformed parameters` block and is usually more numerically stable). In many use cases it is recommended to separate the correlation matrix from the vector of variances; then you can use the `cholesky_factor_corr` type and the `lkj_corr_cholesky` prior.
- `categorical_logit(X)` is a more efficient shorthand for `categorical(softmax(X))`.
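Putting the pieces together, a minimal sketch of what the relevant blocks could look like; I don't know your actual model, so the names (`K` predictors, `C` categories, `beta_free`, `L_Omega`, `sigma`, `X`, `y`) are illustrative assumptions, not your code:

```stan
parameters {
  // C - 1 free logit columns; the first category is the reference
  matrix[K, C - 1] beta_free;
  cholesky_factor_corr[K] L_Omega;  // Cholesky factor of a correlation matrix
  vector<lower=0>[K] sigma;         // per-dimension scales
}
transformed parameters {
  // pin the first category's logits to 0 for identifiability
  matrix[K, C] beta = append_col(rep_vector(0, K), beta_free);
  // recombine scales and correlations into a covariance Cholesky factor
  matrix[K, K] L_Sigma = diag_pre_multiply(sigma, L_Omega);
}
model {
  L_Omega ~ lkj_corr_cholesky(2);
  // ... priors on beta_free and sigma, and any multivariate structure
  //     using L_Sigma, go here ...
  for (n in 1:N)
    y[n] ~ categorical_logit((X[n] * beta)');
}
```

The `categorical_logit` line works on the raw logits, so no explicit `softmax` call is needed.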
Best of luck with the model!