Hierarchical Categorical Logit with large predictor matrix fails to initialise

Dear all,

I’ve recently transitioned from JAGS to STAN for the purpose of fitting large-scale categorical logit models. I’m running into a problem with the model below, in that the model fails to initialize:

Rejecting initial value:
Chain 1: Log probability evaluates to log(0), i.e. negative infinity.
Chain 1: Stan can’t start sampling from this initial value.

I’m not sure how to diagnose this - could you point me in the right direction?



data {
int N; // N. Obs
int Nv; // N. Levels
int Nz; // N. Covariates
int Nj; // N. Choices

int V[N]; // Choices
matrix[N,Nz] Z; // Covariates
int var_id[Nz]; // Level-id per Covariate

parameters {
matrix[Nz,Nj] Beta; // Regression Coefficients of interest
matrix[Nz,Nj - 1] Theta_raw; // Prior Mean
vector[Nj-1] alpha_raw; // Choice-Intercept 

cholesky_factor_corr[Nz] L_Omega; // off-diagonal Cholesky Prior
vector<lower=0>[Nv] L_sigma_raw; // Prior Scale

transformed parameters {
         vector[Nj] alpha;
vector<lower=0>[Nz] L_sigma;
matrix[Nz, Nz] L_Sigma;
       matrix[N,Nj] ZB;
      matrix[Nz,Nj] Theta;
       vector[Nj] nu[N];

for(z in 1:Nz) L_sigma[z] = L_sigma_raw[var_id[z]];
                  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

for (z in 1:Nz) {
      Theta[z,1] = 0;
for (j in 1:(Nj-1)) { 
Theta[z,j + 1] = Theta_raw[z ,j];
} }

      alpha[1] = 0;
for (j in 1:(Nj-1)) { 
alpha[j + 1] = alpha_raw[j];
          ZB = Z * Beta;

for (i in 1:N) for (j in 1:Nj)  nu[i][j] = alpha[j] + ZB[i,j];

model {
for(j in 1:(Nj-1)) Theta_raw[,j] ~ normal(0, 10);
  for (v in 1:Nv) L_sigma_raw[v] ~ cauchy(0,2.5);
                         L_Omega ~ lkj_corr_cholesky(4);

for(j in 1:Nj) Beta[,j] ~ multi_normal_cholesky(Theta[,j], L_Sigma);
for (i in 1:N) V[i] ~ categorical_logit(nu[i]);


All Zs are standardised; var_id specifies the groups of Z that share a variance parameter for regularisation purposes; Nz is approximately 1,000; N is approximately 30,000.

I’m not sure where the problem would be, but estimating a 1000x1000 correlation matrix seems pretty hard.

Can you make L_Omega data and see if the model runs?

Thanks Ben, I will try that and report back.

Relatedly, I have two further questions:

  1. In JAGS it was the case that using the multinomial-poisson transform would provide better mixing and faster run-time - is it the case in STAN as well, in your experience?

  2. For some practical reasons in the model above I’m using standardised dummies with a common variance to represent all my random effects, including some that could be written simply as random intercepts in STAN.
    For instance, we have 650 areas, and I choose to bring them in as 650 separate standardised dummies with a common variance, rather than as a random intercept model with a nested index. Do you suspect this creates some problems? I’d have thought there would be no difference in computation time, as the number of parameters is the same and I’m assuming the estimation steps would be identical in both cases?

Many thanks for your help!

I’m not familiar with JAGS. What are the two options here? It’s multinomial-poisson vs. what?

As far as random effects in Stan goes, the biggest thing is probably doing a non-centered parameterization instead of a centered parameterization to avoid divergences (Diagnosing Biased Inference with Divergences).

So that means instead of thinking about:

parameters {
  real a;
  real mu;
  real sigma;

model {
  ... // priors on mu and sigma
  a ~ normal(mu, sigma);


parameters {
  real z;
  real mu;
  real sigma;
transformed parameters {
  real a = mu + z * sigma;
model {
  ... // priors on mu and sigma
  z ~ normal(0, 1);

Which specifies the same prior on a but the second samples much better in Stan cause of how the algorithms work.

But again I’m not familiar with the terms, what’s the difference in a standardised dummy with a common variance vs. a random intercept?


Thanks for taking the time.

  1. the two options are:

a) a multinomial/categorical likelihood, similar to what follows, where y is the vector of choice-ids:
y[i] ~ categorical(pi[i,1:Nj])
pi[i,j] = softmax(mu[i,j]) = XB[i,j]


b) a poisson likelihood with an observation specific intercept lambda, with a non-informative prior, to `softly’ sum to 1, where y is a one-hot encoding matrix (the same vector as above but projected to a 0/1 matrix):

y[i,j] ~ poisson(mu[i,j])
log(mu[i,j]) = lambda[i,j] + XB[I,j]
lambda[i,j] ~ N(0,1e6)

The latter was a better choice in JAGS because the poisson sampler was generally faster and produced more stable results.

  1. thanks for the tip about non-centering, I will try that and it makes sense.

What I mean by a `standardised dummy with a common variance’ is as follows: suppose I have a vector of individual-level attributes, such as religion, for instance;

I can plug that into Stan as a vector of religion-ids, namely j = 1 for christian; 2 for buddhist etc. , and introduce it into a regression with a nested index as follows:

for(I in 1:N){
y[i] ~ N(mu[I],sigma);
mu[i] = theta[rel_id[i]] + XB;

for(j in 1:Nj){
theta[j] ~ N(0,sigma_theta)
sigma_theta ~ cauchy(0,2.5)

The alternative I implement above is to turn the religion-id vector into a one-hot encoded matrix Z (of dimensions N,Nj); the I standardise these (so Z*_ij = (Z_ij-bar(Z)_j)/sd(Z)_j ), and input them in the model as follows:

for(I in 1:N){
y[i] ~ N(mu[I],sigma);
mu[i] = Z*theta + XB;

for(j in 1:Nj){
theta[j] ~ N(0,sigma_theta)
sigma_theta ~ cauchy(0,2.5)

Does that make sense? My original motivation for doing this was:

a) to speed up computation and help convergence (the same reason why we would centre covariates in any regression model);

b) more importantly, this is a more flexible specification - suppose I thought religion-id and race-id should both share the same variance in my model; in the random intercept parametrisation, I’d have to write priors for two sets of estimates - say theta and eta; in the one-hot encoded parametrisation, I can just stick all the religion and race dummies into the same matrix, and allow for a single parameter vector theta.

In JAGS I had also implemented some level of adaptive shrinkage with this - namely by allowing the model to choose how many different variance components it would make sense to have, and which one-hot encoded columns should share them.

I hope that made some sort of sense - do let me know your thoughts and thanks again for the help, still haven’t had a chance to run the above model with omega as data.


This is the way I’d do random effects in Stan.

vector[N] mu = X * B; // fixed effects
for(I in 1:N){
  mu[l] = mu[l] + theta[rel_id[l]]; // add on random effects
  y[l] ~ N(mu[I],sigma);

And non-center the thetas:

for(j in 1:Nj){
  theta[j] = theta_z[j] * sigma_theta;

Where theta is defined at the top of the model block and theta_z is a parameter. This mostly works well.

I think you can definitely code the random effects with a matrix like that, but then you’d need to be careful to use sparse matrix vector products to make it efficient. I’m not familiar with standardizing the matrix like that.

I asked @jonah/@Bob_Carpenter about the multinomial vs. Poisson thing. They said @bgoodri at some point thought the Poisson trick would be good, but I don’t know if anyone has actually tried.

So no solid advice there. Best we can say is try both things and see what happens. If you get a chance to compare them both, we’d be curious to hear how it works out.