Efficient computation of multi-logit

Hello,

I am very new to Stan and the Bayesian world, so apologies if this question is basic. I am trying to fit a multinomial logit model, and I would like to vectorise it so that it runs as fast as possible. So far I have only used one predictor, but fitting the model still takes a very long time. Here is my code so far:

data {
  int<lower=1> N;             // Number of observations
  int<lower=1> K;             // Number of categories
  int<lower=1> P;             // Number of predictors
  matrix[N, P] X;             // Predictor matrix
  array[N] int<lower=1, upper=K> y; // Outcome variable (categories 1..K)
}
parameters {
  row_vector[K - 1] alpha_raw;         // Intercepts for categories 2 to K
  matrix[P, K - 1] beta_raw;   // Coefficients for categories 2 to K
}
transformed parameters {
  matrix[P, K] beta = append_col(beta_raw, rep_vector(0, P));
  row_vector[K] alpha = append_col(alpha_raw, 0);
}
model {
  // Priors
  alpha ~ normal(0, 1); 
  to_vector(beta) ~ normal(0, 5);

  
  //y ~ categorical_logit(alpha'); This works! But I don't know why exactly. It prints more than just alpha (also beta)
  
  y ~ categorical_logit(alpha' + (X * beta)'); // Does not seem to work
  
  //for (n in 1:N){
    //y[n] ~ categorical_logit(alpha' + (X[n] * beta)'); // Works but very slowly
  //}
}

Basically, I tried to fit the model three different ways. When I go for an intercept-only model (the first statement, commented out above), it fits incredibly fast but also weirdly includes beta in its output; I don’t know why that works so well. When I put X * beta into the same equation (the second statement), it doesn’t work, and when I do it in a for loop (the third) it works, but really slowly.

Any solutions for this? Any help is appreciated.

Thank you and kind regards

I’d have to see the output from your model with only y ~ categorical_logit(alpha') to be sure, but I suspect that the output you’re seeing for beta in that case is simply the prior on beta, i.e., a normal(0, 5).

I’m always uncertain about my matrix algebra, but unless I’m mistaken, X[n] * beta produces a K-dimensional row vector (matching the K dimensions of alpha'), while X * beta produces an N x K matrix. I believe that categorical_logit() requires a vector, not a matrix.

As for making things faster, one of the experts here will have to chime in. I’m really bad at optimizing Stan code.

Kent


If that’s really the model you want, we have it built in as the categorical_logit_glm distribution.

We discuss these models and their identification in the multi-logit section of the User’s Guide, but I don’t think it mentions the GLM code.
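
For concreteness, here’s a minimal sketch of what that could look like with your data and transformed parameters blocks left as they are (alpha' appears because the GLM signature takes the intercepts as a vector rather than a row vector):

model {
  alpha_raw ~ normal(0, 1);
  to_vector(beta_raw) ~ normal(0, 5);
  // one vectorized GLM statement replaces the per-observation loop
  y ~ categorical_logit_glm(X, alpha', beta);
}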

Couple of issues.

matrix[N, P] X;
row_vector[K] alpha ...
matrix[P, K] beta ...

alpha' + (X * beta)'

You can work out the shapes from the inside out. X * beta has shape N x K, so (X * beta)' has shape K x N. alpha has shape 1 x K, so alpha' has shape K x 1. Stan doesn’t know how to add a K x N matrix and a K x 1 vector. I think what you intend is to add alpha to each column.

matrix[K, N] X_beta = (X * beta)';
for (n in 1:N)
  X_beta[ , n] += alpha';

Then you can use the categorical-logit,

y ~ categorical_logit(X_beta);

which is all going to be faster without all the transposes, using declarations as follows:

matrix[P, N] X_t;
vector[K] alpha_t;
matrix[K, P] beta_t;

matrix[K, N] X_beta_t = beta_t * X_t;
for (n in 1:N)
  X_beta_t[ , n] += alpha_t;


The problem is then that our categorical_logit distribution isn't vectorized on the parameters, at least according to the doc.
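
So, putting the pre-transposed declarations together, here’s a minimal sketch of the looped version I’d expect to work, assuming X_t is built once from X in transformed data (priors omitted for brevity):

transformed data {
  matrix[P, N] X_t = X';               // transpose the data once, up front
}
parameters {
  vector[K] alpha_t;                   // not yet identified; see below
  matrix[K, P] beta_t;
}
model {
  matrix[K, N] X_beta_t = beta_t * X_t;
  for (n in 1:N) {
    X_beta_t[ , n] += alpha_t;
    y[n] ~ categorical_logit(X_beta_t[ , n]);
  }
}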

But the real thing you have to watch out for is identifiability.  One way to approach this is to pin the coefficients for one of the categories to zero.  That's easier to do, both conceptually and in code, after the fact (append the zero to the K - 1 dimensional linear predictor) rather than by setting a bunch of values to zero and then multiplying through only to get more zeros.
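
In code, that could look something like this (a sketch reusing your original alpha_raw and beta_raw, appending the zero after the multiply):

matrix[N, K - 1] eta_raw = X * beta_raw;   // only K - 1 columns of real work
for (n in 1:N)
  y[n] ~ categorical_logit(append_row(alpha_raw' + eta_raw[n]', 0));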

Hi,

Thank you both! Your help is very much appreciated. I will try to get the code working as soon as possible.

@Bob_Carpenter on the identifiability issue: Would you then say that the solution in the Stan manual of setting all the coefficients for one category to zero is inefficient? If I go with what you propose, which is to append the zero after the fact, does that annul the effect of that category? Is it only there as a reference point?

Apologies if these questions are basic. My Stan/Bayesian journey is only just beginning, so I have to get a feel for these things.

Kind regards,
Caio

It can be inefficient if it leads to difficult posterior geometry, but the bigger problem is that it’s hard to formulate priors when one of the values is pinned: every other value is then interpreted relative to that fixed value.

There are usually two ways to identify the model: set one of the coefficients to zero, or enforce that the coefficients sum to zero. The latter allows symmetric priors, but it fundamentally assumes you have the complete set of categories, because you can no longer add a new category and keep everything summing to zero.
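
For reference, a minimal sketch of the hard sum-to-zero approach for the coefficients (each row of beta sums to zero across the K categories; if you’re on a recent Stan, I believe there’s also a built-in sum_to_zero_vector type that handles this more gracefully):

parameters {
  matrix[P, K - 1] beta_raw;
}
transformed parameters {
  matrix[P, K] beta;
  for (p in 1:P)
    beta[p] = append_col(beta_raw[p], -sum(beta_raw[p]));  // last column balances the rest
}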