Speeding up a hierarchical multinomial logit model

First, you should take @stijn’s advice and use a multi_normal_cholesky. You should also vectorize the call to multi_normal_cholesky—that way your covariance matrix is only factored once.

Also as @stijn points out, if the model’s misspecified for the data (not a coding error, just not a good model for the data), it can be hard for Stan to sample.

You only want to compute X[r, s] * Beta[ , r] once—store it and reuse the result. I don’t think we’ve vectorized categorical logit, but there’s not much shared computation here. Just like @stijn, I’d be worried about those last two linked regressions.

You want to turn Theta * Z into a matrix multiplication, then break into an array to feed into multi_normal_cholesky.

Pinning one of the categoical logit parameters to zero by pinning the relevant coefficient to zero is often done to induce identifiability—the K-value parameterization is only identified through the prior.

1 Like

Thanks Bob. I have implemented all your suggestions as well as @stijn’s and it is more than twice as fast now! The results make sense and predictive accuracy is high so I don’t think there is an issue with the model specification. My current code is below:

data {
  int<lower=2> C; // Number of alternatives (choices) in each scenario
  int<lower=1> K; // Number of alternatives
  int<lower=1> R; // Number of respondents
  int<lower=1> S; // Number of scenarios per respondent
  int<lower=1,upper=C> YB[R, S]; // best choices
  int<lower=1,upper=C> YW[R, S]; // worst choices
  matrix[C, K] X[R, S]; // matrix of attributes for each obs

parameters {
  vector[K] Beta[R];
  vector[K - 1] Theta_raw;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0, upper=pi()/2>[K] L_sigma_unif;

transformed parameters {
  vector<lower=0>[K] L_sigma;
  matrix[K, K] L_Sigma;
  vector[C] XB[R, S];
  vector[K] Theta;

  for (k in 1:K) {
    L_sigma[k] = 2.5 * tan(L_sigma_unif[k]);

  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

  Theta[1] = 0;
  for (k in 1:(K-1)) {
    Theta[k + 1] = Theta_raw[k];

  for (r in 1:R) {
    for (s in 1:S) {
      XB[r,s] = X[r,s] * Beta[r];

model {
  Theta_raw ~ normal(0, 10);
  L_Omega ~ lkj_corr_cholesky(4);

  Beta ~ multi_normal_cholesky(Theta, L_Sigma);
  for (r in 1:R) {
    for (s in 1:S) {
      YB[r,s] ~ categorical_logit(XB[r,s]);
      YW[r,s] ~ categorical_logit(-XB[r,s]);
1 Like

Hello Justin,
I am new to Stan and found this post most interesting, I assume it is related to your paper on trick logit in Displayr, which I also found very useful. I wonder if I can be given access to a small sub-sample of the data structure used to run di Stan code (I use Rstan) to better understand its mechanics.

Sure @Ric , please have a look at the unit tests in https://github.com/Displayr/flipMaxDiff/blob/master/tests/testthat/test-hierarchicalbayes.R

Minor point: I think the comment on the Data section line
int<lower=1> K; // Number of alternatives
should read “// Number of attributes”.

1 Like

Re: “Pinning one of the categoical logit parameters to zero by pinning the relevant coefficient to zero is often done to induce identifiability—the K-value parameterization is only identified through the prior.”

“Engineering” the identifibility issue generally requires more than just pinning one of the coefficients to zero. I think there may be confusion of K here (where it is the number columns of the X matrices) with the K of [https://github.com/stan-dev/stan/issues/266] (were it represents the number of choice categories, corresponding to C in @JustinYap’s code ).
Typically, subsets of the K columns of the X matrices are associated with specific categories and K is an integer (P) multiple of C (or of C-1, with identifiability constraints imposed). So the columns of X represent something like a vec(matrix[C-1,P]) in @ariddel’s remark at [https://github.com/stan-dev/stan/issues/266#issuecomment-50351532].

The html slides in unit 2 of [https://github.com/ksvanhorn/ART-Forum-2017-Stan-Tutorial] suggest that “effects coding” (aka “sum-to-zero coding” or in R as simply “sum coding”) of categorical contrasts may be better for symmetrical priors ( a concern in [https://github.com/stan-dev/stan/issues/266#issue-20713096] ) than pinning one reference category’s coefficients to zero (; they cite [http://bayesium.com/a-symmetric-effects-prior/] for details). This is taken care of in the ART-Forum tutorial examples in units 2 and 3 by constructing design matrices corresponding to X with only C-1 columns per attribute (plus C-1 intercepts). This effectively sets the coefficients for the C-th category to minus the sum of the coefficients of the other C-1 for each of the P ‘families’ of coefficients in K.

1 Like

Thanks for your comments @tnearey, in my case I think of K as the number of alternatives. I will try out effects coding.

Justin, did you come to a final model that worked and ran well for your best-worst purposes? If so, could you show a template for this analysis? I am interested it using it to calculate individual-level estimates from a best-worst scaling deign.

I’m slowly working through the snippet above, and matrix[C, K] X[R, S]; looks to be for covariates? If I just want to calculate utility estimates for each individual-item combination, I don’t need this, do I?

The one I used was more or less the same as the one I last posted. I haven’t implemented anything for covariates for this. matrix[C, K] X[R, S] just contains the design. The individual-level estimates are beta

1 Like

Interesting. What do you mean by the design? That is, a 1 if the item was included and a 0 if it wasn’t? I can’t square how that leads to a C by K matrix of R by S arrays. Would it be too much to ask for you to provide an example of data you would feed to this script? I’d appreciate it.

Yes, 1 if it is included and 0 if it isn’t. Lets say there are 10 alternatives and but only 3 choices in each scenario, e.g. 2,3,7. Then C = 3 and K = 10 and the matrix would be

0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0

1 Like

Thank you! Sorry, last clarification question: in YB[R, S] and YW[R, S], do the choices refer to the row of each matrix[C, K] in X[R, S]? So if the design for X[1, 1] was:

0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0

And the best choice was the third alternative, would the integer in YB[1, 1] be 2, since it is the second row?

I first tried to give it an integer that represents the alternative, but, for example, if I wanted to say alternative 7 was the worst, this is above C and got rejected by my sampling call to rstan::stan() because of it. I assume I would say 3 if I wanted to say that 7 was the worst?

That is correct @markhw

1 Like

Did you get it to go any faster? I can get it to run—but I have 1000 participants judging 12 different alternatives across 9 scenarios with 4 options per scenario, and the model is taking a while to run. If I limit it to just 30 participants, I can get it to run and the results correlate with other methods at like .99, so I know it’s working.

Not really, unfortunately you’ll just have to wait for it to finish

Good. This explains my confusion over C and K above.
Your original post keeps the first two lines of Feit’s hmnl.stan:

int<lower=2> C; // Number of alternatives (choices) in each scenario
int<lower=1> K; // Number of covariates of alternatives

You changed that in your revised version, with good reason.
But you’re using a different kind of design matrices than in the vanHorn/Feit code. https://github.com/ksvanhorn/ART-Forum-2017-Stan-Tutorial/blob/master/3_hmnl/hmnl.stan

In the ‘chocolate’ example they provide,
the first C x K corresponding to X[1,1] in the stan data
is packed in R as

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]  0.6    1    0    0    0    0    0    1    0
[2,]  0.7    0    0    1    0    1    0    0    0
[3,]  3.6    1    0    0    0   -1   -1   -1   -1

Interesting. I’m running a design with 350 respondents, 13 tasks, 4 alternatives per task, and 13 alternatives in total, and using that code above takes multiple hours to run. Have you not gotten it to speed up from that?

That sounds slow. It might be easiest if you try it out in our product called Displayr. We have a feature called MaxDiff which uses an updated version of the stan code:

1 Like

Interesting. Thank you for the link, but I’m trying to get this to work in my established workflow. I will probably create a new topic, making sure that I have the data formatted correctly, rather than keep bothering you on this thread. Thank you!

I posted my own question, with all my data and R code, here: Troubleshooting a "MaxDiff" hierarchical multinomial regression I would very much appreciate if you could take a look at my code and see what I may be doing incorrectly!