Speeding up a hierarchical multinomial logit model

JustinYap · August 10, 2017, 6:42am

Hi,

I’m trying to speed up the modified hierarchical multinomial logit model shown below. Is there anything I can do to optimize it such as vectorizing the nested for loops? Right now using rstan on my data, 100 iterations with 1 chain takes 340 seconds, and reducing the iterations would have a negative impact on the results. I am comparing the performance against a commercial product (Sawtooth software) which uses Gibbs sampling, and it takes less than a minute to run 20000 iterations. Surely I must be doing something suboptimal with stan that makes it much slower. Any help would be greatly appreciated!

data {
  int<lower=2> C; // Number of alternatives (choices) in each scenario
  int<lower=1> K; // Number of covariates of alternatives
  int<lower=1> R; // Number of respondents
  int<lower=1> S; // Number of scenarios per respondent
  int<lower=0> G; // Number of respondent covariates 
  int<lower=1,upper=C> YB[R, S]; // best choices
  int<lower=1,upper=C> YW[R, S]; // worst choices
  matrix[C, K] X[R, S]; // matrix of attributes for each obs
  matrix[G, R] Z; // vector of covariates for each respondent
}

parameters {
  matrix[K, R] Beta;
  matrix[K, G] Theta;
  corr_matrix[K] Omega;
  vector<lower=0>[K] tau;
}
transformed parameters {
  cov_matrix[K] Sigma = quad_form_diag(Omega, tau);
}
model {
  //priors
  to_vector(Theta) ~ normal(0, 10);
  tau ~ cauchy(0, 2.5); 
  Omega ~ lkj_corr(2);
  //likelihood
  for (r in 1:R) {
    Beta[,r] ~ multi_normal(Theta*Z[,r], Sigma);
    for (s in 1:S) {
      YB[r,s] ~ categorical_logit(X[r,s] * Beta[,r]);
      YW[r,s] ~ categorical_logit(-X[r,s] * Beta[,r]);
    }
  }
}

This code is modified from the example in https://github.com/ksvanhorn/ART-Forum-2017-Stan-Tutorial/blob/master/3_hmnl/hmnl.stan

stijn · August 10, 2017, 11:21am

With respect to the stan model:

You should try to use the multi_normal_cholesky instead of multi_normal implementation. The details are explained on p. 150 of the Stan Manual 2.16.
If S is not very large, your are probably better off using the non-centered parameterization for beta. Chapter 27 of the Stan Manual 2.16 discusses this in detail.
I am not sure about vectorizing. Maybe you can remove the internal S loop entirely and the s index in YB, YW and X. I have my doubts though, the number of indices makes my head hurt.
I am a bit concerned with the last two lines. You have a very strong implied constraint on beta that the response to X will be exactly the opposite for the outcomes YB and YW. This might make sense but it is hard for me to say without more context. I would like to see what happens if you comment out any of those line.

I am not part of the stan team but if there are no obvious errors in the code I suspect the stan experts will be suspicious about the comparison with Sawtooth. The typical questions you might expect are.

Are you sure the models are identical, also the priors?
What is the n_eff/minute for the gibbs implementation?
Have you checked whether it recovers the parameters for simulated data?

Further, if you have can finish sampling with stan, diagnostic information such as tree_depth and divergences might help in figuring out what goes wrong. You should report them here. Maybe try a subsample of your dataset?

JustinYap · August 11, 2017, 7:37am

Thanks for all your suggestions! I will investigate each one.

My dataset isn’t large: R = 302, S = 6

As for the last two lines, removing the last line would make my model the same as the original hierarchical multinomial logit model on which it was based (there is a link to it in my previous post). YB is intended to represent the “best” choice and YW is the “worst” choice, when respondents are asked to choose their most and least favourite alternatives (known as MaxDiff).

I get the warning that I should increase maximum treedepth and when I increase it to from 10 to 12, it runs slower than before. I also get the warning about the estimated Bayesian Fraction of Missing Information being low but I haven’t been able to get rid of it by increasing the number of iterations. I almost never get the warning about divergent transitions.

Bob_Carpenter · August 14, 2017, 3:21pm

First, you should take @stijn’s advice and use a multi_normal_cholesky. You should also vectorize the call to multi_normal_cholesky—that way your covariance matrix is only factored once.

Also as @stijn points out, if the model’s misspecified for the data (not a coding error, just not a good model for the data), it can be hard for Stan to sample.

You only want to compute X[r, s] * Beta[ , r] once—store it and reuse the result. I don’t think we’ve vectorized categorical logit, but there’s not much shared computation here. Just like @stijn, I’d be worried about those last two linked regressions.

You want to turn Theta * Z into a matrix multiplication, then break into an array to feed into multi_normal_cholesky.

Pinning one of the categoical logit parameters to zero by pinning the relevant coefficient to zero is often done to induce identifiability—the K-value parameterization is only identified through the prior.

JustinYap · August 16, 2017, 7:38am

Thanks Bob. I have implemented all your suggestions as well as @stijn’s and it is more than twice as fast now! The results make sense and predictive accuracy is high so I don’t think there is an issue with the model specification. My current code is below:

data {
  int<lower=2> C; // Number of alternatives (choices) in each scenario
  int<lower=1> K; // Number of alternatives
  int<lower=1> R; // Number of respondents
  int<lower=1> S; // Number of scenarios per respondent
  int<lower=1,upper=C> YB[R, S]; // best choices
  int<lower=1,upper=C> YW[R, S]; // worst choices
  matrix[C, K] X[R, S]; // matrix of attributes for each obs
}

parameters {
  vector[K] Beta[R];
  vector[K - 1] Theta_raw;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0, upper=pi()/2>[K] L_sigma_unif;
}

transformed parameters {
  vector<lower=0>[K] L_sigma;
  matrix[K, K] L_Sigma;
  vector[C] XB[R, S];
  vector[K] Theta;

  for (k in 1:K) {
    L_sigma[k] = 2.5 * tan(L_sigma_unif[k]);
  }

  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

  Theta[1] = 0;
  for (k in 1:(K-1)) {
    Theta[k + 1] = Theta_raw[k];
  }

  for (r in 1:R) {
    for (s in 1:S) {
      XB[r,s] = X[r,s] * Beta[r];
    }
  }
}

model {
  //priors
  Theta_raw ~ normal(0, 10);
  L_Omega ~ lkj_corr_cholesky(4);

  //likelihood
  Beta ~ multi_normal_cholesky(Theta, L_Sigma);
  for (r in 1:R) {
    for (s in 1:S) {
      YB[r,s] ~ categorical_logit(XB[r,s]);
      YW[r,s] ~ categorical_logit(-XB[r,s]);
    }
  }
}

Ric · August 31, 2018, 9:44pm

Hello Justin,
I am new to Stan and found this post most interesting, I assume it is related to your paper on trick logit in Displayr, which I also found very useful. I wonder if I can be given access to a small sub-sample of the data structure used to run di Stan code (I use Rstan) to better understand its mechanics.
Thanks
Ric

JustinYap · September 3, 2018, 7:31am

Sure @Ric , please have a look at the unit tests in https://github.com/Displayr/flipMaxDiff/blob/master/tests/testthat/test-hierarchicalbayes.R

tnearey · July 20, 2019, 6:45pm

Minor point: I think the comment on the Data section line
int<lower=1> K; // Number of alternatives
should read “// Number of attributes”.

tnearey · July 26, 2019, 9:04pm

Re: “Pinning one of the categoical logit parameters to zero by pinning the relevant coefficient to zero is often done to induce identifiability—the K-value parameterization is only identified through the prior.”

“Engineering” the identifibility issue generally requires more than just pinning one of the coefficients to zero. I think there may be confusion of K here (where it is the number columns of the X matrices) with the K of [https://github.com/stan-dev/stan/issues/266] (were it represents the number of choice categories, corresponding to C in @JustinYap’s code ).
Typically, subsets of the K columns of the X matrices are associated with specific categories and K is an integer (P) multiple of C (or of C-1, with identifiability constraints imposed). So the columns of X represent something like a vec(matrix[C-1,P]) in @ariddel’s remark at [https://github.com/stan-dev/stan/issues/266#issuecomment-50351532].

The html slides in unit 2 of [https://github.com/ksvanhorn/ART-Forum-2017-Stan-Tutorial] suggest that “effects coding” (aka “sum-to-zero coding” or in R as simply “sum coding”) of categorical contrasts may be better for symmetrical priors ( a concern in [https://github.com/stan-dev/stan/issues/266#issue-20713096] ) than pinning one reference category’s coefficients to zero (; they cite [http://bayesium.com/a-symmetric-effects-prior/] for details). This is taken care of in the ART-Forum tutorial examples in units 2 and 3 by constructing design matrices corresponding to X with only C-1 columns per attribute (plus C-1 intercepts). This effectively sets the coefficients for the C-th category to minus the sum of the coefficients of the other C-1 for each of the P ‘families’ of coefficients in K.

JustinYap · July 28, 2019, 11:11pm

Thanks for your comments @tnearey, in my case I think of K as the number of alternatives. I will try out effects coding.

markhw · August 15, 2019, 6:32pm

Justin, did you come to a final model that worked and ran well for your best-worst purposes? If so, could you show a template for this analysis? I am interested it using it to calculate individual-level estimates from a best-worst scaling deign.

I’m slowly working through the snippet above, and matrix[C, K] X[R, S]; looks to be for covariates? If I just want to calculate utility estimates for each individual-item combination, I don’t need this, do I?

JustinYap · August 18, 2019, 10:49pm

The one I used was more or less the same as the one I last posted. I haven’t implemented anything for covariates for this. matrix[C, K] X[R, S] just contains the design. The individual-level estimates are beta

markhw · August 20, 2019, 3:30am

Interesting. What do you mean by the design? That is, a 1 if the item was included and a 0 if it wasn’t? I can’t square how that leads to a C by K matrix of R by S arrays. Would it be too much to ask for you to provide an example of data you would feed to this script? I’d appreciate it.

JustinYap · August 21, 2019, 2:20am

Yes, 1 if it is included and 0 if it isn’t. Lets say there are 10 alternatives and but only 3 choices in each scenario, e.g. 2,3,7. Then C = 3 and K = 10 and the matrix would be

0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0

markhw · August 22, 2019, 8:39pm

Thank you! Sorry, last clarification question: in YB[R, S] and YW[R, S], do the choices refer to the row of each matrix[C, K] in X[R, S]? So if the design for X[1, 1] was:

0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0

And the best choice was the third alternative, would the integer in YB[1, 1] be 2, since it is the second row?

I first tried to give it an integer that represents the alternative, but, for example, if I wanted to say alternative 7 was the worst, this is above C and got rejected by my sampling call to rstan::stan() because of it. I assume I would say 3 if I wanted to say that 7 was the worst?

JustinYap · August 22, 2019, 10:58pm

That is correct @markhw

markhw · August 23, 2019, 1:59am

Did you get it to go any faster? I can get it to run—but I have 1000 participants judging 12 different alternatives across 9 scenarios with 4 options per scenario, and the model is taking a while to run. If I limit it to just 30 participants, I can get it to run and the results correlate with other methods at like .99, so I know it’s working.

JustinYap · August 23, 2019, 3:36am

Not really, unfortunately you’ll just have to wait for it to finish

tnearey · August 26, 2019, 9:50pm

Good. This explains my confusion over C and K above.
Your original post keeps the first two lines of Feit’s hmnl.stan:

int<lower=2> C; // Number of alternatives (choices) in each scenario
int<lower=1> K; // Number of covariates of alternatives

You changed that in your revised version, with good reason.
But you’re using a different kind of design matrices than in the vanHorn/Feit code. https://github.com/ksvanhorn/ART-Forum-2017-Stan-Tutorial/blob/master/3_hmnl/hmnl.stan

In the ‘chocolate’ example they provide,
the first C x K corresponding to X[1,1] in the stan data
is packed in R as

choc.standata$X[1,1,,]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]  0.6    1    0    0    0    0    0    1    0
[2,]  0.7    0    0    1    0    1    0    0    0
[3,]  3.6    1    0    0    0   -1   -1   -1   -1

markhw · October 3, 2019, 2:34am

Interesting. I’m running a design with 350 respondents, 13 tasks, 4 alternatives per task, and 13 alternatives in total, and using that code above takes multiple hours to run. Have you not gotten it to speed up from that?

Topic		Replies	Views
Hierarchical Multinomial Model Optimization/Specification Modeling performance , multinomial-response , hierarchical-model	15	1450	May 7, 2020
My first Stan model - hierarchical logistic regression Modeling cognitive-science	28	8344	June 6, 2017
Rstan much slower than self-coded MCMC Modeling performance	10	2419	January 14, 2022
Mixed/hierarchical logit code optimization Modeling cmdstanr , mixed-model , hierarchical-model , brms	6	1030	April 4, 2022
Troubleshooting a "MaxDiff" hierarchical multinomial regression RStan	4	1894	April 17, 2020

Speeding up a hierarchical multinomial logit model

Related topics