 Speeding up a hierarchical multinomial logit model

Hi,

I’m trying to speed up the modified hierarchical multinomial logit model shown below. Is there anything I can do to optimize it such as vectorizing the nested for loops? Right now using rstan on my data, 100 iterations with 1 chain takes 340 seconds, and reducing the iterations would have a negative impact on the results. I am comparing the performance against a commercial product (Sawtooth software) which uses Gibbs sampling, and it takes less than a minute to run 20000 iterations. Surely I must be doing something suboptimal with stan that makes it much slower. Any help would be greatly appreciated!

data {
int<lower=2> C; // Number of alternatives (choices) in each scenario
int<lower=1> K; // Number of covariates of alternatives
int<lower=1> R; // Number of respondents
int<lower=1> S; // Number of scenarios per respondent
int<lower=0> G; // Number of respondent covariates
int<lower=1,upper=C> YB[R, S]; // best choices
int<lower=1,upper=C> YW[R, S]; // worst choices
matrix[C, K] X[R, S]; // matrix of attributes for each obs
matrix[G, R] Z; // vector of covariates for each respondent
}

parameters {
matrix[K, R] Beta;
matrix[K, G] Theta;
corr_matrix[K] Omega;
vector<lower=0>[K] tau;
}
transformed parameters {
cov_matrix[K] Sigma = quad_form_diag(Omega, tau);
}
model {
//priors
to_vector(Theta) ~ normal(0, 10);
tau ~ cauchy(0, 2.5);
Omega ~ lkj_corr(2);
//likelihood
for (r in 1:R) {
Beta[,r] ~ multi_normal(Theta*Z[,r], Sigma);
for (s in 1:S) {
YB[r,s] ~ categorical_logit(X[r,s] * Beta[,r]);
YW[r,s] ~ categorical_logit(-X[r,s] * Beta[,r]);
}
}
}

This code is modified from the example in https://github.com/ksvanhorn/ART-Forum-2017-Stan-Tutorial/blob/master/3_hmnl/hmnl.stan

1 Like

With respect to the stan model:

• You should try to use the multi_normal_cholesky instead of multi_normal implementation. The details are explained on p. 150 of the Stan Manual 2.16.
• If S is not very large, your are probably better off using the non-centered parameterization for beta. Chapter 27 of the Stan Manual 2.16 discusses this in detail.
• I am not sure about vectorizing. Maybe you can remove the internal S loop entirely and the s index in YB, YW and X. I have my doubts though, the number of indices makes my head hurt.
• I am a bit concerned with the last two lines. You have a very strong implied constraint on beta that the response to X will be exactly the opposite for the outcomes YB and YW. This might make sense but it is hard for me to say without more context. I would like to see what happens if you comment out any of those line.

I am not part of the stan team but if there are no obvious errors in the code I suspect the stan experts will be suspicious about the comparison with Sawtooth. The typical questions you might expect are.

• Are you sure the models are identical, also the priors?
• What is the n_eff/minute for the gibbs implementation?
• Have you checked whether it recovers the parameters for simulated data?

Further, if you have can finish sampling with stan, diagnostic information such as tree_depth and divergences might help in figuring out what goes wrong. You should report them here. Maybe try a subsample of your dataset?

5 Likes

Thanks for all your suggestions! I will investigate each one.

My dataset isn’t large: R = 302, S = 6

As for the last two lines, removing the last line would make my model the same as the original hierarchical multinomial logit model on which it was based (there is a link to it in my previous post). YB is intended to represent the “best” choice and YW is the “worst” choice, when respondents are asked to choose their most and least favourite alternatives (known as MaxDiff).

I get the warning that I should increase maximum treedepth and when I increase it to from 10 to 12, it runs slower than before. I also get the warning about the estimated Bayesian Fraction of Missing Information being low but I haven’t been able to get rid of it by increasing the number of iterations. I almost never get the warning about divergent transitions.

First, you should take @stijn’s advice and use a multi_normal_cholesky. You should also vectorize the call to multi_normal_cholesky—that way your covariance matrix is only factored once.

Also as @stijn points out, if the model’s misspecified for the data (not a coding error, just not a good model for the data), it can be hard for Stan to sample.

You only want to compute X[r, s] * Beta[ , r] once—store it and reuse the result. I don’t think we’ve vectorized categorical logit, but there’s not much shared computation here. Just like @stijn, I’d be worried about those last two linked regressions.

You want to turn Theta * Z into a matrix multiplication, then break into an array to feed into multi_normal_cholesky.

Pinning one of the categoical logit parameters to zero by pinning the relevant coefficient to zero is often done to induce identifiability—the K-value parameterization is only identified through the prior.

1 Like

Thanks Bob. I have implemented all your suggestions as well as @stijn’s and it is more than twice as fast now! The results make sense and predictive accuracy is high so I don’t think there is an issue with the model specification. My current code is below:

data {
int<lower=2> C; // Number of alternatives (choices) in each scenario
int<lower=1> K; // Number of alternatives
int<lower=1> R; // Number of respondents
int<lower=1> S; // Number of scenarios per respondent
int<lower=1,upper=C> YB[R, S]; // best choices
int<lower=1,upper=C> YW[R, S]; // worst choices
matrix[C, K] X[R, S]; // matrix of attributes for each obs
}

parameters {
vector[K] Beta[R];
vector[K - 1] Theta_raw;
cholesky_factor_corr[K] L_Omega;
vector<lower=0, upper=pi()/2>[K] L_sigma_unif;
}

transformed parameters {
vector<lower=0>[K] L_sigma;
matrix[K, K] L_Sigma;
vector[C] XB[R, S];
vector[K] Theta;

for (k in 1:K) {
L_sigma[k] = 2.5 * tan(L_sigma_unif[k]);
}

L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

Theta = 0;
for (k in 1:(K-1)) {
Theta[k + 1] = Theta_raw[k];
}

for (r in 1:R) {
for (s in 1:S) {
XB[r,s] = X[r,s] * Beta[r];
}
}
}

model {
//priors
Theta_raw ~ normal(0, 10);
L_Omega ~ lkj_corr_cholesky(4);

//likelihood
Beta ~ multi_normal_cholesky(Theta, L_Sigma);
for (r in 1:R) {
for (s in 1:S) {
YB[r,s] ~ categorical_logit(XB[r,s]);
YW[r,s] ~ categorical_logit(-XB[r,s]);
}
}
}
1 Like

Hello Justin,
I am new to Stan and found this post most interesting, I assume it is related to your paper on trick logit in Displayr, which I also found very useful. I wonder if I can be given access to a small sub-sample of the data structure used to run di Stan code (I use Rstan) to better understand its mechanics.
Thanks
Ric

Sure @Ric , please have a look at the unit tests in https://github.com/Displayr/flipMaxDiff/blob/master/tests/testthat/test-hierarchicalbayes.R

Minor point: I think the comment on the Data section line
int<lower=1> K; // Number of alternatives
should read “// Number of attributes”.

1 Like

Re: “Pinning one of the categoical logit parameters to zero by pinning the relevant coefficient to zero is often done to induce identifiability—the K-value parameterization is only identified through the prior.”

“Engineering” the identifibility issue generally requires more than just pinning one of the coefficients to zero. I think there may be confusion of K here (where it is the number columns of the X matrices) with the K of [https://github.com/stan-dev/stan/issues/266] (were it represents the number of choice categories, corresponding to C in @JustinYap’s code ).
Typically, subsets of the K columns of the X matrices are associated with specific categories and K is an integer (P) multiple of C (or of C-1, with identifiability constraints imposed). So the columns of X represent something like a vec(matrix[C-1,P]) in @ariddel’s remark at [https://github.com/stan-dev/stan/issues/266#issuecomment-50351532].

The html slides in unit 2 of [https://github.com/ksvanhorn/ART-Forum-2017-Stan-Tutorial] suggest that “effects coding” (aka “sum-to-zero coding” or in R as simply “sum coding”) of categorical contrasts may be better for symmetrical priors ( a concern in [https://github.com/stan-dev/stan/issues/266#issue-20713096] ) than pinning one reference category’s coefficients to zero (; they cite [http://bayesium.com/a-symmetric-effects-prior/] for details). This is taken care of in the ART-Forum tutorial examples in units 2 and 3 by constructing design matrices corresponding to X with only C-1 columns per attribute (plus C-1 intercepts). This effectively sets the coefficients for the C-th category to minus the sum of the coefficients of the other C-1 for each of the P ‘families’ of coefficients in K.

1 Like

Thanks for your comments @tnearey, in my case I think of K as the number of alternatives. I will try out effects coding.

Justin, did you come to a final model that worked and ran well for your best-worst purposes? If so, could you show a template for this analysis? I am interested it using it to calculate individual-level estimates from a best-worst scaling deign.

I’m slowly working through the snippet above, and matrix[C, K] X[R, S]; looks to be for covariates? If I just want to calculate utility estimates for each individual-item combination, I don’t need this, do I?

The one I used was more or less the same as the one I last posted. I haven’t implemented anything for covariates for this. matrix[C, K] X[R, S] just contains the design. The individual-level estimates are beta

1 Like

Interesting. What do you mean by the design? That is, a 1 if the item was included and a 0 if it wasn’t? I can’t square how that leads to a C by K matrix of R by S arrays. Would it be too much to ask for you to provide an example of data you would feed to this script? I’d appreciate it.

Yes, 1 if it is included and 0 if it isn’t. Lets say there are 10 alternatives and but only 3 choices in each scenario, e.g. 2,3,7. Then C = 3 and K = 10 and the matrix would be

0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0

1 Like

Thank you! Sorry, last clarification question: in YB[R, S] and YW[R, S], do the choices refer to the row of each matrix[C, K] in X[R, S]? So if the design for X[1, 1] was:

0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0

And the best choice was the third alternative, would the integer in YB[1, 1] be 2, since it is the second row?

I first tried to give it an integer that represents the alternative, but, for example, if I wanted to say alternative 7 was the worst, this is above C and got rejected by my sampling call to rstan::stan() because of it. I assume I would say 3 if I wanted to say that 7 was the worst?

That is correct @markhw

1 Like

Did you get it to go any faster? I can get it to run—but I have 1000 participants judging 12 different alternatives across 9 scenarios with 4 options per scenario, and the model is taking a while to run. If I limit it to just 30 participants, I can get it to run and the results correlate with other methods at like .99, so I know it’s working.

Not really, unfortunately you’ll just have to wait for it to finish

Good. This explains my confusion over C and K above.
Your original post keeps the first two lines of Feit’s hmnl.stan:

int<lower=2> C; // Number of alternatives (choices) in each scenario
int<lower=1> K; // Number of covariates of alternatives

You changed that in your revised version, with good reason.
But you’re using a different kind of design matrices than in the vanHorn/Feit code. https://github.com/ksvanhorn/ART-Forum-2017-Stan-Tutorial/blob/master/3_hmnl/hmnl.stan

In the ‘chocolate’ example they provide,
the first C x K corresponding to X[1,1] in the stan data
is packed in R as

choc.standata\$X[1,1,,]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]  0.6    1    0    0    0    0    0    1    0
[2,]  0.7    0    0    1    0    1    0    0    0
[3,]  3.6    1    0    0    0   -1   -1   -1   -1

Interesting. I’m running a design with 350 respondents, 13 tasks, 4 alternatives per task, and 13 alternatives in total, and using that code above takes multiple hours to run. Have you not gotten it to speed up from that?