Speeding up a hierarchical multinomial logit model


#1

Hi,

I’m trying to speed up the modified hierarchical multinomial logit model shown below. Is there anything I can do to optimize it, such as vectorizing the nested for loops? Right now, using rstan on my data, 100 iterations with 1 chain takes 340 seconds, and reducing the number of iterations would have a negative impact on the results. I am comparing the performance against a commercial product (Sawtooth Software) which uses Gibbs sampling, and it takes less than a minute to run 20,000 iterations. Surely I must be doing something suboptimal with Stan that makes it so much slower. Any help would be greatly appreciated!

data {
  int<lower=2> C; // Number of alternatives (choices) in each scenario
  int<lower=1> K; // Number of covariates of alternatives
  int<lower=1> R; // Number of respondents
  int<lower=1> S; // Number of scenarios per respondent
  int<lower=0> G; // Number of respondent covariates 
  int<lower=1,upper=C> YB[R, S]; // best choices
  int<lower=1,upper=C> YW[R, S]; // worst choices
  matrix[C, K] X[R, S]; // matrix of attributes for each obs
  matrix[G, R] Z; // covariates for each respondent (one column per respondent)
}

parameters {
  matrix[K, R] Beta;
  matrix[K, G] Theta;
  corr_matrix[K] Omega;
  vector<lower=0>[K] tau;
}
transformed parameters {
  cov_matrix[K] Sigma = quad_form_diag(Omega, tau);
}
model {
  //priors
  to_vector(Theta) ~ normal(0, 10);
  tau ~ cauchy(0, 2.5); 
  Omega ~ lkj_corr(2);
  //likelihood
  for (r in 1:R) {
    Beta[,r] ~ multi_normal(Theta*Z[,r], Sigma);
    for (s in 1:S) {
      YB[r,s] ~ categorical_logit(X[r,s] * Beta[,r]);
      YW[r,s] ~ categorical_logit(-X[r,s] * Beta[,r]);
    }
  }
}

This code is modified from the example in https://github.com/ksvanhorn/ART-Forum-2017-Stan-Tutorial/blob/master/3_hmnl/hmnl.stan


#2

With respect to the Stan model:

  • You should try to use multi_normal_cholesky instead of the multi_normal implementation. The details are explained on p. 150 of the Stan Manual 2.16.
  • If S is not very large, you are probably better off using the non-centered parameterization for Beta. Chapter 27 of the Stan Manual 2.16 discusses this in detail; see the sketch after this list for how it can be combined with the Cholesky parameterization.
  • I am not sure about vectorizing. Maybe you can remove the internal S loop entirely along with the s index in YB, YW and X. I have my doubts though; the number of indices makes my head hurt.
  • I am a bit concerned about the last two lines. You have a very strong implied constraint on Beta: the response to X must be exactly the opposite for the outcomes YB and YW. This might make sense, but it is hard for me to say without more context. I would like to see what happens if you comment out either of those lines.
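
To make the first two points concrete, here is a rough sketch of how a Cholesky-factored, non-centered version of the hierarchy could look (keeping your data block as-is; the matrix alpha of standard-normal innovations is a name I am introducing just for illustration):

parameters {
  matrix[K, R] alpha;                // standardized innovations (non-centered)
  matrix[K, G] Theta;
  cholesky_factor_corr[K] L_Omega;   // Cholesky factor of the correlation matrix
  vector<lower=0>[K] tau;
}
transformed parameters {
  // Beta[,r] = Theta * Z[,r] + diag(tau) * L_Omega * alpha[,r], which implies
  // Beta[,r] ~ multi_normal_cholesky(Theta * Z[,r], diag_pre_multiply(tau, L_Omega))
  matrix[K, R] Beta = Theta * Z + diag_pre_multiply(tau, L_Omega) * alpha;
}
model {
  to_vector(alpha) ~ normal(0, 1);   // gives Beta its hierarchical prior
  to_vector(Theta) ~ normal(0, 10);
  tau ~ cauchy(0, 2.5);
  L_Omega ~ lkj_corr_cholesky(2);
  for (r in 1:R) {
    for (s in 1:S) {
      YB[r,s] ~ categorical_logit(X[r,s] * Beta[,r]);
      YW[r,s] ~ categorical_logit(-X[r,s] * Beta[,r]);
    }
  }
}

Because alpha is standard normal, the posterior geometry is usually easier for the sampler when the per-respondent data are weak, and the covariance matrix never has to be factored during sampling.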

I am not part of the Stan team, but if there are no obvious errors in the code, I suspect the Stan experts will be suspicious about the comparison with Sawtooth. The typical questions you might expect are:

  • Are you sure the models are identical, also the priors?
  • What is the n_eff/minute for the Gibbs implementation?
  • Have you checked whether it recovers the parameters for simulated data?

Further, if you can finish sampling with Stan, diagnostic information such as treedepth and divergences might help in figuring out what goes wrong; you should report them here. Maybe try a subsample of your dataset?


#3

Thanks for all your suggestions! I will investigate each one.

My dataset isn’t large: R = 302, S = 6

As for the last two lines, removing the last line would make my model the same as the original hierarchical multinomial logit model on which it was based (there is a link to it in my previous post). YB is intended to represent the “best” choice and YW is the “worst” choice, when respondents are asked to choose their most and least favourite alternatives (known as MaxDiff).

I get the warning that I should increase the maximum treedepth, and when I increase it from 10 to 12, it runs even slower than before. I also get the warning about the estimated Bayesian Fraction of Missing Information being low, but I haven’t been able to get rid of it by increasing the number of iterations. I almost never get the warning about divergent transitions.


#4

First, you should take @stijn’s advice and use multi_normal_cholesky. You should also vectorize the call to multi_normal_cholesky, so that your covariance matrix is only factored once.

Also as @stijn points out, if the model’s misspecified for the data (not a coding error, just not a good model for the data), it can be hard for Stan to sample.

You only want to compute X[r, s] * Beta[, r] once; store it and reuse the result. I don’t think we’ve vectorized categorical_logit, but there’s not much shared computation there anyway. Just like @stijn, I’d be worried about those last two linked regressions.

You want to turn Theta * Z into a single matrix multiplication, then break the result into an array to feed into multi_normal_cholesky.
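
A rough sketch of what that could look like, assuming Beta is declared as an array of vectors (vector[K] Beta[R]) and L_Sigma is the Cholesky factor of the covariance; the local array mu is a name used just for illustration:

transformed parameters {
  matrix[K, R] Mu = Theta * Z;                 // one matrix multiply for all respondents
}
model {
  vector[K] mu[R];
  for (r in 1:R)
    mu[r] = Mu[,r];                            // split the columns into an array
  Beta ~ multi_normal_cholesky(mu, L_Sigma);   // vectorized: L_Sigma is factored once
}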

Pinning one of the categorical logit parameters to zero, by pinning the relevant coefficient to zero, is often done to induce identifiability; the K-value parameterization is only identified through the prior.
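
One compact way to express that pinning (a generic fragment, not specific to your model; beta_raw and beta are placeholder names) is to declare only K - 1 free coefficients and prepend a zero:

parameters {
  vector[K - 1] beta_raw;                      // K - 1 free coefficients
}
transformed parameters {
  vector[K] beta = append_row(0, beta_raw);    // first coefficient pinned to zero
}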


#5

Thanks Bob. I have implemented all your suggestions as well as @stijn’s and it is more than twice as fast now! The results make sense and predictive accuracy is high so I don’t think there is an issue with the model specification. My current code is below:

data {
  int<lower=2> C; // Number of alternatives (choices) in each scenario
  int<lower=1> K; // Number of covariates of alternatives
  int<lower=1> R; // Number of respondents
  int<lower=1> S; // Number of scenarios per respondent
  int<lower=1,upper=C> YB[R, S]; // best choices
  int<lower=1,upper=C> YW[R, S]; // worst choices
  matrix[C, K] X[R, S]; // matrix of attributes for each obs
}

parameters {
  vector[K] Beta[R];
  vector[K - 1] Theta_raw;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0, upper=pi()/2>[K] L_sigma_unif;
}

transformed parameters {
  vector<lower=0>[K] L_sigma;
  matrix[K, K] L_Sigma;
  vector[C] XB[R, S];
  vector[K] Theta;

  // uniform(0, pi/2) plus the tan() transform gives L_sigma an implicit half-Cauchy(0, 2.5) prior
  for (k in 1:K) {
    L_sigma[k] = 2.5 * tan(L_sigma_unif[k]);
  }

  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

  // pin the first element of Theta to zero for identifiability
  Theta[1] = 0;
  for (k in 1:(K-1)) {
    Theta[k + 1] = Theta_raw[k];
  }

  for (r in 1:R) {
    for (s in 1:S) {
      XB[r,s] = X[r,s] * Beta[r]; // compute X * Beta once here and reuse it in the model block
    }
  }
}

model {
  //priors
  Theta_raw ~ normal(0, 10);
  L_Omega ~ lkj_corr_cholesky(4);

  //likelihood
  Beta ~ multi_normal_cholesky(Theta, L_Sigma); // vectorized: L_Sigma is factored only once
  for (r in 1:R) {
    for (s in 1:S) {
      YB[r,s] ~ categorical_logit(XB[r,s]);
      YW[r,s] ~ categorical_logit(-XB[r,s]);
    }
  }
}