Comparing implementations of mixed logit Bayesian inference

I began looking into Bayesian estimation of mixed multinomial logit models a few months ago. I started with TensorFlow Probability but, for a variety of reasons, moved onto Stan. I’m now collecting my results into a paper comparing runtimes in Stan using NUTS and VB against an implementation of the Allenby-Train method, which uses a combination of Gibbs and Metropolis. I am also comparing against a recent paper that proposes a variational bayes algorithm for mixed multinomial logit to see how it compares with ADVI in Stan.

One of the issues with past comparisons in my field is that the authors made their comparisons based on total run-time and/or reproducing known parameter values. I know many of the authors, so I am not discounting what is excellent work (a few of them developed the mixed logit model). My objective is simply to perform a more complete comparison.

At this point, I am getting reasonably good results from both the Allenby-Train and NUTS methods. However, I do find similar results to them in that the NUTS implementation is quite a bit slower (16 minutes vs. 355 minutes). Comparison according to ESS brings NUTS about in line with Allenby-Train, but I wonder if I can do better. It’s a fairly simple model, but there may be minor tweaks that make a difference. I know that @daniel_guhl did some similar work.

I have a few questions/points of clarification.

  1. My comparison model does not provide ESS. I use the R package MCSE to calculate them. I know it uses a method by Gong and Felgal (2015). Does anyone know how comparable the two ESS are?
  2. In one of the papers I reference (Bansal et al., 2020 - available including code in my github repository) they state that Stan ran slow but could be improved by coding analytical gradients. I am relatively new to the Bayesian field/Stan, but my understanding is that doesn’t make much sense because Stan uses automatic differentiation and various transformations to simplify things in NUTS/ADVI. Analytical gradients wouldn’t help, no? Plus, it defeats the purpose of using probabilistic programming if you have to custom-code everything.
  3. Does anyone have thoughts on whether it is a fair comparison to take the run-time of 4 chains from Stan and compare against time for 1 chain from another package. On the one hand, 4 cores is approximately 4x the computer time. On the other hand, if the other package doesn’t run in parallel then Stan can give 4x as many samples in the same observed time (which is what you really care about as a person sitting at a computer waiting for a model to finish).

Additional details:
I am running the models on a clean Google Cloud Linux environment and using RStan (comparison model is also written in R). I’ve attached a very rough draft, mostly in case anyone is interested in seeing the plots/tables:
Bayes_Comparison_DRAFT.pdf (519.5 KB)
All the code is available in the following repository:

A sample of one of the models (excluding the data block) - there are 2,500 observations in this case. I multiply by a vector of ones [1,5] because I find it easier to use what’s called “wide format” (include all alternatives in the same row) than the indexing method I’ve seen done in similar Stan models.

parameters {
  vector[PF] betan; // hypermeans of the part-worths for fixed parameters with normal priors
  matrix[PR, I] z; // individual random effects (unscaled) (standardized component) for random parameters
  vector<lower=0,upper=pi()/2>[PR] tau_unif; // prior scale
  cholesky_factor_corr[PR] L_Omega; // prior correlation
  vector[PR] gamma; // random coeffs
transformed parameters {
  vector<lower=0>[PR] tau;     // prior scale
  matrix[I, PR] beta_ind;
  for (k in 1:PR) tau[k] = 2.5 * tan(tau_unif[k]);
  beta_ind = rep_matrix(gamma', I) + (diag_pre_multiply(tau,L_Omega) * z)';

model {
  // create a temporary holding vector
  vector[I*T] log_prob; // we will add the log_prob for each scenario and pass it as a single vector to the posterior estimation engine
  vector[K] utils; // vector of utilities for each alternative
  int ind=1; // individual id for a given IT to translate to individual-level heterogeneity for multi-task dataset
  row_vector[K] ones = rep_row_vector(1, K);
  // priors on the parameters
  to_vector(z) ~ std_normal();
  betan ~ normal(0, 5);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(gamma) ~ normal(0, 5);

  // log probabilities of each choice in the dataset
  for(i in 1:IT) {
      utils[1] = betan[1]*X[i,5]+betan[2]*X[i,6]+betan[3]*X[i,7]+betan[4]*X[i,8]+betan[5]*X[i,9]+betan[6]*X[i,10]+betan[7]*X[i,11]+
      utils[2] = betan[1]*X[i,16]+betan[2]*X[i,17]+betan[3]*X[i,18]+betan[4]*X[i,19]+betan[5]*X[i,20]+betan[6]*X[i,21]+betan[7]*X[i,22]+
      utils[3] = betan[1]*X[i,27]+betan[2]*X[i,28]+betan[3]*X[i,29]+betan[4]*X[i,30]+betan[5]*X[i,31]+betan[6]*X[i,32]+betan[7]*X[i,33]+
      utils[4] = betan[1]*X[i,38]+betan[2]*X[i,39]+betan[3]*X[i,40]+betan[4]*X[i,41]+betan[5]*X[i,42]+betan[6]*X[i,43]+betan[7]*X[i,44]+
      utils[5] = betan[1]*X[i,49]+betan[2]*X[i,50]+betan[3]*X[i,51]+betan[4]*X[i,52]+betan[5]*X[i,53]+betan[6]*X[i,54]+betan[7]*X[i,55]+
      log_prob[i] = ones * (log_softmax(utils) .* choice[i,1:K]');
      if(fmod(i, T)==0){
        ind = ind + 1;
  target += log_prob';

Why not apply the monitor function to the samples from the comparison model?

My understanding is that any MCMC algorithm should be run as multiple chains to ensure there is no influence of the initialization context. Is this provably not a concern with your comparison method?

Which model from the repo is this? I don’t have time to look at it in any detail until the weekend, but two notes:

  • your model implies a cauchy prior on tau; you’re using the more efficient tan-uniform parameterization, but I believe that recent advice is to still avoid these unless they are strongly motivated by concrete theory. Previously folks used heavy-tailed priors to (I think) attempt to make things more “robust” in some sense by allowing more mass to extreme values. However, it turns out that this causes a posterior geometry that is very difficult to sample, causing Stan to degrade in performance dramatically. (It is possible that your alternate methods appear to sample more efficiently because they are not as sensitive to the true geometry and therefore seeming to perform better while being less accurate. Note I say “possible” because I can’t speak likelihood; just wanted to put this possibility on your radar)
  • usually find that in hierarchical models there is often a lot of redundant computation in the more obvious approaches to coding them. I can’t confirm this is the case with your data until I discern which dataset/model from the repo you’re talking about, but in the meantime you could check out this demo of what I mean by redundant computation and how to eliminate it. Arguably this might induce an unfair advantage when then comparing against a non-Stan sampler, so you’d probably want to implement the same tricks in whatever language those use for expressing the model too.

Thanks for the suggestion. It looks like I should be able to apply this to get comparable ESS.

It’s the model “synth_model”

Thanks for the note. I was working off this post.

Thanks for the feedback. I’ll have a look through your discussion and see if it helps. It sounds applicable on first reading. The comparison model is fairly boilerplate. You don’t have much control over the code. I imagine I wouldn’t change it (would have to go into their source code, which I assume is in some form of Rcpp) because the point is to compare what is possible to accomplish in a flexible language like Stan vs. the conventional implementation. Thus, why I use the LKJ and tricks in prior specification whereas the comparison model is just a given prior (you can change the mean/variance).

I’ve made a few small adjustments to my code. I changed the prior on \tau to just N(0,2) and changed the way I reference individuals with multiple observations. The code still runs quite slow. Looking at the user manual, my model is non-centered. My parameters are basically distributed as multivariate normal. Should I be changing them based on this code rather than the this version. I want some version of a Wishart prior on the covariances. Is there any additional vectorizing I could do?
@mike-lawrence you standardized your variables in the linked code, which fits with the recommendation in the efficiency tuning section of the user manual. I suppose I could also do it with my code and convert it back after.