I began looking into Bayesian estimation of mixed multinomial logit models a few months ago. I started with TensorFlow Probability but, for a variety of reasons, moved onto Stan. I’m now collecting my results into a paper comparing runtimes in Stan using NUTS and VB against an implementation of the Allenby-Train method, which uses a combination of Gibbs and Metropolis. I am also comparing against a recent paper that proposes a variational bayes algorithm for mixed multinomial logit to see how it compares with ADVI in Stan.

One of the issues with past comparisons in my field is that the authors made their comparisons based on total run-time and/or reproducing known parameter values. I know many of the authors, so I am not discounting what is excellent work (a few of them developed the mixed logit model). My objective is simply to perform a more complete comparison.

At this point, I am getting reasonably good results from both the Allenby-Train and NUTS methods. However, I do find similar results to them in that the NUTS implementation is quite a bit slower (16 minutes vs. 355 minutes). Comparison according to ESS brings NUTS about in line with Allenby-Train, but I wonder if I can do better. It’s a fairly simple model, but there may be minor tweaks that make a difference. I know that @daniel_guhl did some similar work.

**I have a few questions/points of clarification.**

- My comparison model does not provide ESS. I use the R package MCSE to calculate them. I know it uses a method by Gong and Felgal (2015). Does anyone know how comparable the two ESS are?
- In one of the papers I reference (Bansal et al., 2020 - available including code in my github repository) they state that Stan ran slow but could be improved by coding analytical gradients. I am relatively new to the Bayesian field/Stan, but my understanding is that doesn’t make much sense because Stan uses automatic differentiation and various transformations to simplify things in NUTS/ADVI. Analytical gradients wouldn’t help, no? Plus, it defeats the purpose of using probabilistic programming if you have to custom-code everything.
- Does anyone have thoughts on whether it is a fair comparison to take the run-time of 4 chains from Stan and compare against time for 1 chain from another package. On the one hand, 4 cores is approximately 4x the computer time. On the other hand, if the other package doesn’t run in parallel then Stan can give 4x as many samples in the same observed time (which is what you really care about as a person sitting at a computer waiting for a model to finish).

**Additional details:**

I am running the models on a clean Google Cloud Linux environment and using RStan (comparison model is also written in R). I’ve attached a very rough draft, mostly in case anyone is interested in seeing the plots/tables:

Bayes_Comparison_DRAFT.pdf (519.5 KB)

All the code is available in the following repository:

A sample of one of the models (excluding the data block) - there are 2,500 observations in this case. I multiply by a vector of ones [1,5] because I find it easier to use what’s called “wide format” (include all alternatives in the same row) than the indexing method I’ve seen done in similar Stan models.

```
parameters {
vector[PF] betan; // hypermeans of the part-worths for fixed parameters with normal priors
matrix[PR, I] z; // individual random effects (unscaled) (standardized component) for random parameters
vector<lower=0,upper=pi()/2>[PR] tau_unif; // prior scale
cholesky_factor_corr[PR] L_Omega; // prior correlation
vector[PR] gamma; // random coeffs
}
transformed parameters {
vector<lower=0>[PR] tau; // prior scale
matrix[I, PR] beta_ind;
for (k in 1:PR) tau[k] = 2.5 * tan(tau_unif[k]);
beta_ind = rep_matrix(gamma', I) + (diag_pre_multiply(tau,L_Omega) * z)';
}
model {
// create a temporary holding vector
vector[I*T] log_prob; // we will add the log_prob for each scenario and pass it as a single vector to the posterior estimation engine
vector[K] utils; // vector of utilities for each alternative
int ind=1; // individual id for a given IT to translate to individual-level heterogeneity for multi-task dataset
row_vector[K] ones = rep_row_vector(1, K);
// priors on the parameters
to_vector(z) ~ std_normal();
betan ~ normal(0, 5);
L_Omega ~ lkj_corr_cholesky(2);
to_vector(gamma) ~ normal(0, 5);
// log probabilities of each choice in the dataset
for(i in 1:IT) {
utils[1] = betan[1]*X[i,5]+betan[2]*X[i,6]+betan[3]*X[i,7]+betan[4]*X[i,8]+betan[5]*X[i,9]+betan[6]*X[i,10]+betan[7]*X[i,11]+
beta_ind[ind,1]*X[i,12]+beta_ind[ind,2]*X[i,13]+beta_ind[ind,3]*X[i,14]+beta_ind[ind,4]*X[i,15];
utils[2] = betan[1]*X[i,16]+betan[2]*X[i,17]+betan[3]*X[i,18]+betan[4]*X[i,19]+betan[5]*X[i,20]+betan[6]*X[i,21]+betan[7]*X[i,22]+
beta_ind[ind,1]*X[i,23]+beta_ind[ind,2]*X[i,24]+beta_ind[ind,3]*X[i,25]+beta_ind[ind,4]*X[i,26];
utils[3] = betan[1]*X[i,27]+betan[2]*X[i,28]+betan[3]*X[i,29]+betan[4]*X[i,30]+betan[5]*X[i,31]+betan[6]*X[i,32]+betan[7]*X[i,33]+
beta_ind[ind,1]*X[i,34]+beta_ind[ind,2]*X[i,35]+beta_ind[ind,3]*X[i,36]+beta_ind[ind,4]*X[i,37];
utils[4] = betan[1]*X[i,38]+betan[2]*X[i,39]+betan[3]*X[i,40]+betan[4]*X[i,41]+betan[5]*X[i,42]+betan[6]*X[i,43]+betan[7]*X[i,44]+
beta_ind[ind,1]*X[i,45]+beta_ind[ind,2]*X[i,46]+beta_ind[ind,3]*X[i,47]+beta_ind[ind,4]*X[i,48];
utils[5] = betan[1]*X[i,49]+betan[2]*X[i,50]+betan[3]*X[i,51]+betan[4]*X[i,52]+betan[5]*X[i,53]+betan[6]*X[i,54]+betan[7]*X[i,55]+
beta_ind[ind,1]*X[i,56]+beta_ind[ind,2]*X[i,57]+beta_ind[ind,3]*X[i,58]+beta_ind[ind,4]*X[i,59];
log_prob[i] = ones * (log_softmax(utils) .* choice[i,1:K]');
if(fmod(i, T)==0){
ind = ind + 1;
}
}
target += log_prob';
}
```