Mixed Logit Model


@jeremy.koster - Could you provide some summary stats re: your data? How often do fishermen choose the same spots? Do the same people go to the same places or do they alternate? Do they often go to particular places around the same time or do they scatter across the 10 possible sites? If they coalesce at one (or a few) particular places at a time, is there a pattern to which places these are (e.g. weather, fish migration patterns, etc.)?

There are a number of likely helpful models in economics for your problem, but which one depends on the answers to the above. I’d be happy to suggest something more concrete upon hearing more :)


Thanks, Jim. If you and Bob or others know researchers who are well-versed in good models for these data and open to dabbling in an anthropological problem, I’d be glad to connect them with the lead investigators on the project. I’d love to see more choice models in anthropology, which doesn’t have much experience with such approaches.


Thanks shoshievass,

As a disclaimer, the data were collected by a researcher in Grenada, and I am not the one who is best-positioned to summarize the data. That said, the basic details are that all fishing activities by approximately 13 boat captains were documented for about one year. The captains have the option of choosing from about 9 fishing sites, but there are rules about the order in which boats can fish at these sites, largely determined by queuing according to time of arrival. The outcomes (n = 998) of each fishing episode are known (in terms of how much fish were caught by boat j) – which is a positively skewed distribution – along with the date and time and the length of the fishing episode. But there is no way to know what the expected productivity of unexploited sites would be at time t. The sites are all close enough that weather should be similar across sites, but there may be ecological aspects of the bays that mean some sites are more or less productive at different times.

After fishing at a site, captains have the option of getting in line for another turn at that site, or switching to a different site. Based on what I’ve seen of the data, these decisions could be reconstructed (with some moderate coarseness in the time frames at which these decisions were documented). I haven’t looked too closely at summary statistics, but according to the researchers, it seems common for captains to cluster together at productive sites . . . unless things get too crowded, at which point some will disperse to look for less crowded sites. (Fortunately, it seems that there is no fishing by other captains who were not in the study population, so the distribution of boats at any particular time is knowable, albeit with some measurement uncertainty.)


@richard_mcelreath, author of Statistical Rethinking, is an anthropologist. He’ll probably know what’s up in Anthro, though this seems like a very generic kind of sequential choice problem, which is not to say that it’s easy (any inference involving projected discrete choice can get hard due to combinatorics and require dynamic programming algorithms rather than brute force).


Thanks, Bob.

Actually, Richard is a co-author of mine, and these data arrived on my desk via another mutual colleague from our corner of anthropology. In general, choice models are not widely used in our field, but it’s possible that some of my colleagues may have connections to ecologists who have done work on analogous topics (perhaps including the Hidden Markov models that you had recommended previously).


I guess that’s not too surprising given that we’re not exactly all here at random.


I’ll respond properly as soon as I get a chance! (Probably over the weekend - sorry; a bit of a hectic time for me!)


Hello everyone, I just bumped into this blog following the suggestion from jeremy.koster (thanks Jeremy!). I learned a lot from the code shared here (thanks Jim!). One question: has anyone implemented in Stan a model with utility in the WTP-space so that coefficients can be interpreted directly as marginal WTPs for choice attributes, thereby avoiding the problem of indefiniteness of (some) distributions of ratios of random variables?


I am resurrecting this thread because the colleague with fishing data from Grenada (described above) is interested in making a push on the data analysis. I remain convinced that a conditional logit would be a good option for modeling the data.

For someone with prior experience modeling choices in RStan, I anticipate that the modeling would be reasonably straightforward, and the data could be readied so that the modeling itself is the only remaining task. The main research question is the extent to which the number of other captains already in a particular bay influences how the fishing captains choose among the 10 or so bays where they can fish in Grenada.

In terms of impact, it’s not clear yet where this analysis will eventually get published. If it were an ecological journal (something like Behavioral Ecology perhaps), it would be noteworthy for being one of the first applications of discrete choice models to data of this kind.

Among the users of the board (e.g., @shoshievass and @James_Savage), are there recommendations for potential collaborators?



I haven’t seen anyone do this, but was working on something very similar for some auction analysis recently. Will let you know when I get around to it!


Hey @jeremy.koster! I could be persuaded. Quite a few side-projects at the moment, but if it is, as you say, down to getting a mixed logit model running, that should be doable. Best email for me is javage@gmail.com



Hi Ric,
I used such a model as an example at the eRum conference in Poznan 2 years ago. See slides 14 and 16 of this presentation (eRum2016_GG.pdf (1.5 MB)). I’ll post something here, but first I need to find the old code…
I’m very interested in choice models in WTP space and would love to get some feedback regarding the implementation.



Dear Jim and Daniel (or even Bob),
I am very new to Stan, but inspired by Jim’s and Daniel’s code for mixed logit I have tried a baby version of a mixed logit (MXL) in WTP-space. I tested it with synthetic data and it seems to work OK, but quite slowly, probably because of the loops. I wonder if you can give me some general suggestions, especially about how to vectorise it to gain efficiency.

Code is below:

mxlwtp = "
// ***************** declare dimensions of variables to be received by R
data { 
int<lower = 1> N ;                   // number of choices 
int<lower = 1> K ;                   // number of attributes including price
int<lower = 2> J ;                   // number of alts.
int<lower = 1> I ;                   // number of resp.
int<lower = 1 , upper = J> y[N] ;    // dep var
int<lower = 1 , upper = I> id[N] ;   // resp. index
matrix[N*J, K] x;                    // attributes (Kth column is price)
}

// ***************** declare dimensions of parameters
parameters { 
vector[K] mu ;                      // population means for random utility coeff.
cholesky_factor_corr[K] Lcorr;      // cholesky for correlation in random utility coeff. 
vector<lower=0>[K] sigma;           // scales (std. devs.) for random utility coeff.
matrix<lower=-4,upper=4>[K,I] zeta; // random part of individual utility coeff.
}

// **************** priors (numbers are odd at the minute)
model {
vector[J] utilities; 
matrix[K,I] beta ;                  // individual utility coeff. (Kth is scale)
mu[1:K-1] ~ normal(0,100) ;         // priors for population means of random attribute coef.
mu[K] ~ normal(0.1,2);              // prior for population mean of random log price coef.
Lcorr ~ lkj_corr_cholesky(1);       // prior for cholesky-correlation across all random coef.
sigma ~ cauchy(0,10);               // priors for scales
// ****** loop over respondents: each loop gets rnd utility coef. and Kth is scale
for ( i in 1:I ) { for ( k in 1:K ) { zeta[k,i] ~ normal(0,1); }
beta[1:K,i] = mu + diag_pre_multiply(sigma, Lcorr) * zeta[1:K,i] ;
}
// ****** loop over choices: each loop computes utility and choice probability
for (n in 1:N) { 
// WTP-space utility: scale (exp of Kth coef.) times (attribute WTPs plus price)
utilities = x[((n-1)*J+1):(n*J),1:K-1] * beta[1:K-1,id[n]]; 
utilities = (utilities + x[((n-1)*J+1):(n*J),K]) * exp(beta[K,id[n]]);
y[n] ~ categorical_logit(utilities);
}
}

// get correlation, var-covar matrices, and compute individual-specific wtps
generated quantities {
matrix[K-1,I] beta ;  // individual utility coeff. (wtps for attributes in this case)
corr_matrix[K] Omega;
cov_matrix[K] Sigma;
Omega = multiply_lower_tri_self_transpose(Lcorr);  // correlation matrix
Sigma = quad_form_diag(Omega,sigma);               // var-covar matrix
for ( i in 1:I ) { 
vector[K] b = mu + diag_pre_multiply(sigma, Lcorr) * zeta[1:K,i] ;
// remember to scale back by 10
beta[1:K-1,i] = b[1:K-1] * 10 ;
}
}
"

The Lcorr ~ lkj_corr_cholesky(1) is just a uniform distribution, so it’s not doing anything.

We don’t generally recommend the super-wide priors you’re providing unless you expect mu and sigma to be on the scales defined. We strongly recommend against hard interval constraints, like the one you put on zeta: they can have a hugely biasing effect on the posterior that most of our users don’t expect.
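For instance, something along these lines would be more in keeping with what we usually suggest (the particular values are just illustrative and would need to match the scale of your data; this assumes zeta is re-declared without the interval bounds):

mu ~ normal(0, 5);               // weakly informative, assuming roughly unit-scale predictors
sigma ~ normal(0, 2);            // half-normal given the <lower=0> declaration
Lcorr ~ lkj_corr_cholesky(4);    // mildly favors smaller correlations
to_vector(zeta) ~ normal(0, 1);  // zeta declared as matrix[K,I] zeta; with no bounds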

It’d help to indent this:

for ( i in 1:I ) { for ( k in 1:K ) { zeta[k,i] ~ normal(0,1); }
beta[1:K,i] = mu + diag_pre_multiply(sigma, Lcorr) * zeta[1:K,i] ;

to the more scannable

for ( i in 1:I ) {
  for ( k in 1:K )
    zeta[k,i] ~ normal(0,1);
  beta[1:K,i] = mu + diag_pre_multiply(sigma, Lcorr) * zeta[1:K,i] ;
}

then it’s clearer you can replace it with this:

matrix[K, K] Lcov = diag_pre_multiply(sigma, Lcorr);
to_vector(zeta) ~ normal(0, 1);
for (i in 1:I)
  beta[1:K, i] = mu + Lcov * zeta[1:K, i];

This would be more efficient still if you didn’t need to tear zeta apart the way you do, which you can avoid by representing it as a transposed array of vectors, so that the last product becomes Lcov * zeta_t[i]. That’d defeat the simple vectorization of zeta’s prior, but it’s probably worth it; either way, it’s a very minor effect compared to computing Lcov only once, which should be a big win.
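Concretely, that restructuring might look like the following sketch (zeta_t is a hypothetical re-declaration of zeta as a transposed array of vectors; beta is declared elsewhere as before):

parameters {
  vector[K] zeta_t[I];   // one vector of standard normals per respondent
}
model {
  matrix[K, K] Lcov = diag_pre_multiply(sigma, Lcorr);  // computed once, not per respondent
  for (i in 1:I) {
    zeta_t[i] ~ normal(0, 1);                           // vectorized within each respondent
    beta[1:K, i] = mu + Lcov * zeta_t[i];
  }
}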

I didn’t understand the scaling by 10 in the generated quantities definition of beta. You might as well make beta a transformed parameter, since you define it on each leapfrog step anyway.


This might be confusing to some. It is a uniform distribution on the correlation matrix Lcorr * transpose(Lcorr) but it is non-uniform on Lcorr and it has to be included in the model block unlike foo ~ uniform(0, 1) and so forth.


Thanks so much for the correction here. I forgot that we’re working with the Cholesky factor.


One more question, please (for Bob, I guess?):

how would I go about implementing a Bernoulli mixture of utility parameters, in which some subjects have zero coefficient values (no weight on the utility) and others have random normal values? Would something like the following work?

utilities = x[((n-1)*J+1):(n*J),1:K-1] * (1 ~ bernoulli(theta[1:K-1,id[n]]) ) * beta[1:K-1,id[n]] ;
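(I realize Stan can’t sample a discrete indicator directly, so presumably the mixture would have to be marginalized. Below is a tentative sketch, under the simplifying assumption of a single subject-level on/off indicator with an assumed inclusion probability theta; with independent per-attribute indicators one would instead have to sum over all on/off patterns.)

{
  vector[I] lp_on  = rep_vector(0, I);  // accumulated log lik. if coefficients are "on"
  vector[I] lp_off = rep_vector(0, I);  // accumulated log lik. if attribute coefficients are zero
  for (n in 1:N) {
    vector[J] u_on  = (x[((n-1)*J+1):(n*J),1:K-1] * beta[1:K-1,id[n]]
                       + x[((n-1)*J+1):(n*J),K]) * exp(beta[K,id[n]]);
    vector[J] u_off = x[((n-1)*J+1):(n*J),K] * exp(beta[K,id[n]]);
    lp_on[id[n]]  += categorical_logit_lpmf(y[n] | u_on);
    lp_off[id[n]] += categorical_logit_lpmf(y[n] | u_off);
  }
  for (i in 1:I)
    target += log_mix(theta, lp_on[i], lp_off[i]);  // theta = assumed inclusion probability
}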