Mixed Logit Model


Hi Tom,

cool, I guess there are many ways to implement this model. However, an outside option might also have other purposes, such as being able to model purchase/choice incidence effects (i.e., entering a market or not).

Quick question: Why would you set the utility to 0 for alternatives that were not available?
I assume you want to set it to a large negative value so the corresponding probability approaches 0.
Maybe something like utilities[n] = utilities[n] - 1000 * Z[n]; where Z[n] is an indicator for unavailable alternatives? To be honest, I guess there are more elegant ways to do this.
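For illustration, a minimal NumPy sketch of that masking idea (the function name and the penalty value of 1000 are just placeholders):

```python
import numpy as np

def masked_softmax(utilities, unavailable, penalty=1000.0):
    # Z-style 0/1 indicator: subtract a large penalty from unavailable
    # alternatives so their choice probabilities become numerically 0.
    u = np.asarray(utilities, dtype=float) - penalty * np.asarray(unavailable, dtype=float)
    u = u - u.max()          # shift by the max for numerical stability
    e = np.exp(u)
    return e / e.sum()

p = masked_softmax([1.0, 2.0, 0.5], unavailable=[0, 0, 1])
# the third (unavailable) alternative gets essentially zero probability
```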



Hi Daniel,

You’re right of course.

The index is not about theoretical availability but about whether an option is viable for an individual. The idea is that you only choose products that are also viable to you (viable options are a subset of available options). I hadn’t thought about defining utilities in that way. I was looking for an easy way to implement this and came up with the idea of using a vector Z which sets the utilities of unavailable options to zero (which only matters for the denominator in the softmax). Using Z[n] as a complement hadn’t crossed my mind yet. I ran a quick test and the results are almost identical.



This idea (massively negative utilities for unavailable products) is a touch inefficient.

You can easily include varying choice dimensions in the code I included, by having start and end indices for each choice set and dynamically sizing the utilities vector with respect to the choice set.
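The start/end indexing can be sketched in NumPy like this (the ragged layout and variable names are illustrative, not taken from the original code):

```python
import numpy as np

# All alternatives stacked into one long utilities vector; each choice task
# owns the slice [start[t], end[t]).  Task 1 has 3 options, task 2 has 2.
utilities = np.array([0.2, 1.1, -0.3, 0.5, 0.9])
start = [0, 3]
end = [3, 5]

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

# probabilities are normalized only over each task's own choice set
probs = [softmax(utilities[s:e]) for s, e in zip(start, end)]
```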


Just came across a really cool paper by Alex Peysakhovich and John Ugander https://dl.acm.org/citation.cfm?doid=3106723.3106731 that attempts to resolve the problem you mention (having individual preferences vary over time).

They do it using neural networks to learn a representation of the relevant feature matrix and “context”. But neural networks are just function approximators, so you could do something similar by saying

beta_it = beta_i*exp(Delta * context_matrix_it)

with beta_i ~ multi_normal(beta, Sigma)

You’d center the context matrix around 0 and put sparsity-inducing priors on Delta, so that the model reduces to standard mixed logit if context doesn’t matter. This would retain interpretability of how context affects preferences for various attributes.
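A toy NumPy version of that parameterization (the dimensions and names are made up; in practice beta_i would be a posterior draw):

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 3, 2                      # number of attributes, context dimensions

beta_i = rng.normal(size=K)      # individual coefficients, beta_i ~ MVN(beta, Sigma)
Delta = np.zeros((K, C))         # sparsity priors would shrink Delta toward 0
context_it = rng.normal(size=C)  # centered context covariates for person i, time t

# beta_it = beta_i * exp(Delta @ context_it); with Delta = 0 this collapses
# back to plain mixed logit, i.e. beta_it equals beta_i
beta_it = beta_i * np.exp(Delta @ context_it)
```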

Hope this helps,


Yes, handling varying choice sets using indexing is the better solution. I guess the “large negative utility” hack comes from a time when it was NOT easy to implement a specific model, and only software packages with limited capabilities were available.

I guess my comment was (at least partially) a conceptual one. If you want low probabilities, the corresponding utilities must be very “low” (i.e., large negative value). The consequence of setting the utility to 0 is difficult to predict because it also depends on the utilities of the other (available) alternatives. It seemed it was ok(-ish) for Tom to have utilities of 0?!?

BTW: Thank you, Jim, for your contributions regarding discrete choice models and Stan. I appreciate that!



Hi James,

Thanks for the wonderful example. You helped me figure out a solution for solving a choice problem where the number of choices varies by person and by situation (using your start and end index vectors, nice!). I had a few questions related to this:

(1) you put rows of zeroes in your X matrix for the “outside choice”. I’m a little unclear about this. Can you explain how the zeros in X normalize the person level parameters? (if that is what is going on). Along those same lines, if you didn’t have X2, would you not need those zeros?

(2) How would you suggest constructing the pointwise log likelihood vector in generated quantities to do WAIC/LOO calculations outside of Stan? I’m having trouble translating your
target += log_prob' * choice;
into
log_lik = categorical_logit_lpmf(choice | log_prob);

I guess I could include an integer version of choice as separate data, but that seems like a hack.

thanks for help,


(1) you put rows of zeroes in your X matrix for the “outside choice”. I’m a little unclear about this. Can you explain how the zeros in X normalize the person level parameters? (if that is what is going on). Along those same lines, if you didn’t have X2, would you not need those zeros?

  • Yep. Remember the decision rule is “make the choice that maximises utility”. If I add some number to all utilities across available choices, then the choice remains the same, but the utilities change. So you need to anchor the utilities to something. We do this with an “outside good”: typically, the decision to make none of the available choices. By custom, the utility of the outside good is 0. Recall that choice probabilities for person i, good j assuming an iid Gumbel idiosyncratic utility are
p_{ij} = \frac{\exp(u_{ij})}{\sum_{k=1}^{J+1} \exp(u_{ik})}

well if our outside good gives u_{iJ+1}= 0 then

p_{ij} = \frac{\exp(u_{ij})}{1 + \sum_{k=1}^{J} \exp(u_{ik})}

and all the parameters in our utility functions become interpretable as the impact of the choice attribute on the log odds of that choice relative to making no choice at all.
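As a numerical sanity check of that formula, here is a small NumPy sketch (the function name is my own):

```python
import numpy as np

def choice_probs_with_outside(u_inside):
    # Logit probabilities when the outside good's utility is anchored at 0:
    # p_j = exp(u_j) / (1 + sum_k exp(u_k)); the last entry returned is the
    # outside good's probability, 1 / (1 + sum_k exp(u_k)).
    e = np.exp(np.asarray(u_inside, dtype=float))
    denom = 1.0 + e.sum()
    return np.concatenate([e / denom, [1.0 / denom]])

p = choice_probs_with_outside([0.5, -0.2])
```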

(2) How would you suggest constructing the pointwise log likelihood vector in generated quantities to do WAIC/LOO calculations outside of Stan? I’m having trouble translating your
target += log_prob' * choice;
into
log_lik = categorical_logit_lpmf(choice | log_prob);

Remember that a categorical likelihood contribution for an observation across k choices where x_k=1 indicates the k-th choice is made is

L(x | p) = \prod_{k = 1}^{K} p_{k}^{x_{k}}

so the log likelihood contribution of an observation is

\log(L(x|p)) = \sum_{k=1}^{K} x_{k}\log(p_{k}) = \log(p_{k^\ast})

where k^\ast is the chosen alternative.

The notation I use to calculate the log likelihood of the full sample is just (choice for person 1) × (log probability of person 1’s choice) + ... + (choice for person N) × (log probability of person N’s choice), which is just the dot product of a binary vector of choices and the log probabilities of those choices.

The pointwise log likelihood, for loo, is just the log probability at the given choice.
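A small NumPy sketch of that equivalence (the numbers are made up):

```python
import numpy as np

log_prob = np.log(np.array([0.2, 0.5, 0.3]))  # log probabilities of the 3 alternatives
choice = np.array([0, 1, 0])                  # one-hot vector: alternative 2 was chosen

# full dot-product form, as in target += log_prob' * choice
ll_dot = float(choice @ log_prob)
# pointwise form: just the log probability at the chosen index
ll_pointwise = float(log_prob[choice.argmax()])
# the two agree, so storing log_prob and indexing it by the choice
# yields the pointwise log likelihood needed for WAIC/LOO
```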

Hope this helps!


I didn’t answer the second part of your first question. You need the outside good utility to be fixed whether you have individual demographics or not.


Your comments help a lot, thanks! Do you usually augment your data matrix with zeros to normalize utility? I’ve only seen the “treat the first choice as the base case” type, which I think is an alternative specification you suggest above (P-1 betas).

If I understand your comment about the pointwise log likelihood contribution, then I just need to store the log_prob (say in transformed parameters) and I should be good to go, right?

again, thanks so much for your insight.


You can take that approach, but it alters the interpretation of the coefficients, and is fairly non-standard.

The log_prob vector itself isn’t the log likelihood contribution, because it contains (log) probabilities of all choices, not just the choice that was made. You need to select the elements that correspond to the choices actually made.

Hope this helps


If you are modeling a system where people aren’t compelled to make a choice, then you should give them an outside option of choosing “none of the above”.


thanks again! yes, I pulled out the log_probs for the chosen alternatives and it looks good. I am modeling travel route choices, so the outside choice of “not traveling/traveling by another mode” is possible and meaningful. I’m augmenting my data to provide that option.


Hi James,

I have another question about your model (version 2)…this time about inference. How do you calculate marginal rates of substitution in that type of model? I’m particularly interested in the variation in substitution by individuals. If beta[1] represents the effect of cost, and beta[2] is a predictor of interest, can I sample from beta_individual and calculate ratios like…

beta_individual[n,2] / beta_individual[n,1]

for each individual n? I assume I need to worry about Gamma too, right?

Is there a standard way to do this using the entire posterior?



Yep, exactly. You just need to do it in the generated quantities block, which will give you the MRS for each draw. Or you could do it after fitting the model, in whatever environment you like.
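With simulated draws, the calculation looks like this (shapes and names are hypothetical; in practice beta_individual would come from the fitted posterior):

```python
import numpy as np

rng = np.random.default_rng(1)
S, N = 1000, 5                   # posterior draws, individuals

# pretend posterior draws: column 0 = cost coefficient, column 1 = attribute
beta_individual = rng.normal(loc=[-1.0, 0.5], scale=0.1, size=(S, N, 2))

# marginal rate of substitution per draw and per individual
mrs = beta_individual[:, :, 1] / beta_individual[:, :, 0]

# posterior summary of each individual's MRS (should be near 0.5 / -1.0 = -0.5)
mrs_mean = mrs.mean(axis=0)
```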



I’m revisiting this thread to briefly describe a data structure that a colleague is directing my way. My impression is that a multinomial choice model would be appropriate, but this is largely uncharted territory for folks in my discipline.

The problem pertains to where fishermen choose to fish. There are about 20 fishermen, and there are roughly 10 locations where they might fish. At any given time, the expected productivity at each site is unfortunately unknown (my colleague has no way of knowing how much fish could be harvested at an unexploited location). That said, fishermen who arrive first at the location are able to secure the best spots for fishing, so a variable for the number of others already present at time t is hypothesized to predict fisherman j’s willingness to fish there. The research question focuses in part on whether j will choose to fish at location k or switch to an alternative location (or stop fishing altogether).

Given the repeated observations of fishermen, my inclination is to use a choice model with random effects for the individual fishermen. Possible predictors include site-level variables (number of others present) and then my colleague hopes to operationalize a variable for the productivity that fisherman j has experienced at his present location in the previous time period, t-1. (How much they caught previously is a known variable.) The distance of site k from their present location is also expected to be an informative predictor.

Are there choice models that would be a good fit for this data structure?


This seems more of a time allocation problem than a simple choice problem. How frequently can they change locations? What is the time cost of moving vs. time spent fishing? It also seems like there is a latent success rate by site that depends on arrival rank and is known to some extent by the fishermen from previous experience. That puts it into reinforcement learning or sequential design, where the fishermen run some kind of explore-and-exploit policy about where to fish. The multi-armed bandit literature covers some of this.


These are good questions, Bob, and to some extent I’m focusing on choice because it is a simple way to reduce a more complex, hierarchical decision-making process. There are few restrictions on the fishermen’s ability to choose new sites, which are concentrated in a relatively small geographic area (such that travel costs between them are not overwhelmingly high). These fishermen are highly knowledgeable about the area and must weigh the intrinsic productivity of a fishing site against rules about sequential exploitation, which potentially lead them to consider somewhat inferior sites if they have first dibs there. And then they must also consider how long to stay at a particular site.

I am familiarizing myself with the multi-armed bandit literature for the first time. If there are well-documented statistical modeling approaches that would be ideal for this particular problem, I’d appreciate being pointed in that direction.


I don’t know much about the bandit literature or continuous processes myself—certainly not enough to even suggest a non-ideal model.

I suspect there’s also literature on this in ecology, where the problem of where animals eat seems similar. Animal movement HMMs quantize movement into hourly measurements and condition movement on state (foraging, sleeping, transiting, diving shallow, diving deep, etc.) and on predictors (distance to nearest water supply, gradient of terrain, type of terrain, etc.).

Could you start just by modeling something like where a fisherman first chooses to fish, conditioned on the decision to start fishing at that time? After that, can the time series be meaningfully quantized in terms of whether the fisherman will stay for the next hour, choose a different site, or go home, like the animal movement models I mentioned above? Is the decision to go home also a matter of how long the fisherman’s been out already? I imagine trawlers go out for limited numbers of days and individuals with poles for limited numbers of hours.


Thanks, Bob. You’re right that there’s a long tradition among ecologists of testing animals’ choices about where to forage. A seminal theoretical paper on that topic is the marginal value theorem: https://ac.els-cdn.com/004058097690040X/1-s2.0-004058097690040X-main.pdf?_tid=abd0d2c5-1820-4a59-8aa6-58e20dfbb49b&acdnat=1528203711_945b8d348e6562874f22cfae7e46e121

Previously, I hadn’t made the connection that HMMs could be applied to this problem, but your intuition is correct that some ecologists have made progress in this area: http://rsfs.royalsocietypublishing.org/content/early/2012/01/24/rsfs.2011.0077.short

I’ll dive into that literature a little to see if I can find models with data structures similar to mine. Thanks for the suggestion.