Help writing latent class multinomial logit model (for conjoint data)

Hello!

I am trying to write (and use) a model to estimate a latent class multinomial logit model for data obtained through conjoint experiments.

No existing models work quite the way that I want them to. At the end of the day I would like:

  • A set of utilities for each latent class
  • Membership probabilities for each individual for each class
  • [Optionally] Coefficients on each individual-level variable that impact membership probabilities

I have attempted to write this model, which is included below. It certainly does not represent best practices; it doesn't include proper priors on most of the parameters, for example.

The primary issue with this model is that NUTS sampling (calling rstan::sampling) takes prohibitively long: I waited about an hour and fewer than 200 samples had been drawn on my dataset. I am able to get the model to "work" using optimizing.

What I am looking for is help with the following:

  • Editing the model specification so that it samples efficiently
  • Placing priors on the parameters so that the model (a) samples efficiently and (b) gives more reasonable results. If it is possible to promote sparsity in the class memberships, that would be a bonus.

Below the code, I’ve included a very simple R script to generate some dummy data to feed into the model.

Citation: I adapted most of the code below from Elea Feit’s hierarchical multinomial logit model — repository here


// latent class mnl model conjoint experiments

data {
  int<lower=2> C; // number of alternatives (choices) in each scenario
  int<lower=1> A; // number of attributes for each alternative
  int<lower=1> R; // number of respondents
  int<lower=2> K; // number of latent classes
  int<lower=1> S; // number of choice tasks per respondent
  int<lower=0> G; // number of respondent covariates 
  int<lower=1,upper=C> Y[R, S]; // observed choices
  matrix[C, A] X[R, S]; // matrix of attributes for each choice task (design matrix)
  matrix[G, R] Z; // matrix of covariates (one column per respondent)
}

parameters {
  matrix[A, K] Beta; // class level utilities
  matrix[K-1, R] Eta; // class membership values
  matrix[K-1, G] Delta; // coefficients linking respondent covariates to class membership
  real<lower=0> sigma_class; // sd of the noise around the class membership values
}

model {

  matrix[K, R] memb_probs; // record the probability of each class for each respondent
  
  // for each person in the database, we
  for(r in 1:R) { 
    
    // compute their class membership probability
    // we introduce an intermediate value "eta", which is unconstrained,
    // and then apply the softmax formula to get the class memberships.
    // only K-1 etas are free because the probabilities must sum to 1,
    // so we fix the first class's eta to 0 and let the others vary
    for(k in 1:(K-1)) {
      Eta[k, r] ~ normal(dot_product(Z[, r], Delta[k, ]), sigma_class);
    }
    
    // the first class is the reference: its eta is fixed at 0, so exp(eta) = 1
    memb_probs[1, r] = 1/(1 + sum(exp(Eta[, r])));

    // the remaining classes have unconstrained etas,
    // and we apply the softmax to get their membership probs
    for(k in 1:(K-1)) {
      memb_probs[k+1, r] = exp(Eta[k, r]) / (1 + sum(exp(Eta[, r])));
    }
    
    // for each class
    for(k in 1:K) {
      // for each potential outcome
      // we compute the likelihood of making the choice that they made
      // weighted by the probability of being in that class
      for(s in 1:S){
        target += memb_probs[k, r] * categorical_logit_lpmf(Y[r,s] | X[r,s] * Beta[,k]);
      }
    }
    
  }
}

And an R script to generate some dummy data to feed into the Stan model. NB: on random data, optimizing fails very early, presumably because there is nothing to optimize. With real data (which, unfortunately, I cannot share) it does optimize, but not very well.

library(MASS)
library(tidyverse)
library(rstan)

generate_lc_mnl_data <- function(
  R = 100, # number of respondents
  S = 12, # number of tasks/respondent
  C = 3, # number of concepts/task
  A = 30, # number of choice covariates (nattrs * nlevels)
  K = 4, # number of classes
  G = 2 # number of respondent covariates 
) {
  Y <- array(sample(1:C, R*S, replace = TRUE), dim=c(R, S))
  X <- array(rnorm(R*S*C*A), dim=c(R, S, C, A)) 
  Z <- array(rnorm(G*R), dim=c(G, R))
  
  return(list(C=C, A=A, R=R, K=K, S=S, G=G, Y=Y, X=X, Z=Z))
}

standata <- generate_lc_mnl_data()
stanmodel <- stan_model("inst/models/lc-mnl.stan")
modelfit <- sampling(
  stanmodel, 
  data = standata, 
  verbose = TRUE
)

At a glance, two things jumped out at me:

  • 1:
// the first class is the reference: its eta is fixed at 0, so exp(eta) = 1
memb_probs[1, r] = 1/(1 + sum(exp(Eta[, r])));

// the remaining classes have unconstrained etas,
// and we apply the softmax to get their membership probs
for(k in 1:(K-1)) {
  memb_probs[k+1, r] = exp(Eta[k, r]) / (1 + sum(exp(Eta[, r])));
}

It would be more efficient to use one of the built-in functions in Stan for these kinds of expressions. At first glance exp(.)/(1 + sum(exp(.))) doesn't quite match softmax(.) in the Stan functions reference, but it is exactly softmax applied to Eta[, r] with a zero prepended for the reference class.
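A sketch of what that replacement might look like inside the respondent loop, reusing your variable names:

// same membership probabilities via the built-in softmax,
// prepending a fixed 0 for the reference class
memb_probs[, r] = softmax(append_row(0, Eta[, r]));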

  • 2:
target += memb_probs[k, r] * categorical_logit_lpmf(Y[r,s] | X[r,s] * Beta[,k]);

target and categorical_logit_lpmf(.) are on the log probability scale, but here you’re multiplying it by memb_probs[k, r], which is on the probability scale. I think the correct implementation might be this:

target += log(memb_probs[k, r]) + categorical_logit_lpmf(Y[r,s] | X[r,s] * Beta[,k]);
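One more thought: because class membership is latent, I think the per-class terms for a respondent should then be combined with log_sum_exp rather than each added to target separately (adding them all would multiply the class likelihoods together instead of summing them). A sketch of the fully marginalized version, reusing your variable names:

// marginalize over the latent class with log_sum_exp,
// pooling all S tasks for each respondent
// (assumes memb_probs has already been filled in)
for (r in 1:R) {
  vector[K] lp = log(memb_probs[, r]);
  for (k in 1:K)
    for (s in 1:S)
      lp[k] += categorical_logit_lpmf(Y[r, s] | X[r, s] * Beta[, k]);
  target += log_sum_exp(lp);
}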

Thank you @ScottAlder !!

  1. The log scale is a really good catch. I didn't think of that, and it looks right to me
  2. I tried replacing the line with softmax (using append_row to fix the first entry of its argument at 0), and if anything it seemed to increase the sampling time

Unfortunately, this will require some more work to get it optimized…


Hi tvladeck,

Why should Eta be normally distributed?

Best,
Daniel

@daniel_guhl

There is no specific reason it has to be normally distributed; it could be anything with support over the reals. Normal seems sensible since, a priori, it weights respondents roughly equally across the latent classes.

Ok, I see, but typically these latent class models would not have an "extra" random term. Why not drop Eta and place a prior on Delta instead? Delta = 0 implies that all classes are equally likely.
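Concretely, a sketch of the change I mean, reusing your variable names (drop Eta as a parameter and compute the membership logits directly from Z):

// in the model block, instead of sampling Eta:
to_vector(Delta) ~ normal(0, 1); // Delta = 0 => all classes equally likely
for (r in 1:R)
  memb_probs[, r] = softmax(append_row(0, Delta * Z[, r]));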

@daniel_guhl the idea is that Delta are individual-level predictors of class membership, not the class membership probabilities themselves. Would you suggest that I regress the membership probabilities on Delta through a multinomial regression (instead of “going through” Eta)?

I thought Delta are the parameters that link the individual-specific variables Z to the membership probs, no?
Just want to be sure that we are talking about the same model (link below), "only" estimated using full Bayes instead of ML.

I would be interested in Stan code for such a model but I’m busy right now. Might have time to look into it in 1 - 2 weeks.

Best,
Daniel


@daniel_guhl

I’m sorry. You’re exactly correct. Z are the predictors and Delta are the parameters. I mis-wrote.

Thank you for this link. This description:

The model identifies a number of unobserved segments, estimates the conjoint model within each segment, and at the same time estimates the association of segment membership with concomitant (consumer descriptor) variables

is precisely, to a T, the model I am trying to build.

Let me know when you have time to look into it and if you want to collaborate on it.

One other note: my Stan code does not yet solve the problem of identifiability between chains. Still have to do that…

Sounds good! But note that the identifiability problem is not just between chains but also within a chain: label switching.

Hey @daniel_guhl, just checking in to see if you want to work further on this! Happy to take this offline if so (or to keep working here!)

Hi @tvladeck and @daniel_guhl, have you managed to implement the model above (described in the Concomitant… article) efficiently? And @tvladeck, you mentioned that you got some speed in model fitting by "optimizing". Can you explain what exactly you optimized, and how?

Hi — the model is implemented and available here: https://github.com/tvladeck/lcmnl

"Optimizing" is rstan's function for maximum likelihood (or maximum a posteriori) estimation, as opposed to MCMC sampling.


@tvladeck, thanks! I will try it out in the coming days and get back to you with comments (and/or praise :))!

@tvladeck, I'm a bit confused by the Z matrix (matrix[I*S, L] Z) in the LC code (example.R, lcmnl.stan). Why is it initialized as as.matrix(data[task == 1 & alt == 1,.(asc = 1, female, age_26)]) %x% matrix(c(1, 0), 2, 1)? The second matrix (1, 0) isn't clear to me.
When you multiply the gamma parameters with Z created this way, you get something in the first row and nothing (zeros) in the second row, and these should be, AFAIK, the probabilities of an individual belonging to the 1st/2nd segment.
Btw, is there a way to communicate privately on this forum? I couldn't find it.

Hi @deekay if you click on my name/avatar there should be a button that says “message”.

So the first matrix is just filtering down to one row per respondent, which holds the individual-level covariates. Then we expand that matrix to the number of classes with a Kronecker product: the (1, 0) column stacks each respondent's covariate row once for the first class and a row of zeros for the second, which serves as the reference class.

I see no "message" option, but I found your e-mail, so if you want we can continue this discussion via e-mail; just say so.
I think that fixing one probability (of belonging to a class) as a reference should not be done here. Looking further, I see that you've made the gammas independent of the class. This is not as described in the model, where the gammas are indexed by both the Z covariates and the classes (gamma_l_s); these are applied to the Z covariates, followed by a softmax, for each individual to get the thetas (theta_i_s).
I'll rewrite the script and get back to you.

You are also assuming no correlation between betas (since you’re generating only a vector of normals, not a multivariate-normal vector)?

I think that fixing one probability (of belonging to a class) as a reference should not be done here.

That would be bad, but I don't think that's what's happening: it's just fixing one of the values that gets passed into the softmax function to zero, and the resulting probability for that class can still be anything:

> softmax_logit = function(x) {
+   exp(x) / sum(exp(x))
+ }
> softmax_logit(c(1, 2, 3))
[1] 0.09003057 0.24472847 0.66524096
> softmax_logit(c(2, 3, 4))
[1] 0.09003057 0.24472847 0.66524096
> softmax_logit(c(0, 1, 2))
[1] 0.09003057 0.24472847 0.66524096

You are also assuming no correlation between betas (since you’re generating only a vector of normals, not a multivariate-normal vector)?

Yeah, but modeling correlation would be a fairly straightforward addition.
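If we do add it, a sketch of one standard way, assuming we want the coefficients correlated across the A attributes via an LKJ prior on a Cholesky factor (the prior scales here are placeholders):

parameters {
  cholesky_factor_corr[A] L_Omega; // correlation structure across attributes
  vector<lower=0>[A] tau;          // per-attribute scales
  matrix[A, K] Beta_raw;           // standardized coefficients
}
transformed parameters {
  // non-centered parameterization: each class's Beta column
  // gets the same covariance across attributes
  matrix[A, K] Beta = diag_pre_multiply(tau, L_Omega) * Beta_raw;
}
model {
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ normal(0, 2.5);
  to_vector(Beta_raw) ~ normal(0, 1);
}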

True, that would be ok. But why are the gammas fixed across segments? The current script and example work because there are only 2 segments: since one is the reference (as now explained), that leaves gamma coefficients for only one segment. But if there were, e.g., 3 segments, then doing it this way would force the two non-reference segments to have the same probabilities (within an individual, or up to the individual covariates, however you want to look at it). So you actually need matrix[S-1, L] gamma to allow different probabilities for different segments.
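In Stan terms, something like this hypothetical sketch (S segments, L covariates, first segment as the reference):

functions {
  // membership probabilities for one individual with covariates z (length L),
  // given per-segment coefficients gamma (one row per non-reference segment)
  vector segment_probs(matrix gamma, vector z) {
    return softmax(append_row(0, gamma * z));
  }
}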

Adding a covariance matrix to capture the correlations is not too hard, but in my experience it slows down the sampling quite a lot. What is your experience with the speed of this script? Did you run any benchmarks?