Help writing latent class multinomial logit model (for conjoint data)

performance
fitting-issues
specification

#1

Hello!

I am trying to write (and use) a model to estimate a latent class multinomial logit model for data obtained through conjoint experiments.

No existing models work quite the way that I want them to. At the end of the day I would like:

  • A set of utilities for each latent class
  • Membership probabilities for each individual for each class
  • [Optionally] Coefficients on each individual-level variable that impact membership probabilities

I have attempted to write this model, which is included below. It certainly does not represent best practices; it doesn't include proper priors on most of the parameters, for example.

The primary issue with this model is that NUTS sampling (calling rstan::sampling) takes prohibitively long: with my dataset I waited about an hour and obtained fewer than 200 draws. I am able to get the model to "work" using optimizing.

What I am looking for is help with the following:

  • Helping to edit the model specification so that it can sample efficiently
  • Helping to place priors on the parameters so that it can (a) sample efficiently and (b) give more reasonable results. If it is possible to promote sparsity on the class membership, that would be a bonus. (A rough prior sketch follows this list.)
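To be concrete about the priors, something like the following is what I have in mind as a starting point; these are guesses on my part, not tested choices:

// guesses, not recommendations: weakly-informative priors
to_vector(Beta) ~ normal(0, 5);   // class-level utilities
to_vector(Delta) ~ normal(0, 1);  // membership-covariate coefficients
sigma_class ~ normal(0, 1);       // half-normal, given the <lower=0> constraint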

Below the code, I’ve included a very simple R script to generate some dummy data to feed into the model.

Citation: I adapted most of the code below from Elea Feit's hierarchical multinomial logit model (repository here).


// latent class MNL model for conjoint experiments

data {
  int<lower=2> C; // number of alternatives (choices) in each scenario
  int<lower=1> A; // number of attributes for each alternative
  int<lower=1> R; // number of respondents
  int<lower=2> K; // number of latent classes
  int<lower=1> S; // number of choice tasks per respondent
  int<lower=0> G; // number of respondent covariates 
  int<lower=1,upper=C> Y[R, S]; // observed choices
  matrix[C, A] X[R, S]; // matrix of attributes for each choice task (design matrix)
  matrix[G, R] Z; // matrix of respondent covariates (one column per respondent)
}

parameters {
  matrix[A, K] Beta; // class-level utilities
  matrix[K-1, R] Eta; // class-membership scores (class 1 is the reference)
  matrix[K-1, G] Delta; // coefficients linking covariates Z to class membership
  real<lower=0> sigma_class;
}

model {

  matrix[K, R] memb_probs; // record the probability of each class for each respondent
  
  // for each respondent in the data
  for(r in 1:R) { 
    
    // compute their class membership probabilities
    // Eta is an unconstrained intermediate value that we push through
    // the softmax formula to get the membership probabilities
    // it has K-1 rows because the probabilities must sum to 1:
    // the first class is the reference, with its eta fixed at 0
    for(k in 1:(K-1)) {
      Eta[k, r] ~ normal(dot_product(Z[, r], Delta[k, ]), sigma_class);
    }
    
    // the first class is the reference: its eta is fixed at 0, so exp(eta) = 1
    memb_probs[1, r] = 1/(1 + sum(exp(Eta[, r])));

    // the remaining classes have free etas;
    // apply the softmax formula to get their membership probabilities
    for(k in 1:(K-1)) {
      memb_probs[k+1, r] = exp(Eta[k, r]) / (1 + sum(exp(Eta[, r])));
    }
    
    // for each class and each choice task, we compute the likelihood of
    // the choice that they made, weighted by the probability of being
    // in that class
    for(k in 1:K) {
      for(s in 1:S){
        target += memb_probs[k, r] * categorical_logit_lpmf(Y[r,s] | X[r,s] * Beta[,k]);
      }
    }
    
  }
}
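As an aside, to extract the per-respondent membership probabilities (the second item on my wishlist above), I assume I could recompute them in generated quantities; an untested sketch:

generated quantities {
  matrix[K, R] memb_probs_out; // membership probabilities per respondent, per draw
  for (r in 1:R)
    memb_probs_out[, r] = softmax(append_row(0, Eta[, r]));
}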

And here is an R script to generate some dummy data to feed into the Stan model. NB: on purely random data, optimizing fails very early, presumably because there is no structure to recover. With real data (which, unfortunately, I cannot share) it does optimize, but not very well.

library(MASS)
library(tidyverse)
library(rstan)

generate_lc_mnl_data <- function(
  R = 100, # number of respondents
  S = 12, # number of tasks/respondent
  C = 3, # number of concepts/task
  A = 30, # number of choice covariates (nattrs * nlevels)
  K = 4, # number of classes
  G = 2 # number of respondent covariates
) {
  Y <- array(sample(1:C, R*S, replace = TRUE), dim=c(R, S))
  X <- array(rnorm(R*S*C*A), dim=c(R, S, C, A)) 
  Z <- array(rnorm(G*R), dim=c(G, R))
  
  return(list(C=C, A=A, R=R, K=K, S=S, G=G, Y=Y, X=X, Z=Z))
}
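If it helps, here is a rough sketch of a simulator with actual latent-class structure baked in, so that optimizing/sampling has something to recover (simulate_lc_mnl and its defaults are mine and purely illustrative):

# sketch: simulate choices from a true latent-class MNL process
simulate_lc_mnl <- function(R = 100, S = 12, C = 3, A = 10, K = 4, G = 2) {
  Beta  <- matrix(rnorm(A * K, sd = 2), A, K)    # true class-level utilities
  Delta <- matrix(rnorm((K - 1) * G), K - 1, G)  # true membership coefficients
  Z     <- matrix(rnorm(G * R), G, R)            # respondent covariates
  eta   <- rbind(0, Delta %*% Z)                 # K x R; class 1 is the reference
  probs <- apply(eta, 2, function(e) exp(e) / sum(exp(e)))
  class <- apply(probs, 2, function(p) sample(1:K, 1, prob = p))
  X <- array(rnorm(R * S * C * A), dim = c(R, S, C, A))
  Y <- array(0L, dim = c(R, S))
  for (r in 1:R) for (s in 1:S) {
    u <- X[r, s, , ] %*% Beta[, class[r]]        # utilities of the C alternatives
    Y[r, s] <- sample(1:C, 1, prob = exp(u) / sum(exp(u)))
  }
  list(C = C, A = A, R = R, K = K, S = S, G = G, Y = Y, X = X, Z = Z)
}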

standata <- generate_lc_mnl_data()
stanmodel <- stan_model("inst/models/lc-mnl.stan")
modelfit <- sampling(
  stanmodel, 
  data = standata, 
  verbose = TRUE
)


#2

At a glance, two things jumped out at me:

  • 1:
// the first class is the reference: its eta is fixed at 0, so exp(eta) = 1
memb_probs[1, r] = 1/(1 + sum(exp(Eta[, r])));

// the remaining classes have free etas;
// apply the softmax formula to get their membership probabilities
for(k in 1:(K-1)) {
  memb_probs[k+1, r] = exp(Eta[k, r]) / (1 + sum(exp(Eta[, r])));
}

It would be more efficient to use one of the built-in functions in Stan for these kinds of expressions if possible, although exp(.)/(1 + sum(exp(.))) as written doesn't quite match softmax(.) in the Stan functions reference.
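They do line up if you append a zero for the reference class first; something like

memb_probs[, r] = softmax(append_row(0, Eta[, r]));

should replace both of the expressions above.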

  • 2:
target += memb_probs[k, r] * categorical_logit_lpmf(Y[r,s] | X[r,s] * Beta[,k]);

target and categorical_logit_lpmf(.) are on the log-probability scale, but here you're multiplying the log density by memb_probs[k, r], which is on the probability scale. I think the correct implementation might be this:

target += log(memb_probs[k, r]) + categorical_logit_lpmf(Y[r,s] | X[r,s] * Beta[,k]);
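And for the classical mixture, the Stan user's guide marginalizes over the class with log_sum_exp, combining each respondent's tasks within a class before mixing. A sketch of that pattern using the names from your model (untested):

for (r in 1:R) {
  vector[K] lps = log(memb_probs[, r]); // log membership probability per class
  for (k in 1:K)
    for (s in 1:S)
      lps[k] += categorical_logit_lpmf(Y[r, s] | X[r, s] * Beta[, k]);
  target += log_sum_exp(lps); // marginalize over class membership
}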

#3

Thank you @ScottAlder !!

  1. The log scale is a really good catch. I didn't think of that, and it looks right to me
  2. I tried replacing those expressions with softmax (with an append_row to fix the first entry of its argument at 0), and if anything it seemed to increase the sampling time

Unfortunately, this will require some more work to get it optimized…


#4

Hi tvladeck,

Why should Eta be normally distributed?

Best,
Daniel


#5

@daniel_guhl

There is no specific reason it has to be normally distributed; it could be anything with support over the reals. A normal seems to make sense, since under that prior each respondent is weighted essentially equally across the latent classes.


#6

OK, I see, but typically these latent class models do not have an "extra" random term. Why not drop Eta and just use a prior on Delta? Delta = 0 implies that all classes are equally likely. (A sketch of what I mean is below.)
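Roughly what I have in mind, assuming the rest of your model stays the same (an untested sketch):

transformed parameters {
  matrix[K, R] memb_probs;
  for (r in 1:R)
    memb_probs[, r] = softmax(append_row(0, Delta * Z[, r]));
}
model {
  to_vector(Delta) ~ normal(0, 1); // Delta = 0 => equal membership probabilities
  // ... likelihood as before, with Eta and sigma_class removed
}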


#7

@daniel_guhl the idea is that Delta are individual-level predictors of class membership, not the class membership probabilities themselves. Would you suggest that I regress the membership probabilities on Delta through a multinomial regression (instead of “going through” Eta)?


#8

I thought Delta are parameters that link individual-specific variables Z to membership probs, no?
Just want to be sure that we are talking about this model (link below), "only" estimated using full Bayes instead of ML.

I would be interested in Stan code for such a model, but I'm busy right now. I might have time to look into it in 1-2 weeks.

Best,
Daniel


#9

@daniel_guhl

I'm sorry; you're exactly correct. Z are the predictors and Delta are the parameters. I miswrote.

Thank you for this link. This description:

The model identifies a number of unobserved segments, estimates the conjoint model within each segment, and at the same time estimates the association of segment membership with concomitant (consumer descriptor) variables

is precisely, to a T, the model I am trying to build.

Let me know when you have time to look into it and if you want to collaborate on it.

The other thing, just as a note, is that my Stan code does not yet solve the problem of identifiability between chains. Still have to do that…


#10

Sounds good! But the identifiability problem is not just between chains; it also shows up within a chain => label switching. A common way to break that symmetry is sketched below.
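One device from the mixture-model literature is to impose an ordering constraint on one row of Beta, assuming an ordering on the first attribute's utilities across classes is substantively harmless. An untested sketch:

parameters {
  ordered[K] beta_attr1;    // utilities of attribute 1, ordered across classes
  matrix[A-1, K] Beta_rest; // remaining attributes, unconstrained
}
transformed parameters {
  matrix[A, K] Beta = append_row(beta_attr1', Beta_rest);
}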