Identification issues with Latent Class Analysis/Mixture model

Unfortunately that is another identifying assumption that substantially changes the meaning of the model. It assumes that the class probabilities for each item sum to 1. For example, if Class 1 has a 90% chance of answering item 1 correctly, Class 2 has a 10% chance (in a 2-class solution). As you might imagine, this does not usually follow from the theoretical models motivating LCA. Nor would flipping the two indices so that pi is a simplex within classes across items.

I get unimodal distributions when I drop the model-estimated prior for pi in favor of a preset one. So that remains my best guess why your specific example ends up multi-modal. I doubt that it fully solves the problem of identifying the model, though. (Note that the x-axis should actually read pi, i.e. the item probabilities in log-odds.)

model{
    // Priors
    to_vector(pi) ~ normal(0, 5);
    target += gamma_lpdf(lambda | alpha_lambda, beta_lambda);
    alpha_lambda ~ cauchy(1,2.5);
    beta_lambda ~ cauchy(1,2.5);
    
    target += log_mix(nu, Ys);
}

1 Like

@spinkney sadly the ordering of pi doesnt work theoretically as it changes the structure of the model.

Thanks @simonbrauer , I tried with the change in priors, and result in some runs working fine and other dont. Still sensitive to starting values between runs

Coming back at this, saw this other post in the forum Latent class previous post where thet adjust the DINA model code

And after testing found that this method to estimate latent classes from binary data works well for increasing numbers of classes. With a small change in the constraints between C = 2 and C > 2.

I am not clear why this approach works but using the bernoulli_lpmf distribution doesnt work. I was looking to use the Stan distributions because that way its easier to combine different types of items, as for example this code only works for binary items

Here is the code for C=2

data {
  int<lower=1> I;                                     // number of items
  int<lower=1> J;                                     // number of respondents
  matrix[J, I] y;                                     // score for obs n
  int<lower=1> C;           // # of attribute profiles (latent classes) 
}
parameters {
  simplex[C] nu;
  vector<lower=0>[C] alpha_dir;
  // average intercept and main effect
  real mean_intercept;
  real<lower=0> mean_maineffect;
  // item level deviations from average effects
  real dev_intercept[I];
  real<lower=-mean_maineffect> dev_maineffect[I];
}

transformed parameters {
  matrix[I,C] pi;    // Probability of correct response for each class on each item
  vector[C] ps[J];
  real log_items[I];
  
  real intercept[I];
  real maineffect[I];
  vector[I] master_pi;
  vector[I] nonmaster_pi;
  
  for (i in 1:I) {
    intercept[i] = mean_intercept + dev_intercept[i];
    maineffect[i] = mean_maineffect + dev_maineffect[i];
    
    nonmaster_pi[i] = inv_logit(intercept[i]);
    master_pi[i] = inv_logit(intercept[i] + maineffect[i]);
  }
  
    // Probability of correct response for each class on each item
  for (c in 1:C) {
    for (i in 1:I) {
      pi[i,c] = master_pi[i]^(c - 1) * nonmaster_pi[i]^(1 - (c - 1));
    }
  }
  
      // Define model
  for (j in 1:J) {
    for (c in 1:C) {
      for (i in 1:I) {
        log_items[i] = y[j,i] * log(pi[i,c]) + (1 - y[j,i]) * log(1 - pi[i,c]);
      }
      ps[j][c] = sum(log_items);
    }
      }
  
}
model{

  // Priors
  nu ~ dirichlet(alpha_dir);
  alpha_dir ~ gamma(1, 0.5);
   
  mean_intercept ~ normal(0, 5);
  mean_maineffect ~ normal(0, 5);
  
  dev_intercept ~ normal(0, 3);
  dev_maineffect ~ normal(0, 3);
  
  target += log_mix(nu, ps);

}

And when C > 2, the parameter block changes to

parameters {
  simplex[C] nu;
  vector<lower=0>[C] alpha_dir;
  // average intercept and main effect
  real mean_intercept;
  real mean_maineffect; 
  // item level deviations from average effects
  real dev_intercept[I];
  real dev_maineffect[I]; 
}

And here is the R code for runing it

library(poLCA)
library(loo)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

dat <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2016/02/lca1.dat",
                  header=F)

colnames(dat) <- c("id", paste0("item",1:9))
head(dat)

apply(dat[,2:10], 2, table)

dat2 <- dat[,2:10]
head(dat2)

wide_data <- list(
  I = ncol(dat2),
  J = nrow(dat2),
  y = dat2,
  C = 3
)
  
pars <- c("nu","pi")
iter <- 1000
warmup <- iter - 500
wide_model <- stan(model_code=lca_wide, 
                   data = wide_data,
                   pars=pars,
                   chains = 3, 
                   iter = iter,
                   warmup = warmup)
wide_model

## results with poLCA
ch2 <- poLCA(cbind(item1, item2, item3, item4, item5, item6, item7, item8, item9)~1,
             dat2+1,nclass=4, maxiter=20000)
ch2

Now I am trying to adjust that approach to include categorical indicators with more than 2 answers, and continuous items

2 Likes

Interesting! I’m surprised it works better than other methods too. Interested in hearing what you learn as you extend it.

I would be leery of this line for C > 2. It assumes an odd structure for how item probabilities are related across classes that I suspect will be too constraining. Take the C=3 example where P_1 = \text{master_pi}_1 and P_2 = \text{nonmaster_pi}_2. The item probability for class 1 would by P_1^{1-1} \times P_2^{1 - (1 - 1)} = P_2. The item probability for class 2 would by P_1^{2-1} \times P_2^{1 - (2 - 1)} = P_1. And the item probability for class 1 would by P_1^{3-1} \times P_2^{1 - (3 - 1)} = P_1^2 \times P_2^{-1} = \frac{P_1^2}{P_2}.

1 Like

Removing that restriction and coding the model seems to recover the poLCA parameters and with no issues detected. But it’s so much slower (takes about 14 mins on my laptop).

data {
  int<lower=1> I;                                     // number of items
  int<lower=1> J;                                     // number of respondents
  matrix[J, I] y;                                     // score for obs n
  int<lower=1> C;           // # of attribute profiles (latent classes) 
}
parameters {
  simplex[C] nu;
  // average intercept and main effect
  real mean_intercept;
  // item level deviations from average effects
  vector[I] dev_intercept;
  vector<lower=rep_array(-fabs(dev_intercept), C), upper=rep_array(fabs(dev_intercept), C)>[I] class_interaction[C];
  ordered[C] class_effect_non;
  
}

transformed parameters {
  matrix[I, C] log_prob;    // Probability of correct response for each class on each item
  // Probability of correct response for each class on each item
  for (c in 1:C) 
     log_prob[, c] = log_inv_logit(mean_intercept + dev_intercept + class_interaction[c] + class_effect_non[c]);
}
model{
  // Priors
  mean_intercept ~ normal(0, 5);
  
  for (c in 1:C)
    class_interaction[c] ~ std_normal();
  
  class_effect_non ~ normal(0, 3);
  dev_intercept ~ normal(0, 3);
 
  {
    for (j in 1:J) {
      row_vector[C] ps = log(nu') + y[j] * log_prob + (1 - y[j]) *  log1m_exp(log_prob);
      target += log_sum_exp(ps);
    }
  }
}
generated quantities {
  matrix[I, C] probs;
  
  for (c in 1:C)
    probs[, c] = inv_logit(mean_intercept + dev_intercept + class_interaction[c] + class_effect_non[c]);
}

The poLCA probs with 3 latent classes:

Estimated class population shares 
 0.5576 0.3631 0.0793 

> matrix(do.call("rbind", ch2$probs)[, 2], nrow = 9, ncol = 3, byrow = T)
            [,1]       [,2]      [,3]
 [1,] 0.90827110 0.31215950 0.9232933
 [2,] 0.33740620 0.16402789 0.5461474
 [3,] 0.06674344 0.03573206 0.4263940
 [4,] 0.06548031 0.05604210 0.4179348
 [5,] 0.21915046 0.04436150 0.7654777
 [6,] 0.31963272 0.18293996 0.4710176
 [7,] 0.11281159 0.09768093 0.5123801
 [8,] 0.13989171 0.10987954 0.6192120
 [9,] 0.32487311 0.18780498 0.3488297

the Stan model (swap rows 1 and 2)

> wide_model$summary("nu")
# A tibble: 3 x 10
  variable  mean median     sd    mad     q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 nu[1]    0.312 0.291  0.137  0.136  0.123  0.563  1.02     212.     281.
2 nu[2]    0.585 0.608  0.134  0.134  0.329  0.772  1.02     223.     393.
3 nu[3]    0.102 0.0980 0.0317 0.0302 0.0590 0.160  1.03     113.     600.
> matrix(wide_model$summary("probs")$mean, nr = 9, nc = 3, byrow = F)
            [,1]       [,2]      [,3]
 [1,] 0.34482077 0.82991176 0.9347736
 [2,] 0.12325590 0.31869247 0.5406846
 [3,] 0.03166788 0.06463380 0.3658653
 [4,] 0.04844336 0.06352205 0.3757007
 [5,] 0.04460407 0.20768053 0.6291819
 [6,] 0.15387589 0.32565031 0.5479768
 [7,] 0.08637647 0.11032362 0.4652101
 [8,] 0.10067586 0.13870967 0.5374320
 [9,] 0.14795669 0.31827212 0.4234829
1 Like

This simplifies the model and only tries to find parameters for the item by class interactions. It takes four min 2 min on my machine. Though you still may not like the ordered vector restriction.

data {
  int<lower=1> I;                                     // number of items
  int<lower=1> J;                                     // number of respondents
  matrix[J, I] y;                                     // score for obs n
  int<lower=1> C;           // # of attribute profiles (latent classes) 
}
parameters {
  simplex[C] nu;
  ordered[C] class_interaction[I];
}
transformed parameters {
  matrix[I, C] log_prob;    // Probability of correct response for each class on each item
  // Probability of correct response for each class on each item
  for (c in 1:C) 
    log_prob[, c] = log_inv_logit( to_vector(class_interaction[ , c]));
}
model{
  // Priors
  for (c in 1:C)
     to_vector(class_interaction[ , c]) ~ std_normal();
  {
    for (j in 1:J) {
      row_vector[C] ps = log(nu') + y[j] * log_prob + (1 - y[j]) *  log1m_exp(log_prob);
      target += log_sum_exp(ps);
    }
  }
}
generated quantities {
  matrix[I, C] probs;
  
  for (c in 1:C)
    probs[, c] = inv_logit( to_vector(class_interaction[ , c]));
}

1 Like

@simonbrauer thanks for pointing out the issue with that part of the code.

@spinkney these are very intereting changes to the code for the model. I will play with them, thank you very much

1 Like

@Mauricio_Garnier-Villarre Did you happen to find a solution for polytomous variables?

@Kevin_Linares no, I havent been able to extend this for polytomous items. Also, to be fair, for the sake of the project I am working on I drop this so I havent worked on it for a bit

If you want to try this, I would susgget to take this code, that nis written for the log probability of answering 1 (like logistic regression)

  for (c in 1:C) 
     log_prob[, c] = log_inv_logit(mean_intercept + dev_intercept + class_interaction[c] + class_effect_non[c]);
}

And change it for multinomial logistic approach for example

@Kevin_Linares i did a quick test, and if you use the modified code shared by @spinkney with polytomous items its works as long as you separate them into all possible binary/dummy variables

Something like this example

data {
  int<lower=1> I;                                     // number of items
  int<lower=1> J;                                     // number of respondents
  matrix[J, I] y;                                     // score for obs n
  int<lower=1> C;           // # of attribute profiles (latent classes) 
}
parameters {
  simplex[C] nu;
  // average intercept and main effect
  real mean_intercept;
  // item level deviations from average effects
  vector[I] dev_intercept;
  vector<lower=-7, upper=7>[I] class_interaction[C];
  ordered[C] class_effect_non;
}

transformed parameters {
  matrix[I, C] log_prob;    // Probability of correct response for each class on each item
  // Probability of correct response for each class on each item
  for (c in 1:C) 
     log_prob[, c] = log_inv_logit(mean_intercept + dev_intercept + class_interaction[c] + class_effect_non[c]);
}
model{
  // Priors
  mean_intercept ~ normal(0, 5);
  
  for (c in 1:C)
    class_interaction[c] ~ std_normal();
  
  class_effect_non ~ normal(0, 3);
  dev_intercept ~ normal(0, 3);
 
  {
    for (j in 1:J) {
      row_vector[C] ps = log(nu') + y[j] * log_prob + (1 - y[j]) *  log1m_exp(log_prob);
      target += log_sum_exp(ps);
    }
  }
}
generated quantities {
  matrix[I, C] probs;
  
  for (c in 1:C)
    probs[, c] = inv_logit(mean_intercept + dev_intercept + class_interaction[c] + class_effect_non[c]);
}
library(psych)
library(poLCA)
data(election)


f2a <- cbind(MORALG,CARESG,KNOWG)~1
nes2a <- poLCA(f2a,election,nclass=2,nrep=10,  maxiter=20000)
nes2a


d2 <- election[,c("MORALG","CARESG","KNOWG")]
head(d2)
apply(d2, 2, table)
summary(d2)

table(d2[,1],dummy.code(d2[,1])[,1] )

d3 <- cbind(dummy.code(d2[,1]),
            dummy.code(d2[,2]),
            dummy.code(d2[,3]) )

head(d3)
d4 <- na.omit(d3)
dim(d4)
head(d4)

wide_data <- list(
  I = ncol(d4),
  J = nrow(d4),
  y = d4,
  C = 2
)

pars <- c("nu","probs")
iter <- 1000
warmup <- iter - 500
wide_model <- stan(model_code=lca_wide, 
                   data = wide_data,
                   pars=pars,
                   chains = 3, 
                   iter = iter,
                   warmup = warmup)
wide_model

2 Likes

I finally had a chance to test this stan code with polytomous variables, and as long as I use all categories dummy coded into binary data it works well. Has anybody extended this code that spinkney shared to write out predicted class memberships in the generated quantities code chunk?

1 Like

Kevin

This added sections in the generated quantities should work, by having probs_sub as the probability of beloging on each class per subject

generated quantities {
  matrix[I, C] probs;
  matrix[J, C] probs_sub;
  
  for (c in 1:C){
    probs[, c] = inv_logit(mean_intercept + dev_intercept + class_interaction[c] + class_effect_non[c]);
  }
     
     {
       matrix[J, C] temp_log;
      vector[J] temp_sum;
      for (j in 1:J) {
       temp_log[j,] = log(nu') + y[j] * log_prob + (1 - y[j]) *  log1m_exp(log_prob);
       temp_sum[j] = sum(temp_log[j,]);
       
       for(c in 1:C){
       probs_sub[j,c] = temp_log[j,c] / temp_sum[j];
       }}
     }

}

I can definetely be improved to be faster, I am not great optimizing Stan code, and I rely too much on loops

@Kevin_Linares / @Mauricio_Garnier-Villarre would either of you (or both) like to put together a case study for this with me?

1 Like

@spinkney I would like to join you on this. This fits my work and current projects.

2 Likes

I was messing around with this again and read the poLCA methodology and looked at their code. The log likelihood in their paper is written as

\log L = \sum_{i=1}^N \ln \sum_{r=1}^{R} p_r \prod_{j=1}^J \prod_{k=1}^{K_j} (\pi_{jrk})^{Y_ijk}

where Y is 1 if the i^{th} observation “observed” the k^{th} level for the j^{th} variable and 0 otherwise.

In Stan, I’ve coded this as only incrementing the log likelihood when the level is observed at the line with if (y[n, j] == k). A more efficient implementation would only loop over the values observed, however, this is relatively fast already.

data {
  int<lower=1> K;  // number of outcomes
  int<lower=1> J;  // number of variables
  int<lower=1> N;  // number of respondents
  matrix[N, J] y;    // score for obs n
  int<lower=1> C;    // # of attribute profiles (latent classes) 
}
parameters {
  simplex[C] nu;
  array[J, C] simplex[K] prob;
}
model{
  matrix[N, C] llik;
  vector[C] lognu = log(nu);
  
 for (n in 1:N) {
    llik[n] = lognu';
    for (j in 1:J) {
      for (k in 1:K) {
        if (y[n, j] == k) {
         for (c in 1:C) {
           llik[n, c] += log(prob[j, c, k]);
         }
        }
      }
    }

    target += log_sum_exp(llik[n]);
  }
}

In the example @Mauricio_Garnier-Villarre gave we omit any values with NA but keep the levels coded by R as 1, 2, …, K.

library(psych)
library(poLCA)
library(cmdstanr)
data(election)

library(data.table)
f2a <- cbind(MORALG,CARESG,KNOWG)~1
nes2a <- poLCA(f2a,election,nclass=5, nrep=20,  maxiter=20000)
nes2a

d2 <- as.data.table(d2)
d2_class <- d2[, lapply(.SD, as.numeric)]
setnames(d2_class, names(d2_class), names(d2))
d2_class

d2_class <- na.omit(d2_class)

orig_data <- list(
  K = 4,
  J = ncol(d2_class),
  N = nrow(d2_class),
  y = d2_class,
  C = 2
)

Using the MLE optimizer in Stan I can get nearly the same results as poLCA up to a reordering.

> stan_optimize
          [,1]     [,2]       [,3]       [,4]
[1,] 0.5002110 0.492138 0.00600374 0.00164642
[2,] 0.3284040 0.592320 0.05709290 0.02218260
[3,] 0.4650950 0.506450 0.02428620 0.00416851
[4,] 0.0288600 0.471371 0.33813800 0.16163100
[5,] 0.0133780 0.264007 0.48465200 0.23796300
[6,] 0.0887964 0.620432 0.22458900 0.06618230
> polca_est[c(1,3,5,2,4,6),]
           [,1]      [,2]        [,3]        [,4]
[1,] 0.51040275 0.4832574 0.004830797 0.001509072
[2,] 0.33566879 0.5880106 0.054643069 0.021677569
[3,] 0.47317445 0.4993648 0.023399591 0.004061204
[4,] 0.03020396 0.4802569 0.331512181 0.158026913
[5,] 0.01385400 0.2757457 0.476993769 0.233406504
[6,] 0.08992838 0.6244836 0.220739533 0.064848468

Because HMC is trying to explore all the modes and averaging across all these the estimates for that differ significantly

> stan_hmc
          [,1]      [,2]      [,3]       [,4]
[1,] 0.2644694 0.4798311 0.1727563 0.08294320
[2,] 0.1708756 0.4271868 0.2709294 0.13100819
[3,] 0.2762197 0.5623499 0.1249803 0.03645005
[4,] 0.2649671 0.4797690 0.1723105 0.08295339
[5,] 0.1702448 0.4277999 0.2700203 0.13193500
[6,] 0.2757564 0.5632814 0.1249389 0.03602324
1 Like

I forgot to mention that there are no divergences or treedepth issues and it runs in about 8 seconds for 200 wu/ 200 sampling.

@spinkney this seems very interstings. I will play with this code soon (need to get some other things out). But I wonder if the HMC is failing to converged because it requires an extra constraint for identification, like ordering the nu , like in this thread ?

@spinkney I tried adding a latent class order constraint to the model, it seems to work sometimes. So, in some runs it converged without issues, and in others it doesnt converges as it finds the multimodal posterior. So, my guess is that it seems it it sensnitive to starting values. Also, sensnitive to the alpha prior of the class proportion

data {
  int<lower=1> K;  // number of outcomes
  int<lower=1> J;  // number of variables
  int<lower=1> N;  // number of respondents
  matrix[N, J] y;    // score for obs n
  int<lower=1> C;    // of attribute profiles (latent classes) 
  vector<lower=0>[C] alpha;
}
parameters {
  //simplex[C] nu;
  positive_ordered[C] lambda;
  array[J, C] simplex[K] prob;
}
transformed parameters {
  simplex[C] nu = lambda / sum(lambda);
}
model{
  matrix[N, C] llik;
  vector[C] lognu = log(nu);
  
  for (n in 1:N) {
    llik[n, 1:C] = rep_row_vector(0, C);
    for (j in 1:J) {
      for (k in 1:K) {
        if (y[n, j] == k) {
         for (c in 1:C) llik[n, c] += log(prob[j, c, k]);
        }
      }
    }
    vector[C] llik_n = lognu;

    for (c in 1:C)
      llik_n += llik[n, c];

    target += log_sum_exp(llik[n]);
  }
  
  target += gamma_lpdf(lambda | alpha, 1);
  
}

In any case I have seen (JAGS, Stan etc), the MCMC models need an additional constraint to get away from the multimodal posteriors. The previous code is able to add this constraint by ordered[C] class_effect_non;

1 Like

I wonder if rather than using random starting values, initializing values taken from the same stan model but using posterior results from a variational inference (rstan::vb) approach for both nu and probs? I’ll give it a try and if I make progress I’ll report back.

1 Like