@spinkney I would like to join you on this. This fits my work and current projects.
I was messing around with this again, read the poLCA methodology, and looked at their code. The log likelihood in their paper is written as

$$\ln L = \sum_{i=1}^{N} \ln \sum_{r=1}^{R} p_r \prod_{j=1}^{J} \prod_{k=1}^{K_j} \left( \pi_{jrk} \right)^{Y_{ijk}}$$

where $Y_{ijk}$ is 1 if the $i^{th}$ observation “observed” the $k^{th}$ level for the $j^{th}$ variable and 0 otherwise, $p_r$ is the prior probability of class $r$, and $\pi_{jrk}$ is the probability of level $k$ on variable $j$ conditional on class $r$.
In Stan, I’ve coded this as only incrementing the log likelihood when the level is observed, at the line with if (y[n, j] == k). A more efficient implementation would loop only over the observed levels rather than checking all K; however, this is already relatively fast.
data {
  int<lower=1> K;       // number of outcomes
  int<lower=1> J;       // number of variables
  int<lower=1> N;       // number of respondents
  matrix[N, J] y;       // score for obs n
  int<lower=1> C;       // number of attribute profiles (latent classes)
}
parameters {
  simplex[C] nu;
  array[J, C] simplex[K] prob;
}
model {
  matrix[N, C] llik;
  vector[C] lognu = log(nu);

  for (n in 1:N) {
    llik[n] = lognu';
    for (j in 1:J) {
      for (k in 1:K) {
        if (y[n, j] == k) {
          for (c in 1:C) {
            llik[n, c] += log(prob[j, c, k]);
          }
        }
      }
    }
    target += log_sum_exp(llik[n]);
  }
}
In the example @Mauricio_Garnier-Villarre gave, we omit any rows with NA values but keep the levels coded by R as 1, 2, …, K.
library(psych)
library(poLCA)
library(cmdstanr)
library(data.table)

data(election)

f2a <- cbind(MORALG, CARESG, KNOWG) ~ 1
nes2a <- poLCA(f2a, election, nclass = 5, nrep = 20, maxiter = 20000)
nes2a

# subset the three indicators used in f2a
d2 <- election[, c("MORALG", "CARESG", "KNOWG")]
d2 <- as.data.table(d2)

# convert the factor levels to their numeric codes 1, 2, ..., K
d2_class <- d2[, lapply(.SD, as.numeric)]
setnames(d2_class, names(d2_class), names(d2))
d2_class

# drop respondents with any missing values
d2_class <- na.omit(d2_class)
orig_data <- list(
  K = 4,                    # number of levels per variable
  J = ncol(d2_class),
  N = nrow(d2_class),
  y = as.matrix(d2_class),  # pass a numeric matrix rather than a data.table
  C = 2
)
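To run this with cmdstanr, something like the following should work (lca.stan is just a placeholder for wherever the model above is saved):

# compile the latent class model; "lca.stan" is a placeholder file name
mod <- cmdstan_model("lca.stan")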
Using the MLE optimizer in Stan, I get nearly the same results as poLCA, up to a reordering of the classes.
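Something along these lines, where mod is the compiled model from above (the seed is my own choice, and reshaping the estimates into the stan_optimize matrix below would take one more step that I am omitting):

# point estimation with Stan's optimizer, analogous to poLCA's ML fit
fit_opt <- mod$optimize(data = orig_data, seed = 123)

# named vector of estimates for the conditional outcome probabilities
fit_opt$mle("prob")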
> stan_optimize
[,1] [,2] [,3] [,4]
[1,] 0.5002110 0.492138 0.00600374 0.00164642
[2,] 0.3284040 0.592320 0.05709290 0.02218260
[3,] 0.4650950 0.506450 0.02428620 0.00416851
[4,] 0.0288600 0.471371 0.33813800 0.16163100
[5,] 0.0133780 0.264007 0.48465200 0.23796300
[6,] 0.0887964 0.620432 0.22458900 0.06618230
> polca_est[c(1,3,5,2,4,6),]
[,1] [,2] [,3] [,4]
[1,] 0.51040275 0.4832574 0.004830797 0.001509072
[2,] 0.33566879 0.5880106 0.054643069 0.021677569
[3,] 0.47317445 0.4993648 0.023399591 0.004061204
[4,] 0.03020396 0.4802569 0.331512181 0.158026913
[5,] 0.01385400 0.2757457 0.476993769 0.233406504
[6,] 0.08992838 0.6244836 0.220739533 0.064848468
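For reference, polca_est above is presumably stacked from the $probs element of a 2-class poLCA fit (to match C = 2 in the Stan model); my guess at how it was produced:

# refit poLCA with 2 classes; $probs holds one classes-by-outcomes matrix
# per manifest variable, which we stack into a 6 x 4 table
nes2b <- poLCA(f2a, election, nclass = 2, nrep = 20, maxiter = 20000)
polca_est <- do.call(rbind, nes2b$probs)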
Because HMC tries to explore all of the modes, and the estimates average across all of them, the results differ significantly:
> stan_hmc
[,1] [,2] [,3] [,4]
[1,] 0.2644694 0.4798311 0.1727563 0.08294320
[2,] 0.1708756 0.4271868 0.2709294 0.13100819
[3,] 0.2762197 0.5623499 0.1249803 0.03645005
[4,] 0.2649671 0.4797690 0.1723105 0.08295339
[5,] 0.1702448 0.4277999 0.2700203 0.13193500
[6,] 0.2757564 0.5632814 0.1249389 0.03602324
I forgot to mention that there are no divergences or treedepth issues, and it runs in about 8 seconds with 200 warmup / 200 sampling iterations.
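For what it's worth, a run along these lines matches that setup (the number of chains and the seed are my own choices):

# HMC with the short 200 warmup / 200 sampling budget described above
fit_hmc <- mod$sample(
  data = orig_data,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 200,
  iter_sampling = 200,
  seed = 123
)

# posterior means of the conditional outcome probabilities
fit_hmc$summary("prob", "mean")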
@spinkney this seems very interesting. I will play with this code soon (I need to get some other things out first). But I wonder if HMC is failing to converge because the model requires an extra constraint for identification, like ordering the nu, as in this thread?
@spinkney I tried adding a latent class order constraint to the model, and it seems to work sometimes. In some runs it converges without issues, and in others it does not, because it finds the multimodal posterior. So my guess is that it is sensitive to starting values. It also seems sensitive to the alpha prior on the class proportions.
data {
  int<lower=1> K;            // number of outcomes
  int<lower=1> J;            // number of variables
  int<lower=1> N;            // number of respondents
  matrix[N, J] y;            // score for obs n
  int<lower=1> C;            // number of attribute profiles (latent classes)
  vector<lower=0>[C] alpha;  // prior weights for the class proportions
}
parameters {
  //simplex[C] nu;
  positive_ordered[C] lambda;   // ordered to identify the class labels
  array[J, C] simplex[K] prob;
}
transformed parameters {
  simplex[C] nu = lambda / sum(lambda);
}
model {
  matrix[N, C] llik;
  vector[C] lognu = log(nu);

  for (n in 1:N) {
    llik[n] = rep_row_vector(0, C);
    for (j in 1:J) {
      for (k in 1:K) {
        if (y[n, j] == k) {
          for (c in 1:C) llik[n, c] += log(prob[j, c, k]);
        }
      }
    }
    // add the log class proportions before marginalizing over the classes
    vector[C] llik_n = lognu + llik[n]';
    target += log_sum_exp(llik_n);
  }
  // gamma prior on lambda; nu = lambda / sum(lambda) is then Dirichlet(alpha)
  target += gamma_lpdf(lambda | alpha, 1);
}
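Note that this version takes alpha as data, so the list passed from R needs one extra element, e.g.:

# same data as before plus the prior weights for the class proportions
# (all-ones alpha is just an example value)
orig_data_ord <- c(orig_data, list(alpha = rep(1, orig_data$C)))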
In any case I have seen (JAGS, Stan, etc.), the MCMC models need an additional constraint to get away from the multimodal posteriors. The previous code adds this constraint with ordered[C] class_effect_non;
I wonder whether, rather than using random starting values, we could initialize nu and prob from the posterior results of a variational inference fit (rstan::vb) of the same Stan model. I'll give it a try, and if I make progress I'll report back.
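A rough sketch of that idea using cmdstanr's $variational() in place of rstan::vb, since cmdstanr is already in use here (lca_ordered.stan is a placeholder file name for the ordered model, and passing a fitted object directly as init assumes a recent cmdstanr version):

# compile the ordered-lambda model; the file name is a placeholder
mod_ord <- cmdstan_model("lca_ordered.stan")

# variational fit of the same model to get rough posterior locations
fit_vb <- mod_ord$variational(data = orig_data_ord, seed = 123)

# use draws from the VB posterior to initialize the HMC chains
fit_hmc2 <- mod_ord$sample(
  data = orig_data_ord,
  init = fit_vb,
  chains = 4,
  iter_warmup = 200,
  iter_sampling = 200,
  seed = 123
)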