Hi all,
I have a dataset of shape (N=600, D=34), where each row is the set of answers to 34 questions from a questionnaire. Each patient belongs to one of nine syndromes, so I would expect nine separate latent classes to model the data. However, there are similarities between syndromes, and my aim is to see whether one can identify a smaller latent space and map syndromes to an overarching latent profile.
I would like to run a latent profile analysis on the data and at first was using tidyLPA
to do so. However, it doesn’t give me the customisability I’d like for the task, so I wanted to implement a version in Stan instead.
My first aim was to implement a basic latent profile model and see whether I can obtain similar-ish results to tidyLPA
before adding further complexity. However, the results I get are markedly different. (Note that the questionnaire answers are in fact ordinal, but since tidyLPA
works with continuous variables, I wrote my first Stan model under that assumption too.)
As far as I follow, the generative model for a continuous latent profile analysis is:

$$z_n \sim \text{Categorical}(\pi), \qquad y_{n,q} \mid z_n = c \sim \text{Normal}(\mu_{q,c}, \sigma_{q,c}),$$

where \pi is a simplex of length C (the number of latent clusters), q indexes the questions in the questionnaire, and n indexes the patients. Of course, since we can’t use discrete latent variables in Stan, I marginalise out the z_n to get:

$$p(y_n \mid \pi, \mu, \sigma) = \sum_{c=1}^{C} \pi_c \prod_{q=1}^{Q} \text{Normal}(y_{n,q} \mid \mu_{q,c}, \sigma_{q,c}).$$
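As a sanity check outside Stan, the marginalised per-patient log-likelihood can be sketched in a few lines of numpy/scipy (a hypothetical helper for checking, not part of the model itself):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def patient_log_lik(y_n, log_pi, mu, sigma):
    """Marginal log-likelihood of one patient's Q answers.

    y_n: (Q,) answers; log_pi: (C,) log mixing weights;
    mu, sigma: (Q, C) per-question, per-cluster parameters.
    """
    # (Q, C) matrix of log Normal(y_{n,q} | mu_{q,c}, sigma_{q,c})
    lp = norm.logpdf(y_n[:, None], loc=mu, scale=sigma)
    # Sum over questions within each cluster, then mix over clusters
    # on the log scale: log sum_c exp(log pi_c + sum_q lp[q, c]).
    return logsumexp(log_pi + lp.sum(axis=0))
```

The key point is that the product over questions happens inside each cluster before the mixture is taken, i.e. the mixture is over whole patients, not individual answers.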
My model is posted below. Note that I melted the dataframe such that each row is an answer from a single patient to a single question in the questionnaire.
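For illustration, the wide-to-long reshape looks like the following (a Python/pandas sketch with made-up column names; my actual reshaping was done in R, and any melt/pivot_longer tool does the same job):

```python
import pandas as pd

# Hypothetical wide data: one row per patient, one column per question.
wide = pd.DataFrame({
    "patient": [1, 2],
    "syndrome": [1, 3],
    "Q1": [2, 4],
    "Q2": [1, 5],
})

# One row per (patient, question) pair, as the Stan data block expects.
long = wide.melt(
    id_vars=["patient", "syndrome"],
    var_name="question",
    value_name="answer",
)
```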
data {
  int<lower=1> N;                        // Data length (one row per patient-question pair)
  int<lower=1> num_patients;             // Number of patients
  int<lower=1> K;                        // Number of ordinal answers to the questionnaire
  int<lower=1> C;                        // Number of clusters
  int<lower=1> S;                        // Number of syndromes
  int<lower=1> Q;                        // Number of questions
  int<lower=1, upper=S> s[N];            // Syndrome idx
  int<lower=1, upper=Q> q[N];            // Question idx
  int<lower=1, upper=K> y[N];            // Answers
  int<lower=1, upper=num_patients> row_patient_idx[N];
  int<lower=1, upper=S> patient_syndrome_idx[num_patients];
}
transformed data {
  vector[C] alpha_C;                     // Uniform prior for the Dirichlet.
  for (ci in 1:C) {
    alpha_C[ci] = 1;
  }
}
parameters {
  simplex[C] pi_alpha;                   // Mixing proportions
  vector[C] mu_gamma[Q];                 // Per-question, per-cluster means
  vector<lower=0>[C] sigma_gamma[Q];     // Per-question, per-cluster sds
}
transformed parameters {
  vector[C] log_components[N];
  for (ni in 1:N) {
    for (ci in 1:C) {
      log_components[ni, ci] = normal_lpdf(y[ni] | mu_gamma[q[ni], ci], sigma_gamma[q[ni], ci]);
    }
  }
}
model {
  pi_alpha ~ dirichlet(alpha_C);
  for (qi in 1:Q) {
    mu_gamma[qi] ~ normal(0, 1);
    sigma_gamma[qi] ~ normal(0, 1);      // Half-normal, given the lower bound.
  }
  target += log_mix(pi_alpha, log_components);
}
generated quantities {
  vector<lower=0, upper=1>[C] prob_cluster_N[num_patients];
  vector<lower=0, upper=1>[C] prob_cluster_S[S];
  {
    vector[C] unnormalised_cluster_N[num_patients];
    for (p in 1:num_patients) {
      for (ci in 1:C) {
        unnormalised_cluster_N[p, ci] = 0;
      }
    }
    // Accumulate each patient's per-cluster log-likelihood over all their answers.
    for (ni in 1:N) {
      for (ci in 1:C) {
        unnormalised_cluster_N[row_patient_idx[ni], ci] += normal_lpdf(y[ni] | mu_gamma[q[ni], ci], sigma_gamma[q[ni], ci]);
      }
    }
    // Normalise on the log scale.
    for (p in 1:num_patients) {
      prob_cluster_N[p] = exp(unnormalised_cluster_N[p] - log_sum_exp(unnormalised_cluster_N[p]));
    }
  }
  {
    vector[C] unnormalised_cluster_S[S];
    for (si in 1:S) {
      for (ci in 1:C) {
        unnormalised_cluster_S[si, ci] = 0;
        for (p in 1:num_patients) {
          if (patient_syndrome_idx[p] == si) {
            unnormalised_cluster_S[si, ci] += prob_cluster_N[p, ci];
          }
        }
      }
      // Average the membership probabilities of the syndrome's patients.
      prob_cluster_S[si] = unnormalised_cluster_S[si] / sum(unnormalised_cluster_S[si]);
    }
  }
}
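For reference, as I understand it Stan's log_mix(theta, lp) computes log_sum_exp(log(theta) + lp), and the vectorised signature used above sums that quantity over the elements of the array of vectors; a small numpy check of the identity:

```python
import numpy as np
from scipy.special import logsumexp

def log_mix(log_theta, lp):
    """Log of a mixture density from log weights and per-component log densities."""
    return logsumexp(log_theta + lp)

# My understanding of the vectorised signature: the call over an array of
# vectors sums the per-element log_mix values.
theta = np.array([0.2, 0.8])                    # mixing weights
lps = np.array([[-1.0, -2.0], [-0.5, -3.0]])    # per-row component log densities
total = sum(log_mix(np.log(theta), lp) for lp in lps)
```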
The generated quantities
block is an attempt to compute, for each patient, the posterior probability of belonging to each cluster, and, for each syndrome, the distribution of cluster membership across its patients.
With this model and a choice of C=5 I get the following for prob_cluster_N
(after assigning each patient to the cluster with the largest posterior probability for that patient, across all questions):
> table(p_cluster_n_wide$c_idx)
1 2 3 4 5
555 13 3 14 20
However, running an (ideally similar) model in tidyLPA
yields the following:
lpa_model = full_datasets$full_dataframe %>%
  as.matrix() %>%
  single_imputation() %>%
  scale() %>%
  estimate_profiles(5)
> table(get_data(lpa_model)$Class)
1 2 3 4 5
98 94 153 54 206
The cluster assignments in the Stan model are almost all concentrated in cluster 1, whereas they are much more evenly distributed in the tidyLPA
model.
Any ideas as to why that may be, or where I may have gone wrong, would be much appreciated. Thanks!
PS: Note that I am inferring this posterior using VI, not MCMC.