# Latent Profile Models in Stan

Hi all,

I have a dataset of (N=600, D=34) where each row is the set of answers to 34 questions from a questionairre. Each patient belongs from one of nine syndromes, and so I would expect nine separate latent classes to model the data. However, there are similarities between syndromes and so my aim is to see whether one can identify a smaller latent space and map syndromes to an overarching latent profile.

I would like to run a latent profile analysis on the model and at first was using tidyLPA to do so. However, it doesn’t give me the customisability I’d like for the task and wanted to implement a version in Stan instead.

My first aim was to implement a base latent profile model and see whether I can obtain similar-ish results to tidyLPA before I add further complexity. However, the results I get are markedly different. (Note that the questionaire answers are in fact ordinal, but since tidyLPA works with continuous variables, I wrote my first Stan model using that assumption too).

As far as I follow, a continuous latent profile is as follows:

\pi \sim \text{Dir}(\alpha) \\ z_n\sim \text{Categorical}(\pi)\\ \mu_{cq} \sim \mathcal{N}(0, \sigma_{0})\\ \sigma_{cq}\sim \text{HalfNormal}(0, 1)\\ y_{nq} \sim \mathcal{N}(\mu_{z_nq}, \sigma_{z_nq})

where \pi is a vector of length C indicating the number of latent clusters, q is the question answered in the questionaire, and n is the index of the patient. Of course, since we can’t use continuous variables in Stan I marginalise the z_n variables to get:

y_{nq}\sim \sum_{c=i}^{C}\pi_{c}\mathcal{N}(\mu_{cq}, \sigma_{cq})

My model is posted below. Note that I melted the dataframe such that each row is an answer from a single patient for a single question in the questionaire.

data {
int<lower=1> N;             // Data length
int<lower=1> num_patients; // Number of patients
int<lower=1> K; // Number of ordinal answers to the questionaire
int<lower=1> C; // Number of clusters
int<lower=1> S; // Number of Syndromes
int<lower=1> Q; // Number of Questions

int<lower=1, upper=S> s[N]; // Syndrome idx -
int<lower=1, upper=Q> q[N]; // Question idx
int<lower=1, upper=num_patients> row_patient_idx[N];
int<lower=1, upper=S> patient_syndrome_idx[num_patients];
}

transformed data{
vector[C] alpha_C; // Creating a uniform prior for the dirichlet.
for (ci in 1:C) {
alpha_C[ci] = 1;
}
}

parameters {
simplex[C] pi_alpha;
vector[C] mu_gamma[Q];
vector<lower=0>[C] sigma_gamma[Q];
}

transformed parameters {
vector[C] log_components[N]; {
for (ni in 1:N) {
for (ci in 1:C) {
log_components[ni, ci] = normal_lpdf(y[ni] | mu_gamma[q[ni], ci], sigma_gamma[q[ni], ci]);
}
}
}
}

model {
pi_alpha ~ dirichlet(alpha_C);
for (qi in 1:Q) {
mu_gamma[qi, ] ~ normal(0, 1);
sigma_gamma[qi, ] ~ normal(0, 1);
}
target += log_mix(pi_alpha, log_components);
}

generated quantities {
vector<lower=0, upper=1>[C] prob_cluster_N[num_patients]; {
vector[C] unnormalised_cluster_N[num_patients]; {
for (p in 1:num_patients) {
for (ci in 1:C) {
unnormalised_cluster_N[p, ci] = 0;
}
}
for (ni in 1:N) {
for (ci in 1:C) {
unnormalised_cluster_N[row_patient_idx[ni], ci] = unnormalised_cluster_N[row_patient_idx[ni], ci] + (
normal_lpdf(y[ni] | mu_gamma[q[ni], ci], sigma_gamma[q[ni], ci])
);
}
}
for (p in 1:num_patients) {
prob_cluster_N[p, ] = exp(unnormalised_cluster_N[p, ] - log_sum_exp(unnormalised_cluster_N[p, ]));
}
}
}
vector<lower=0, upper=1>[C] prob_cluster_S[S]; {
vector[C] unnormalised_cluster_S[S]; {
for (si in 1:S) {
for (ci in 1:C) {
unnormalised_cluster_S[si, ci] = 0;
for (p in 1:num_patients) {
if (patient_syndrome_idx[p] == si) {
unnormalised_cluster_S[si, ci] = unnormalised_cluster_S[si, ci] + prob_cluster_N[p, ci];
}
}
}
prob_cluster_S[si, ] = unnormalised_cluster_S[si, ] / sum(unnormalised_cluster_S[si, ]);
}
}
}
}


The items in the generated quantities are an attempt to find the probability that a patient belongs to a particular cluster, and the probability that a syndrome is mostly drawn from a particular cluster.

With this model and a choice of C=5 I get the following for prob_cluster_N (after choosing the representative cluster to be from the largest posterior probability for that patient (across all questions).

> table(p_cluster_n_wide$c_idx) 1 2 3 4 5 555 13 3 14 20  However, running a (ideally similar) model in tidyLPA yields the following: lpa_model = full_datasets$full_dataframe %>%
as.matrix()%>%
single_imputation() %>%
scale() %>%
estimate_profiles(5)

> table(get_data(lpa_model)\$Class)

1   2   3   4   5
98  94 153  54 206


The cluster assignments in the stan model are all concentrated towards cluster 1, whereas they are more evenly distributed in the tidyLPA model.

Any ideas why that may be, or where I may have gone wrong would be much appreciated. Thanks!

PS: Note that I am inferring this posterior using VI, not MCMC.

For each respondent, you need to calculate the total log-likelihood (across all questions) for each class, then use log_mix() over those. It looks like you are doing this over the entire data instead of by respondent.

Here’s an example using log_sum_exp() instead of log_mix() (they’re used for the same purpose, though log_sum_mix() doesn’t take a separate vector of mixing proportions).

transformed parameters{
// One vector for each respondent
vector[C] log_components[num_patients];
for(np in 1:num_patients){
// Initialize log-components with the log of baseline probabilities
// instead of providing mixing proportions as in log_mix()
log_components[np] = log(pi_alpha);
}

for (ni in 1:N) {
int np = row_patient_idx[ni];
for (ci in 1:C) {
// Add to the respondent's log-likelihood
log_components[np, ci] += normal_lpdf(y[ni] | mu_gamma[q[ni], ci], sigma_gamma[q[ni], ci]);
}
}
}
model{
...
for(np in 1:num_patients){
target += log_sum_exp(log_components[np]);
}
}

1 Like

Ah I see! Thanks for the help.

Taking a look it seems that that in this case, the patient level addition to the likelihood (which I wasn’t doing)is equal to

\sum_{n}^{N}\sum_{c}^{C} \pi_c \mathcal{N}(\mathbf{y}_{n}~|~\mathbf{\mu}_{c}, \Sigma_{c}) = \sum_{n}^{N}\sum_{c}^{C} \pi_c \prod_{q}^{Q} \mathcal{N}(y_{qn}~|~\mu_{qc}, \sigma_{qc})

where \Sigma_{C} has off diagonals equal to zero and each component with a separate variance term.

I think your answer (on the RHS of the above eq.) should also be equivalent to (after rejigging the data a little bit) the LHS. Implementation wise in Stan I got the following:

...
for (np in 1:num_patients) {
vector[C] placeholder_pq;
for (ci in 1:C) {
placeholder_pq[ci] = log(pi_alpha[ci]) + normal_lpdf(Y_matrix[np, ] | mu_gamma [, ci], sigma_gamma[, ci]);
}
target += log_sum_exp(placeholder_pq);
}
...


where Y_matrix is instead the square matrix of responses (of size N x Q).

It is working far better now, thanks again!