Hi all,

I have a dataset of shape (N=600, D=34), where each row is the set of answers to 34 questions from a questionnaire. Each patient belongs to one of nine syndromes, so I would expect nine separate latent classes to model the data. However, there are similarities between the syndromes, and so my aim is to see whether one can identify a smaller latent space and map syndromes to an overarching latent profile.

I would like to run a latent profile analysis on the data and at first was using `tidyLPA` to do so. However, it doesn't give me the customisability I'd like for the task, so I wanted to implement a version in Stan instead.

My first aim was to implement a basic latent profile model and see whether I can obtain similar-ish results to `tidyLPA` before adding further complexity. However, the results I get are markedly different. (Note that the questionnaire answers are in fact ordinal, but since `tidyLPA` works with continuous variables, I wrote my first Stan model under that assumption too.)

As far as I follow, a continuous latent profile model is as follows:

$$
z_n \sim \text{Categorical}(\pi), \qquad y_{nq} \mid z_n = c \sim \mathcal{N}(\mu_{qc}, \sigma_{qc}),
$$

where \pi is a simplex of length C (the number of latent clusters) giving the mixing proportions, q indexes the questions in the questionnaire, and n indexes the patients. Of course, since we can't use discrete parameters in Stan, I marginalise out the z_n variables to get:

$$
\log p(y_n \mid \pi, \mu, \sigma) = \log \sum_{c=1}^{C} \pi_c \prod_{q=1}^{Q} \mathcal{N}(y_{nq} \mid \mu_{qc}, \sigma_{qc}).
$$

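To make sure the marginalisation is doing what I think, here's a quick toy check in Python (hand-rolled normal log-pdf, made-up numbers; not part of the model):

```python
import math

def normal_lpdf(y, mu, sigma):
    # log density of Normal(mu, sigma) evaluated at y
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def log_sum_exp(xs):
    # numerically stable log(sum(exp(x)))
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def patient_log_lik(y_n, pi, mu, sigma):
    """Marginal log-likelihood of one patient's Q answers over C clusters:
    log p(y_n) = log_sum_exp_c( log pi_c + sum_q log N(y_nq | mu_qc, sigma_qc) )."""
    lps = []
    for c in range(len(pi)):
        lp = math.log(pi[c])
        for q, y in enumerate(y_n):
            lp += normal_lpdf(y, mu[q][c], sigma[q][c])
        lps.append(lp)
    return log_sum_exp(lps)

# toy example: Q = 2 questions, C = 2 clusters (made-up numbers)
pi = [0.5, 0.5]
mu = [[0.0, 2.0], [0.0, 2.0]]      # mu[q][c]
sigma = [[1.0, 1.0], [1.0, 1.0]]
print(patient_log_lik([1.9, 2.1], pi, mu, sigma))  # ≈ -2.52 (dominated by cluster 2)
```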
My model is posted below. Note that I melted the dataframe so that each row is the answer from a single patient to a single question in the questionnaire.

```
data {
  int<lower=1> N;                                       // Data length (melted rows)
  int<lower=1> num_patients;                            // Number of patients
  int<lower=1> K;                                       // Number of ordinal answers to the questionnaire
  int<lower=1> C;                                       // Number of clusters
  int<lower=1> S;                                       // Number of syndromes
  int<lower=1> Q;                                       // Number of questions
  int<lower=1, upper=S> s[N];                           // Syndrome idx
  int<lower=1, upper=Q> q[N];                           // Question idx
  int<lower=1, upper=K> y[N];                           // Answers
  int<lower=1, upper=num_patients> row_patient_idx[N];
  int<lower=1, upper=S> patient_syndrome_idx[num_patients];
}
transformed data {
  vector[C] alpha_C;  // Uniform prior for the Dirichlet
  for (ci in 1:C) {
    alpha_C[ci] = 1;
  }
}
parameters {
  simplex[C] pi_alpha;
  vector[C] mu_gamma[Q];
  vector<lower=0>[C] sigma_gamma[Q];
}
transformed parameters {
  vector[C] log_components[N];
  for (ni in 1:N) {
    for (ci in 1:C) {
      log_components[ni, ci] = normal_lpdf(y[ni] | mu_gamma[q[ni], ci], sigma_gamma[q[ni], ci]);
    }
  }
}
model {
  pi_alpha ~ dirichlet(alpha_C);
  for (qi in 1:Q) {
    mu_gamma[qi] ~ normal(0, 1);
    sigma_gamma[qi] ~ normal(0, 1);
  }
  target += log_mix(pi_alpha, log_components);
}
generated quantities {
  vector<lower=0, upper=1>[C] prob_cluster_N[num_patients];
  vector<lower=0, upper=1>[C] prob_cluster_S[S];
  {
    vector[C] unnormalised_cluster_N[num_patients];
    vector[C] unnormalised_cluster_S[S];
    for (p in 1:num_patients) {
      for (ci in 1:C) {
        unnormalised_cluster_N[p, ci] = 0;
      }
    }
    // Accumulate per-cluster log-likelihoods for each patient
    for (ni in 1:N) {
      for (ci in 1:C) {
        unnormalised_cluster_N[row_patient_idx[ni], ci] += normal_lpdf(y[ni] | mu_gamma[q[ni], ci], sigma_gamma[q[ni], ci]);
      }
    }
    for (p in 1:num_patients) {
      prob_cluster_N[p] = exp(unnormalised_cluster_N[p] - log_sum_exp(unnormalised_cluster_N[p]));
    }
    // Average cluster memberships within each syndrome
    for (si in 1:S) {
      for (ci in 1:C) {
        unnormalised_cluster_S[si, ci] = 0;
        for (p in 1:num_patients) {
          if (patient_syndrome_idx[p] == si) {
            unnormalised_cluster_S[si, ci] += prob_cluster_N[p, ci];
          }
        }
      }
      prob_cluster_S[si] = unnormalised_cluster_S[si] / sum(unnormalised_cluster_S[si]);
    }
  }
}
```

The items in the `generated quantities` block are an attempt to find the probability that a patient belongs to a particular cluster, and the probability that a syndrome is mostly drawn from a particular cluster.
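(The per-patient normalisation there is just a softmax over the accumulated log scores; here is a quick Python equivalent with made-up numbers, to show what I intend it to compute:)

```python
import math

def cluster_probs(log_scores):
    # exp(lp_c - log_sum_exp(lp)): normalise per-cluster log scores onto the simplex,
    # mirroring the prob_cluster_N computation in generated quantities
    m = max(log_scores)
    exps = [math.exp(lp - m) for lp in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# toy per-cluster log-likelihoods for one patient (made-up numbers)
print(cluster_probs([-10.0, -11.0, -30.0]))  # ≈ [0.731, 0.269, ~0]
```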

With this model and a choice of C=5 I get the following for `prob_cluster_N` (after choosing the representative cluster for each patient as the one with the largest posterior probability, across all questions):

```
> table(p_cluster_n_wide$c_idx)
  1   2   3   4   5
555  13   3  14  20
```

However, running an (ideally similar) model in `tidyLPA` yields the following:

```
lpa_model = full_datasets$full_dataframe %>%
  as.matrix() %>%
  single_imputation() %>%
  scale() %>%
  estimate_profiles(5)

> table(get_data(lpa_model)$Class)
  1   2   3   4   5
 98  94 153  54 206
```

The cluster assignments in the Stan model are almost all concentrated in cluster 1, whereas they are much more evenly distributed in the `tidyLPA` model.

Any ideas why that may be, or where I may have gone wrong would be much appreciated. Thanks!

PS: Note that I am inferring this posterior using VI, not MCMC.