I’m trying to fit what I would call a “growth mixture model” (which I think is sometimes called a “latent class linear mixed model”). A binary outcome is measured at multiple timepoints for multiple participants. Each participant contributes a different number of observations, and the observations are not evenly spaced. The goal is to model trends in the probability of experiencing the binary outcome over time. We believe that there are a finite number of common trajectory classes, but we don’t know which participants fall in which class. So we need to combine mixture modeling (to handle the latent classes) with multilevel modeling (to handle repeated observations on the same participants).

I will try to describe the model likelihood formally: Let i index the participant and j index the observation. Let y_{ij} be the binary outcome and X the covariate matrix (here, the only covariates are polynomials of time). Let k be a categorical variable indicating membership in a particular latent class, and p be a simplex vector of latent class proportions. Let \beta_{ik} be the vector of regression coefficients for participant i if s/he is in class k. Finally, let \gamma_k and \sigma_k be the mean vector and covariance matrix of the multivariate normal from which participants’ \beta_{ik} are drawn if they are in class k. Then:

y_{ij}|k \sim Bernoulli(\mu_{ijk})

logit(\mu_{ijk}) = X_{ij} \beta_{ik}

\beta_{ik} \sim Gaussian(\gamma_{k},\sigma^2_{k})

k \sim Categorical(p)

Now here is some Stan code attempting to fit the model.

```
data {
int<lower=0> I; // number of observations
int<lower=0> J; // number of participants
int<lower=1> K; // number of mixture components
int<lower = 1,upper = J> id[I]; // id vector that maps observations to participants
int<lower=0,upper=1> use[I]; // outcome
matrix[I,4] model_matrix; // time covariates
}
parameters {
simplex[K] lambda; // weight for each component of mixture
matrix[J,4] beta[K]; // each participants’ coefficients, conditional on mixture
vector[4] beta_mean[K]; // grand mean of coefficients, conditional on mixture
cov_matrix[4]<lower=0> beta_variance[K]; // variance of coefficients, conditional on mixture
}
model {
real lps[K];
for(i in 1:I) { // loop over participants
for(k in 1:K) { // loop over mixture components
las[k] = log(lambda[k]);
lps[k] += bernoulli_logit_lpmf(use[i]|model_matrix[i]*beta[k][id[i]]) +
normal_lpdf(beta[k][id[i]]|beta_mean,beta_variance);
// add log probability of mixture
}
target += log_sum_exp(lps); // update log probability of model by combining both mixture components
}
}
```

My questions:

- Does this actually fit the model I described? I can’t quite figure out how to loop over the participants, or correctly marginalize out the latent classes.
- Is there a way to make this code more efficient?
- Are there smart, non-informative prior choices? (Right now, I think the code gives everything a flat prior by default?)

This is my first Stan project, so thanks in advance for your guidance and your patience.

(P.S. If anyone knows any Stan users at Johns Hopkins, please let me know.)