Finite mixture regression models with varying intercepts and slopes

Hi!

I am working on a research project in which two treatments are evaluated in a randomized clinical trial and the outcome is measured at three time points (before the intervention, right after the intervention, and at the one-month follow-up).

I want to evaluate the differential treatment effect on the slope, similar to the CD4 model in the ARM book (Chapter 20, Section 5):

y_i \sim \mathrm{N}(\alpha_{j[i]} + \beta_{j[i]}\,\mathrm{time}_i,\ \sigma_y)

with \alpha_j and \beta_j modeled by a multivariate normal with mean parameters \gamma^{\alpha}_0 and \gamma^{\beta}_0 + \gamma^{\beta}_1 t_j, in which t_j is a binary treatment indicator and \mathrm{time}_i is the measurement time for observation i.
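
Spelled out, that second level is (as in the ARM formulation):

\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \sim \mathrm{N}\!\left( \begin{pmatrix} \gamma^{\alpha}_0 \\ \gamma^{\beta}_0 + \gamma^{\beta}_1 t_j \end{pmatrix}, \begin{pmatrix} \sigma^2_{\alpha} & \rho\,\sigma_{\alpha}\sigma_{\beta} \\ \rho\,\sigma_{\alpha}\sigma_{\beta} & \sigma^2_{\beta} \end{pmatrix} \right)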

So far, so good: I can fit such a model with ease. But I hypothesize that there are two latent classes: some subjects have better results with treatment 1, and others have better results with treatment 2. A finite mixture model is the obvious choice, but I couldn’t find a model in the literature that corresponds exactly to my problem.

At first, I implemented a finite mixture regression model without the hierarchical component; the model definition was quite clear, and Stan recovered all parameters beautifully.

But when the model has two levels, how should the mixture be defined? Vermunt and Magidson (https://jeroenvermunt.nl/gfkl2004a.pdf) suggest a mixture at both levels, but I don’t think it makes sense, in my case, to consider a mixture at the single-measurement level.

So far, I have implemented a model (code below) in which the mixture component applies only to the second level (the varying intercepts and slopes), and it also seems to recover all parameters correctly. But I would really like some input on how to better model this problem. Papers about similar models are welcome!
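
In equations, the mixture in the code below applies only to the subject-level pairs:

p(\alpha_j, \beta_j) = \theta \, \mathrm{N}\!\left( (\alpha_j, \beta_j) \mid \mu^{(1)}_j, \Sigma^{(1)} \right) + (1 - \theta) \, \mathrm{N}\!\left( (\alpha_j, \beta_j) \mid \mu^{(2)}_j, \Sigma^{(2)} \right)

with class-specific means \mu^{(k)}_j = \left( \gamma^{\alpha}_{0k},\ \gamma^{\beta}_{0k} + \gamma^{\beta}_{1k} t_j \right) and covariances \Sigma^{(k)}, while the first level y_i \sim \mathrm{N}(\alpha_{j[i]} + \beta_{j[i]}\,\mathrm{time}_i,\ \sigma_y) stays the same.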

Thanks!
Erikson

data {
  int<lower=1> N;     // number of observations (J subjects x 3 time points)
  int<lower=1> J;     // number of subjects
  vector[J] t;        // treatment indicator per subject (0/1)
  vector[N] time;     // measurement time per observation
  vector[N] y;        // outcome
  int id[N];          // subject index per observation
}

parameters {
  // Group model (latent class membership)
  real<lower=0, upper=1> theta;      // mixing proportion

  // Intercept model
  vector[2] muAlpha;                 // component-specific intercept means
  vector<lower=0>[2] sigmaAlpha;     // component-specific intercept SDs
  vector[J] alpha;                   // subject-level intercepts

  // Slope model
  ordered[2] muBeta;                 // component-specific slope means (ordered to identify the components)
  ordered[2] tau;                    // component-specific treatment effects on the slope
  vector<lower=0>[2] sigmaBeta;      // component-specific slope SDs
  vector[J] beta;                    // subject-level slopes

  corr_matrix[2] Omega0;             // intercept/slope correlation, component 1
  corr_matrix[2] Omega1;             // intercept/slope correlation, component 2

  real<lower=0> sigma;               // residual SD
}

model {
  theta ~ beta(5, 5);

  muAlpha ~ normal(0, 10);
  tau ~ normal(0, 10);
  sigmaAlpha ~ lognormal(0, 0.3);
  
  muBeta ~ normal(0, 10);
  sigmaBeta ~ lognormal(0, 0.3);
  
  sigma ~ student_t(5, 0, 3);
  
  Omega0 ~ lkj_corr(3);
  Omega1 ~ lkj_corr(3);
  
  {
    matrix[J, 2] regPars;
    matrix[J, 2] mu0;
    matrix[J, 2] mu1;
    vector[2] sigma0;
    vector[2] sigma1;
    
    regPars[, 1] = alpha;
    regPars[, 2] = beta;
    
    mu0[, 1] = rep_vector(muAlpha[1], J);
    mu0[, 2] = muBeta[1] + tau[2] * t;
    
    mu1[, 1] = rep_vector(muAlpha[2], J);
    mu1[, 2] = muBeta[2] + tau[1] * t;
    
    sigma0[1] = sigmaAlpha[1];
    sigma0[2] = sigmaBeta[1];
    
    sigma1[1] = sigmaAlpha[2];
    sigma1[2] = sigmaBeta[2];
    
    // Marginalize over latent class membership for each subject
    for (j in 1:J) {
      target += log_mix(theta,
                        multi_normal_lpdf(regPars[j, ] | mu0[j, ],
                                          quad_form_diag(Omega0, sigma0)),
                        multi_normal_lpdf(regPars[j, ] | mu1[j, ],
                                          quad_form_diag(Omega1, sigma1)));
    }
  }
  
  // First level: measurement model
  y ~ normal(alpha[id] + beta[id] .* time, sigma);

}

A well-written question! I would, however, be skeptical about the possibility of finding a “best” or even “universally good” model for a problem by theoretical speculation. If I understand it correctly, you have already gathered the data? If so, my recommendation would be to run some posterior predictive (PP) checks for the model without the mixture. If you don’t find any discrepancies in the PP checks, a more complex model will likely only overfit. If you do find discrepancies, those should motivate how to expand the model (i.e., how the mixture should behave).

A concrete example: you could use your model to predict the trajectories of all patients. Visualize the posterior and actual trajectories separately for each treatment (or for each patient). If the fitted posterior trajectories show a narrow range of slopes, but the actual trajectories have a wide range of slopes and frequently go “outside” the posterior trajectories (or, even better, if there are clusters of slopes), this might mean a mixture for the slopes is justified.
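
For instance (a rough sketch, assuming your non-mixture model keeps the same variable names as the code you posted), the replicated measurements can be generated directly in generated quantities and then plotted against the actual trajectories:

generated quantities {
  // Replicated outcome for each observed measurement, to overlay on the
  // actual trajectories per patient or per treatment.
  vector[N] y_rep;
  for (n in 1:N)
    y_rep[n] = normal_rng(alpha[id[n]] + beta[id[n]] * time[n], sigma);
}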

Hope that does make sense to you :-)

As for the implementation, there are likely some speedups to be had using multi_normal_cholesky and lkj_corr_cholesky (I believe the manual has some examples).


Thank you for your suggestion, Martin! Your idea of using posterior predictive checks to evaluate whether the mixture model is necessary sounds great; I hadn’t thought about that. I want to compare the mixture to the simpler model using LOO and cross-validation (mean squared error of the prediction at a fourth time point, not included in the data used to compute the posterior), but your idea sounds even more informative (e.g., checking whether the slopes cluster together).
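
For the LOO comparison, I am thinking of simply adding the pointwise log-likelihood in generated quantities; a minimal sketch for the non-mixture model, reusing the variable names from the code above:

generated quantities {
  // Pointwise log-likelihood per observation, for use with the loo package
  // (sketch for the non-mixture model, where sigma is a scalar).
  vector[N] log_lik;
  for (n in 1:N)
    log_lik[n] = normal_lpdf(y[n] | alpha[id[n]] + beta[id[n]] * time[n], sigma);
}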

Actually, the data haven’t been collected yet. I am designing a pragmatic randomized trial, but I want to make sure I understand the model I am proposing to evaluate it, and also to use fake data to decide on the sample size.
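
For the fake-data step, something like the following fixed_param simulator could serve as a starting point (all parameter values are placeholders, and the intercept-slope correlation is omitted for brevity):

data {
  int<lower=1> J;                  // number of subjects
  vector[3] time;                  // measurement times
  real<lower=0, upper=1> theta;    // probability of latent class 1
  vector[2] muAlpha;               // class-specific intercept means
  vector[2] muBeta;                // class-specific slope means
  vector[2] tau;                   // class-specific treatment effects on the slope
  real<lower=0> sigmaAlpha;        // SD of intercepts around the class mean
  real<lower=0> sigmaBeta;         // SD of slopes around the class mean
  real<lower=0> sigma;             // residual SD
}
generated quantities {
  int t[J];                        // simulated treatment assignment
  int class_id[J];                 // simulated latent class membership
  matrix[J, 3] y;                  // simulated outcomes
  for (j in 1:J) {
    real alpha_j;
    real beta_j;
    t[j] = bernoulli_rng(0.5);
    class_id[j] = 2 - bernoulli_rng(theta);   // 1 with probability theta, else 2
    alpha_j = normal_rng(muAlpha[class_id[j]], sigmaAlpha);
    beta_j = normal_rng(muBeta[class_id[j]] + tau[class_id[j]] * t[j], sigmaBeta);
    for (k in 1:3)
      y[j, k] = normal_rng(alpha_j + beta_j * time[k], sigma);
  }
}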

Thank you for your suggestions!

P.S.: Indeed, using the Cholesky implementation of the multivariate normal speeds up the MCMC sampling. In case someone else wants to implement a similar model, my current code for the multilevel mixture is below (besides the Cholesky implementation of the multi_normal, the first-level measurements are now accounted for directly in the log_mix function):


data {
  int<lower=1> N;     // number of observations (J subjects x 3 time points)
  int<lower=1> J;     // number of subjects
  vector[J] t;        // treatment indicator per subject (0/1)
  vector[N] time;     // measurement time per observation
  vector[N] y;        // outcome
  int id[N];          // subject index per observation
  int idx[J, 3];      // indices of the 3 observations belonging to each subject
}

parameters {
  // Group model
  real<lower=0, upper=1> theta;

  // Intercept model
  vector[2] muAlpha;
  vector<lower=0>[2] sigmaAlpha;
  vector[J] alpha;
  
  ordered[2] muBeta;
  ordered[2] tau;
  vector<lower=0>[2] sigmaBeta;
  vector[J] beta;
  
  cholesky_factor_corr[2] Omega0;    // Cholesky factor of the intercept/slope correlation, component 1
  cholesky_factor_corr[2] Omega1;    // Cholesky factor of the intercept/slope correlation, component 2
  
  vector<lower=0>[2] sigma;          // component-specific residual SDs
}

model {
  theta ~ beta(5, 5);

  muAlpha ~ normal(0, 10);
  tau ~ normal(0, 10);
  sigmaAlpha ~ lognormal(0, 0.3);
  
  muBeta ~ normal(0, 10);
  sigmaBeta ~ lognormal(0, 0.3);
  
  sigma ~ student_t(5, 0, 3);
  
  Omega0 ~ lkj_corr_cholesky(3);
  Omega1 ~ lkj_corr_cholesky(3);
  
  {
    matrix[J, 2] regPars;
    matrix[J, 2] mu0;
    matrix[J, 2] mu1;
    vector[2] sigma0;
    vector[2] sigma1;
    
    regPars[, 1] = alpha;
    regPars[, 2] = beta;
    
    mu0[, 1] = rep_vector(muAlpha[1], J);
    mu0[, 2] = muBeta[1] + tau[2] * t;
    
    mu1[, 1] = rep_vector(muAlpha[2], J);
    mu1[, 2] = muBeta[2] + tau[1] * t;
    
    sigma0[1] = sigmaAlpha[1];
    sigma0[2] = sigmaBeta[1];
    
    sigma1[1] = sigmaAlpha[2];
    sigma1[2] = sigmaBeta[2];
    
    // Marginalize over latent class membership; the first-level likelihood for
    // each subject's three measurements now enters the mixture directly.
    for (j in 1:J) {
      vector[3] sub_y;
      vector[3] sub_time;

      sub_y = y[idx[j, ]];
      sub_time = time[idx[j, ]];
      target += log_mix(theta,
                        multi_normal_cholesky_lpdf(regPars[j, ] | mu0[j, ],
                                                   diag_pre_multiply(sigma0, Omega0))
                          + normal_lpdf(sub_y | alpha[j] + beta[j] * sub_time, sigma[1]),
                        multi_normal_cholesky_lpdf(regPars[j, ] | mu1[j, ],
                                                   diag_pre_multiply(sigma1, Omega1))
                          + normal_lpdf(sub_y | alpha[j] + beta[j] * sub_time, sigma[2]));
    }
  }
  
  // The first-level likelihood is now inside log_mix above, so this is no longer needed:
  // y ~ normal(alpha[id] + beta[id] .* time, sigma);

}
 

Just to clear things up: if I understand it correctly, PP checks and LOO/CV/predicting further data points serve different purposes. With LOO or a validation dataset you are evaluating prediction (which is a good thing if that’s your focus). LOO explicitly considers the full distribution of your model, including the tails, and it is quantitative. PP checks, on the other hand, tell you whether there are qualitative problems with your model. It is quite possible to have a model that improves LOO but does not pass (some) PP checks, and vice versa.

That’s great to hear; it certainly is good practice. What I would suggest is that you use both models, see how the required sample size changes, and also test your ability to distinguish between them. This could be done by looking at how LOO changes between the correct and the wrong model on simulated data, or by designing a PP check that reliably shows problems when the simpler model is not good enough.

To be really safe, it might also make sense to write simulators for even more complex or different models (say, varying intercepts drawn from a heavy-tailed distribution, a mixture with more components than you fit, or a skew-normal response) and test how your two simpler models fare against data generated from those other models, to check for robustness.

Best of luck with the trial!


Martin, some feedback on your suggestions. They were really important for calibrating my expectations!

Following your suggestion for a PP check, I simulated data from the posterior predictive distribution of new subjects using the simpler multilevel model without the mixture component, considering a range of differences between the latent classes. Then I computed the simple least-squares slope for each original subject and for the ‘new’ subjects in each simulated dataset. Below is a resulting graph for one set of latent-class values:

The original data produce an obvious bimodal density, which can be quantified by the (negative) excess kurtosis of the original slopes compared to the posterior predictive slopes based on the non-mixture model.
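
In case it helps someone, the replicated slopes can also be computed directly in generated quantities. A rough sketch for the non-mixture model, assuming scalar versions of the parameters (muAlpha, muBeta, tau, sigmaAlpha, sigmaBeta, and a 2x2 correlation matrix Omega; the names and the three measurement times are just placeholders):

generated quantities {
  // Least-squares slope of one simulated *new* subject per existing subject,
  // to compare with the least-squares slopes of the observed subjects.
  vector[J] slope_rep;
  {
    vector[3] tt = [0.0, 1.0, 2.0]';   // assumed measurement times
    vector[3] tc = tt - mean(tt);      // centered times
    real denom = dot_self(tc);
    for (j in 1:J) {
      vector[2] ab = multi_normal_rng(
          [muAlpha, muBeta + tau * t[j]]',
          quad_form_diag(Omega, [sigmaAlpha, sigmaBeta]'));
      vector[3] y_new;
      for (k in 1:3)
        y_new[k] = normal_rng(ab[1] + ab[2] * tt[k], sigma);
      slope_rep[j] = dot_product(tc, y_new - mean(y_new)) / denom;
    }
  }
}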

But if the difference between the latent classes is small, the true model does not provide enough evidence to differentiate between the two models, at least under sample sizes that are compatible with my resources.