Random effects model with B-splines: how to generate predictions for test data?

Hi everyone,

This is my first post here and I hope I am doing it right.

So, I am trying to model longitudinal patient data with a random effects model using B-splines. My dataset contains biomarker measurements for patients, where each patient had several measurements over a period of time. My starting point was this great blog post, and my current model looks like this:

data {
    
    int<lower=0> N_rows;                               // number of rows
    int<lower=0> N_patients;                          // number of patients
    vector[N_rows] y;                                      // outcome
    vector[N_rows] t;                                      // time
    int<lower=1> id[N_rows];                         // patient IDs
    int num_basis;
    matrix[num_basis, N_rows] B;                 // B-splines matrix
}

parameters {
    
    row_vector[num_basis] a_raw;
    real theta0;  // intercept
    vector[N_patients] eta0;
    vector[N_patients] eta1;
    real<lower=0> sigma0;
    real<lower=0> sigma1;
    real<lower=0> sigma;
    real<lower=0> tau;
}

transformed parameters {
    row_vector[num_basis] a; // spline coefficients
    a = a_raw*tau;
}

model {
    
    vector[N_rows] mu_n;
    
    // Priors
    a_raw ~ normal(0, 1);
    theta0 ~ normal(0, 1);
    sigma0 ~ normal(0, 1);
    sigma1 ~ normal(0, 1);
    eta0 ~ normal(0, sigma0);
    eta1 ~ normal(0, sigma1);
    tau ~ normal(0, 1);
    sigma ~ normal(0, 2.5);
    

    //Likelihood
    for(n in 1:N_rows) {
        mu_n[n] = theta0 + eta0[id[n]] + dot_product(a, col(B, n)) + eta1[id[n]] * t[n];
        y[n]       ~ normal(mu_n[n], sigma);
    }
}

generated quantities {
    
    vector[N_rows] predictions;
    vector[N_rows] pred_mu;
    
    for(n in 1:N_rows) {
        pred_mu[n] = (theta0 + eta0[id[n]]) + dot_product(a, col(B, n)) + eta1[id[n]] * t[n];
        predictions[n] = normal_rng(pred_mu[n], sigma);
    }
}

Or, to try to put it into mathematical notation:
\mu_i = \theta_0 + \eta_{i0} + a^T B_{i:} + \eta_{i1} t_i + \epsilon_i,
y \sim \mathcal{N}(\mu_i, \sigma),
where B represents a natural cubic spline matrix of t.

In generated quantities I am able to get predictions for the train data, however, I am wondering how would I get predictions for new patients under the assumption that I already have some of their biomarker measurements available (for example: let’s say I have 5 measurements from first 2 years and I want to predict what happens next 2 years). Since the random effect parameters \eta are patient-specific I am having difficulties to understand how to generate those for completely new patients.

I know there are some libraries offering those functionalities (JMbayes if I am not mistaken), however, I would like to understand how to do it myself.

Any help will be appreciated!

Welcome to the Stan forum!

You’re almost there. Importantly, you avoided the usual pitfall of not accounting for the sampling uncertainty due to \sigma. For random effects, you just draw new random effects in the generated quantities block according to their prior and plug them in. This simulates drawing a new item from the population. If you want to work out the math, you can show that this marginalizes out the choice of the new parameter (that’s what you get in MCMC from introducing a parameter, giving it the right distribution, then ignoring it in the output).

For more information, there’s the chapter of the User’s Guide on posterior predictive sampling:

You might also want to check out Andrew’s and my JRSS C paper on Covid, where we use this trick with some measurements to predict sensitivity in a new test site:

Thank you for your answer and a warm welcome to the community!

If I understand correctly, drawing new random effects is what we would do if we had no data available on the new patients. However, in my case there will be some measurements available (for example: let’s say I have 5 measurements from the first 2 years and I want to predict what happens next 2 years), so I should probably condition on it?

I think the solution is in these slides, pdf page 200 - section 6.2 Longitudinal outcomes predictions (cont’d), where in step 1 they draw the population-level parameters, and in step 2 they draw the new random effects conditional on available observations and population-level parameters. But I am having difficulties translating this into Stan code.

1 Like