Validating a mixed effects model using a new dataset

Hi all,

I’m using a mixed effects model to predict the hemoglobin (Hb) value of blood donors based on their donation histories. The mixed effects model accounts for the random variation between donors that isn’t described by the factors in the data. The simplest model I’ve used is the following:

y_{it} = \alpha + b_i + \mathbf{x_{it}'} \boldsymbol{\beta} + \epsilon_{it} ,

where b_i \sim N(0,\sigma_b^2) \ \mbox{ and } \ \epsilon_{it} \sim N(0,\sigma_{\epsilon}^2)

Here y_{it} is the Hb value that I’m trying to predict, \mathbf{x_{it}} are the explanatory variables for donor i at time t, b_i is the donor-specific random intercept and \epsilon_{it} is the error term.
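For concreteness, here is a small Python simulation of this data-generating process (the sizes and parameter values below are made up for illustration, not my real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and parameters, not the real data
Ndon, nper, K = 50, 5, 3                  # donors, donations per donor, covariates
alpha, beta = 140.0, np.array([1.5, -0.8, 0.3])
sigma_b, sigma_eps = 5.0, 3.0

donor = np.repeat(np.arange(Ndon), nper)  # donor index for each donation event
x = rng.normal(size=(Ndon * nper, K))     # explanatory variables x_it
b = rng.normal(0.0, sigma_b, size=Ndon)   # donor-specific random intercepts b_i
eps = rng.normal(0.0, sigma_eps, size=Ndon * nper)

# y_it = alpha + b_i + x_it' beta + eps_it
y = alpha + b[donor] + x @ beta + eps
```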

I have split my data into two sets: 1) a train/test set and 2) a validation set. I’m using set 1) to train the model parameters (\alpha, \beta, \sigma_{b} \mbox{ and } \sigma_{\epsilon}) and I would like to evaluate the model’s performance on set 2). The problem is that I’m not sure how to use the sampled parameters I got from fitting the model on set 1) to make predictions for set 2).

Basically, I want to use the data for the new donors in set 2) to estimate the random intercepts for these donors, and then use those intercepts along with the previously trained parameters to predict the last Hb value in each donor’s donation history. However, I’m not sure of the best way to incorporate the already trained fixed parameters and the newly sampled parameters into a Bayesian workflow here.

At the moment I’m thinking of using the previously sampled parameters to set informative priors for \alpha \mbox{ and } \beta and then fitting the model on set 2) in the same fashion as for set 1). For the scale parameters \sigma_b \mbox{ and } \sigma_{\epsilon} I’m plugging in the means of the sampled values as fixed constants.
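In Python terms, the prior hyperparameters I pass in (equivalently, what the transformed data block computes) are just these summaries of the set-1 posterior draws (a sketch; the array names are mine):

```python
import numpy as np

def prior_summaries(alphas, thetas, sigmab_vec, sigmaeps_vec):
    """Summaries of set-1 posterior draws used as prior hyperparameters.

    alphas: (Nsamples,) draws of alpha; thetas: (Nsamples, K) draws of theta;
    sigmab_vec, sigmaeps_vec: (Nsamples,) draws of the two scale parameters.
    """
    return {
        "alpha_mu": alphas.mean(),
        "alpha_sig": alphas.std(ddof=1),            # Stan's sd() divides by n - 1
        "theta_means": thetas.mean(axis=0),
        "theta_sigmas": thetas.std(axis=0, ddof=1),
        "sigmab": sigmab_vec.mean(),                # point estimate used as a fixed scale
        "sigmaeps": sigmaeps_vec.mean(),
    }
```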

Here is my Stan code for this validation:


data {
    // Train data
    int<lower=1> N; // Amount of donation events
    int<lower=1> Ndon; // Amount of donors in our data
    int<lower=1> K; // Amount of covariates
    // QR-reparametrization
    matrix[N, K] Q_star; // Q_matrix
    matrix[K, K] R_star; // R_matrix
    matrix[K, K] R_star_inv;
    int<lower=1,upper=Ndon> donor[N]; // Donor identifier
    vector[N] Hb; // Hemoglobin values for other events than the last one

    // Test data
    int<lower=1> Ntest; // Amount of test observations
    matrix[Ntest, K] x_test; // Test data
    int<lower=1, upper=Ndon> test_donor[Ntest]; // Test donor identifier

    // Model parameters
    int<lower=1> Nsamples; // Amount of train samples. I ran 4 chains with 1000 sampling iterations
    vector[Nsamples] alphas;
    matrix[Nsamples, K] thetas;
    vector[Nsamples] sigmab_vec;
    vector[Nsamples] sigmaeps_vec;
}
transformed data {
    // Transformed values to set up prior distributions
    real alpha_mu = mean(alphas);
    real alpha_sig = sd(alphas);
    real sigmab = mean(sigmab_vec);
    real sigmaeps = mean(sigmaeps_vec);
    vector[K] theta_means;
    vector[K] theta_sigmas;

    for (i in 1:K) {
        theta_means[i] = mean(thetas[,i]);
        theta_sigmas[i] = sd(thetas[,i]);
    }
}
parameters {
    real alpha;
    vector[K] theta;
    vector[Ndon] donb; // Donor specific random intercept for new donors in set 2)
}
model {
    // Set informative priors for alpha and theta
    donb ~ normal(0, sigmab);
    alpha ~ normal(alpha_mu, alpha_sig);
    theta ~ normal(theta_means, theta_sigmas);

    Hb ~ normal(alpha + Q_star * theta + donb[donor], sigmaeps);
}
generated quantities {
    vector[K] beta;
    vector[Ntest] y_pred; // Predict the values for test events (last donations for each donor)

    beta = R_star_inv * theta;
    for (i in 1:Ntest) {
        y_pred[i] = normal_rng(alpha + x_test[i,] * beta + donb[test_donor[i]], sigmaeps);
    }
}
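For reference, since the model is linear-Gaussian, the conditional posterior of a new donor’s intercept given one draw of (\alpha, \beta, \sigma_b, \sigma_{\epsilon}) has a closed form, so a draw-by-draw alternative to the program above could be sketched like this in Python (a hypothetical helper, not code I’m actually running):

```python
import numpy as np

def predict_last_hb(y_hist, X_hist, x_last, alphas, betas, sigmab_vec, sigmaeps_vec, rng):
    """Posterior-predictive draws for a new donor's last Hb value.

    y_hist: (n_i,) earlier Hb values for this donor
    X_hist: (n_i, K) covariates for those donations
    x_last: (K,) covariates of the donation to predict
    alphas, betas, sigmab_vec, sigmaeps_vec: posterior draws from the set-1 fit
    """
    preds = np.empty(len(alphas))
    for s in range(len(alphas)):
        a, b, sb, se = alphas[s], betas[s], sigmab_vec[s], sigmaeps_vec[s]
        r = y_hist - a - X_hist @ b                      # residuals given this draw
        v = 1.0 / (1.0 / sb**2 + len(y_hist) / se**2)    # conditional posterior variance of b_i
        m = v * r.sum() / se**2                          # conditional posterior mean of b_i
        bi = rng.normal(m, np.sqrt(v))                   # draw the donor intercept
        preds[s] = rng.normal(a + x_last @ b + bi, se)   # posterior predictive draw
    return preds
```

This would avoid collapsing the fixed parameters to point estimates, at the cost of ignoring any feedback from the new donors’ data to those parameters (a “cut”-style approximation).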

To conclude, here are the main questions I have about this approach:

  1. Can this be considered model validation if I use the sampled values of \alpha \mbox{ and } \beta to set informative priors instead of using fixed values?
  2. If not, how should I use the distributions of \alpha \mbox{ and } \beta to effectively train the random intercepts? Using point estimates would seem like a crude approach.
  3. Should I use the previously sampled \sigma_b for these new donors, or should I use a weakly informative prior like I had when training the model?

Many thanks in advance!