# Validating a mixed effects model using a new dataset

Hi all,

I’m using a mixed effects model to predict the hemoglobin (Hb) value of blood donors based on their donation histories. The mixed effects model accounts for the random variation between donors that isn’t described by the factors in the data. The simplest model I’ve used is the following:

$$
y_{it} = \alpha + b_i + \mathbf{x}_{it}' \boldsymbol{\beta} + \epsilon_{it},
$$

where $b_i \sim N(0,\sigma_b^2)$ and $\epsilon_{it} \sim N(0,\sigma_{\epsilon}^2)$.

Here $y_{it}$ is the Hb value that I’m trying to predict, $\mathbf{x}_{it}$ contains the explanatory variables for donor $i$ at time $t$, $b_i$ is the donor-specific random intercept and $\epsilon_{it}$ is the error term.
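To make the setup concrete, here is a small simulation of this data-generating process (all sizes and parameter values below are made up for illustration, not taken from the real donor data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and parameters
n_donors, n_visits, K = 50, 8, 3
alpha, beta = 140.0, np.array([-1.5, 0.8, 0.3])
sigma_b, sigma_eps = 6.0, 4.0

b = rng.normal(0.0, sigma_b, size=n_donors)        # donor-specific intercepts b_i
donor = np.repeat(np.arange(n_donors), n_visits)   # donor index for each donation event
x = rng.normal(size=(n_donors * n_visits, K))      # covariates x_it
y = alpha + b[donor] + x @ beta + rng.normal(0.0, sigma_eps, size=donor.size)
```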

I have split my data into two sets: 1) a train/test set and 2) a validation set. I’m using set 1) to estimate the model parameters ($\alpha, \boldsymbol{\beta}, \sigma_{b}$ and $\sigma_{\epsilon}$), and I would like to evaluate the model’s performance on set 2). The problem is that I’m not sure how to use the posterior samples I got from fitting the model to set 1) to make predictions for set 2).

Basically I want to use the data for new donors in set 2) to estimate the random intercepts for these donors, and then use these intercepts along with the previously trained parameters to predict the last Hb value in each donation history. However, I’m not sure of the best way to incorporate the already-trained fixed parameters and the newly sampled parameters into a Bayesian workflow for this problem.
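For the first step, there is a useful closed form worth noting: if the trained parameters are held fixed at point estimates, the conditional posterior of a new donor’s intercept given their earlier donations is normal, with mean equal to the donor’s average residual shrunk toward zero. A minimal sketch, assuming point estimates for $\alpha$, $\boldsymbol{\beta}$ and the sigmas (the function name and this simplification are my own, not the full Bayesian treatment):

```python
import numpy as np

def intercept_posterior(y_i, X_i, alpha, beta, sigma_b, sigma_eps):
    """Closed-form posterior of b_i ~ N(0, sigma_b^2) given one donor's
    earlier donations, treating alpha, beta and the sigmas as known."""
    resid = y_i - alpha - X_i @ beta             # residuals attributable to b_i + noise
    n = len(y_i)
    prec = 1.0 / sigma_b**2 + n / sigma_eps**2   # posterior precision
    post_var = 1.0 / prec
    post_mean = post_var * resid.sum() / sigma_eps**2  # average residual, shrunk toward 0
    return post_mean, np.sqrt(post_var)
```

With few donations the estimate is pulled strongly toward the population mean of 0; as the number of donations grows it approaches the donor’s average residual. Keeping the uncertainty in the trained parameters instead of plugging in point estimates is what the full Bayesian approach would add on top of this.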

At the moment I’m thinking of using the previously sampled parameters to set priors for $\alpha$ and $\boldsymbol{\beta}$, and then fitting the model to set 2) in the same fashion as I did for set 1). For the scale parameters I’m using the means of the sampled sigmas.
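The transformed data block of the validation model below computes these summaries from the full draw matrices; the same summaries could instead be computed outside Stan and passed in as data, which avoids shipping all the draws to the model. A sketch of that, assuming the draws have already been extracted into arrays (e.g. with CmdStanPy’s `fit.stan_variable`):

```python
import numpy as np

def prior_data(alphas, thetas, sigmab_vec, sigmaeps_vec):
    """Collapse posterior draws from the set 1) fit into the summaries
    used as priors in the validation model.
    Shapes: alphas (S,), thetas (S, K), sigmab_vec (S,), sigmaeps_vec (S,)."""
    return {
        "alpha_mu": float(np.mean(alphas)),
        "alpha_sig": float(np.std(alphas, ddof=1)),
        "theta_means": np.mean(thetas, axis=0),
        "theta_sigmas": np.std(thetas, axis=0, ddof=1),
        "sigmab": float(np.mean(sigmab_vec)),
        "sigmaeps": float(np.mean(sigmaeps_vec)),
    }
```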

Here is my Stan code for this validation:

```stan
data {
  // Train data
  int<lower=1> N;                   // Number of donation events
  int<lower=1> Ndon;                // Number of donors in our data
  int<lower=1> K;                   // Number of covariates
  // QR reparameterization
  matrix[N, K] Q_star;              // Q matrix
  matrix[K, K] R_star;              // R matrix
  matrix[K, K] R_star_inv;
  int<lower=1,upper=Ndon> donor[N]; // Donor identifier
  vector[N] Hb;                     // Hemoglobin values for all events except the last one

  // Test data
  int<lower=1> Ntest;               // Number of test observations
  matrix[Ntest, K] x_test;          // Test data
  int<lower=1,upper=Ndon> test_donor[Ntest]; // Test donor identifier

  // Posterior samples from the set 1) fit
  int<lower=1> Nsamples;            // Number of samples; I ran 4 chains with 1000 sampling iterations
  vector[Nsamples] alphas;
  matrix[Nsamples, K] thetas;
  vector[Nsamples] sigmab_vec;
  vector[Nsamples] sigmaeps_vec;
}
transformed data {
  // Summaries of the set 1) posterior, used to set up the prior distributions
  real alpha_mu = mean(alphas);
  real alpha_sig = sd(alphas);
  real sigmab = mean(sigmab_vec);
  real sigmaeps = mean(sigmaeps_vec);
  vector[K] theta_means;
  vector[K] theta_sigmas;

  for (i in 1:K) {
    theta_means[i] = mean(thetas[, i]);
    theta_sigmas[i] = sd(thetas[, i]);
  }
}
parameters {
  real alpha;
  vector[K] theta;
  vector[Ndon] donb;                // Donor-specific random intercepts for the new donors in set 2)
}
model {
  // Informative priors for alpha and theta from the set 1) posterior
  donb ~ normal(0, sigmab);
  alpha ~ normal(alpha_mu, alpha_sig);
  theta ~ normal(theta_means, theta_sigmas);

  Hb ~ normal(alpha + Q_star * theta + donb[donor], sigmaeps);
}
generated quantities {
  vector[K] beta;
  vector[Ntest] y_pred;             // Predictions for the test events (last donation of each donor)

  beta = R_star_inv * theta;
  for (i in 1:Ntest) {
    y_pred[i] = normal_rng(alpha + x_test[i, ] * beta + donb[test_donor[i]], sigmaeps);
  }
}
```
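Once this model has been run, the `y_pred` draws can be scored against the held-out last Hb values. A sketch of the kind of checks I’d compute (the function and variable names here are mine):

```python
import numpy as np

def evaluate(y_pred_draws, hb_true):
    """Score held-out predictions; y_pred_draws has shape (S, Ntest),
    hb_true holds the true last Hb value for each test donor."""
    point = y_pred_draws.mean(axis=0)                  # posterior predictive mean
    rmse = np.sqrt(np.mean((point - hb_true) ** 2))
    lo, hi = np.percentile(y_pred_draws, [2.5, 97.5], axis=0)
    coverage = np.mean((hb_true >= lo) & (hb_true <= hi))  # 95% interval coverage
    return rmse, coverage
```

RMSE measures the point accuracy of the posterior predictive mean, while the interval coverage checks whether the predictive uncertainty is calibrated (it should be close to 0.95).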