# Gaussian Process out-of-sample predictive distribution

I’m trying to expand upon a model I built previously by using Gaussian Processes.

The data that I have is t-distributed that can be grouped hierarchically and modeled with random effects. In addition, the observations of the population can also be grouped together by week.

I’m trying to take my random effects model that makes use of unit-level covariates and group-level covariates and add in a linear term to the mean function of the likelihood that models the time trend.

I was able to write a stan program that I believe achieves this through Gaussian processes. The function gp keyed by week is the sum of a long term trend, a short term trend, an autoregressive component, and gaussian white noise. I believe this program models what I want to achieve correctly. The function gp is fed into the mean likelihood distribution along with the unit and group coefficients.

I’m not entirely sure I’m drawing from the multivariate distribution correctly, i.e. I might not be combining the kernels correctly. Is anyone able to verify?

The program is shown below:

// hierarchical model with non-centered parameterization of mu
data {
int<lower=1> N; //Number of observations
int<lower=1> J; //Number of groups
int<lower=1> K; //Number of group predictors
int<lower=1> O; //Number of observation predictors
int<lower=1> T; //Number of time indexes.
int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
real response[N]; // The outcome of the data generation process
matrix[N,O] observation_data; //covariate data of the observations
matrix[J,K] group_data; //covariate data of the groups
}
transformed data {
real t_gp_vec[T];
real jitter = 1e-12;
for (t in 1:T)
t_gp_vec[t] = t;
}
parameters {
real<lower=2> nu;
real<lower=0> inv_phi;
vector[O] beta;
vector[K] gamma;
vector[J] group_mu_raw;
real<lower=0> sigma_group_mu;

// GP prior parameters
// Noise component
vector[T] t_noise_raw;
real<lower=0> sigma_noise;

// Autoregressive GP Component
vector[T] gp_raw_ar;
real<lower=0> gp_len_ar;
real<lower=0> sigma_gp_ar;

// Long-term Trend GP Component
vector[T] gp_raw_trend_long;
real<lower=0> gp_len_trend_long;
real<lower=0> sigma_gp_trend_long;

// Short-term Trend GP Component
vector[T] gp_raw_trend_short;
real<lower=0> gp_len_trend_short;
real<lower=0> sigma_gp_trend_short;
}
transformed parameters {
vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
vector[N] mu =
group_mu[group_idx] +
observation_data * beta;

real phi = 1 / inv_phi;

vector[T] gp_ar;
vector[T] gp_trend_short;
vector[T] gp_trend_long;
vector[T] gp;
vector[T] t_noise = sigma_noise * t_noise_raw;

{
matrix[T, T] C = gp_matern32_cov(t_gp_vec, sigma_gp_ar, gp_len_ar);
matrix[T, T] L_C;
for (t in 1:T)
C[t,t] += jitter;
L_C = cholesky_decompose(C);
gp_ar = L_C * gp_raw_ar;
}

{
matrix[T, T] C = gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_short, gp_len_trend_short);
matrix[T, T] L_C;
for (t in 1:T)
C[t,t] += jitter;
L_C = cholesky_decompose(C);
gp_trend_short = L_C * gp_raw_trend_short;
}

{
matrix[T, T] C = gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_long, gp_len_trend_long);
matrix[T, T] L_C;
for (t in 1:T)
C[t,t] += jitter;
L_C = cholesky_decompose(C);
gp_trend_long = L_C * gp_raw_trend_long;
}

// gp is sum of monthly noise and the smoothly varying processes and the autoregressive process
gp = t_noise + gp_ar + gp_trend_short + gp_trend_long;
}
model {
nu ~ gamma(2, 0.1);
inv_phi ~ student_t(4, 0, 1);

beta ~ student_t(4, 0, 1);
gamma ~ student_t(4, 0, 1);

group_mu_raw ~ student_t(4, 0, 1);
sigma_group_mu  ~ student_t(4, 0, 1);

// GP priors
// Noise component
sigma_noise ~ normal(0, 1);
t_noise_raw ~ normal(0, 1);

// Autoregressive Component
gp_raw_ar ~ normal(0, 1);
gp_len_ar ~ inv_gamma(5, 5);
sigma_gp_ar ~ normal(0, 1);

// Short-term Trend Component
gp_raw_trend_short ~ normal(0, 1);
gp_len_trend_short ~ inv_gamma(5, 5);
sigma_gp_trend_short ~ normal(0, 1);

// Long-term Trend Component
gp_raw_trend_long ~ normal(0, 1);
gp_len_trend_long ~ inv_gamma(5, 5);
sigma_gp_trend_long ~ normal(0, 1);

// likelihood function
response ~ student_t(nu, mu + gp[t_idx], phi);

}
generated quantities {
real y_rep[N];
for (n in 1:N) {
y_rep[n] = student_t_rng(nu, mu[n] + gp[t_idx[n]], phi);
}
}


However, I am now trying to generate out-of-sample predictions with the model and am running into some difficulty. I was wondering if the below code is the right way to generate those predictions. In particular, I’m not sure I’m handling the GP part correctly.

// hierarchical model with non-centered parameterization of mu
data {
int<lower=1> N; //Number of observations
int<lower=0> J; //Number of groups
int<lower=0> K; //Number of group predictors
int<lower=0> O; //Number of observation predictor
int<lower=1> T; //Number of time indexes.
int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
real response[N]; // The outcome of the data generation process
matrix[N,O] observation_data; //covariate data of the observations
matrix[J,K] group_data; //covariate data of the groups
int<lower=0> max_group_idx; //Highest group index fit in model
int<lower=0> max_t_idx; //Highest time index fit in model

int N_samples; // number of samples drawn from posterior distributions
real nu[N_samples]; // nu parameter from student t likelihood
real phi[N_samples]; // standard deviation parameter from student t likelihood
matrix[N_samples, O] beta; // covariate matrix draws for observations
matrix[N_samples, K] gamma; // covariate matrix draws for groups
matrix[N_samples, max_group_idx] group_mu; //posterior draws of the group mu parameter
real sigma_group_mu[N_samples]; // posterior draws of the hierarchal variance parameter.

// GP posterior samples
// Noise component
matrix[N_samples, max_t_idx] t_noise;
real sigma_noise[N_samples];

// Auto-regressive component
real sigma_gp_ar[N_samples];
real gp_len_ar[N_samples];
matrix[N_samples, max_t_idx] gp_raw_ar;

real sigma_gp_trend_short[N_samples];
real gp_len_trend_short[N_samples];
matrix[N_samples, max_t_idx] gp_raw_trend_short;

real sigma_gp_trend_long[N_samples];
real gp_len_trend_long[N_samples];
matrix[N_samples, max_t_idx] gp_raw_trend_long;
}
transformed data {
real t_gp_vec[T];
real jitter = 1e-12;
for (t in 1:T)
t_gp_vec[t] = t;
}
parameters {
}
model {
}
generated quantities {
matrix[N_samples, N] y_pred;
matrix[N_samples, T] gp;

// generate the gp vector from the posterior samples.
for(s in 1:N_samples) {
vector[T] gp_ar;
vector[T] gp_trend_short;
vector[T] gp_trend_long;

vector[T] gp_raw_ar_full;
vector[T] gp_raw_trend_short_full;
vector[T] gp_raw_trend_long_full;
vector[T] t_noise_full;

for(t in 1:max_t_idx){
gp_raw_ar_full[t] = gp_raw_ar[s, t];
gp_raw_trend_short_full[t] = gp_raw_trend_short[s, t];
gp_raw_trend_long_full[t] = gp_raw_trend_long[s, t];
t_noise_full[t] = t_noise[s, t];
}
for(t in max_t_idx+1:T){
gp_raw_ar_full[t] = normal_rng(0, 1);
gp_raw_trend_short_full[t] = normal_rng(0, 1);
gp_raw_trend_long_full[t] = normal_rng(0, 1);
t_noise_full[t] = normal_rng(0, sigma_noise[s]);
}

{
matrix[T, T] C = gp_matern32_cov(t_gp_vec, sigma_gp_ar[s], gp_len_ar[s]);
matrix[T, T] L_C;
for (t in 1:T)
C[t, t] += jitter;
L_C = cholesky_decompose(C);
gp_ar = L_C * gp_raw_ar_full;
}

{
matrix[T, T] C = gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
matrix[T, T] L_C;
for (t in 1:T)
C[t, t] += jitter;
L_C = cholesky_decompose(C);
gp_trend_short = L_C * gp_raw_trend_short_full;
}

{
matrix[T, T] C = gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);
matrix[T, T] L_C;
for (t in 1:T)
C[t, t] += jitter;
L_C = cholesky_decompose(C);
gp_trend_long = L_C * gp_raw_trend_long_full;
}

gp[s] = to_row_vector(t_noise_full + gp_ar + gp_trend_short + gp_trend_long);
}

for(n in 1:N) {
row_vector[N_samples] group_mu_rep;
row_vector[N_samples] mu;

if (group_idx[n] > max_group_idx) {
for (s in 1:N_samples) {
group_mu_rep[s] = gamma[s] * to_vector(group_data[group_idx[n]]) + sigma_group_mu[s] * student_t_rng(4, 0, 1);
}
}
else {
group_mu_rep = to_row_vector(group_mu[,group_idx[n]]);
}

mu = group_mu_rep + to_row_vector(beta * to_vector(observation_data[n])) + to_row_vector(gp[,t_idx[n]]);

y_pred[,n] = to_vector(student_t_rng(nu, mu, phi));

}
}


Is there any literature on combining multiple kernels to generate out-of-sample data, especially when the observation process is not normally distributed?

UPDATE: here is what I believe to be the correct process now:

// hierarchical model with non-centered parameterization of mu
data {
int<lower=1> N; //Number of observations
int<lower=1> J; //Number of groups
int<lower=1> K; //Number of group predictors
int<lower=1> O; //Number of observation predictors
int<lower=1> T; //Number of time indexes.
int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
real response[N]; // The outcome of the data generation process
matrix[N,O] observation_data; //covariate data of the observations
matrix[J,K] group_data; //covariate data of the groups
}
transformed data {
real t_gp_vec[T];
real jitter = 1e-12;
for (t in 1:T)
t_gp_vec[t] = t;
}
parameters {
real<lower=2> nu;
real<lower=0> inv_phi;
vector[O] beta;
vector[K] gamma;
vector[J] group_mu_raw;
real<lower=0> sigma_group_mu;

// GP prior parameters
// Noise component
vector[T] t_noise_raw;
vector[T] gp_z;
real<lower=0> sigma_noise;

// Autoregressive GP Component
real<lower=0> gp_len_ar;
real<lower=0> sigma_gp_ar;

// Long-term Trend GP Component
real<lower=0> gp_len_trend_long;
real<lower=0> sigma_gp_trend_long;

// Short-term Trend GP Component
real<lower=0> gp_len_trend_short;
real<lower=0> sigma_gp_trend_short;
}
transformed parameters {
vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
vector[N] mu =
group_mu[group_idx] +
observation_data * beta;

real phi = 1 / inv_phi;

vector[T] f;
vector[T] gp;
vector[T] t_noise = sigma_noise * t_noise_raw;

{
matrix[T, T] C;
matrix[T, T] L_C;

C = gp_matern32_cov(t_gp_vec, sigma_gp_ar, gp_len_ar);
for (t in 1:T)
C[t,t] += jitter;
L_C = cholesky_decompose(C);
f = L_C * gp_z;
}

// gp is sum of monthly noise and the smoothly varying processes and the autoregressive process
gp = t_noise + f;
}
model {
nu ~ gamma(2, 0.1);
inv_phi ~ student_t(4, 0, 1);

beta ~ student_t(4, 0, 1);
gamma ~ student_t(4, 0, 1);

group_mu_raw ~ student_t(4, 0, 1);
sigma_group_mu  ~ student_t(4, 0, 1);

// GP priors
// Noise component
gp_z ~ normal(0, 1);
sigma_noise ~ normal(0, 1);
t_noise_raw ~ normal(0, 1);

// Autoregressive Component
gp_len_ar ~ inv_gamma(5, 5);
sigma_gp_ar ~ normal(0, 1);

// Short-term Trend Component
gp_len_trend_short ~ inv_gamma(5, 5);
sigma_gp_trend_short ~ normal(0, 1);

// Long-term Trend Component
gp_len_trend_long ~ inv_gamma(5, 5);
sigma_gp_trend_long ~ normal(0, 1);

// likelihood function
response ~ student_t(nu, mu + gp[t_idx], phi);

}
generated quantities {
real y_rep[N];
for (n in 1:N) {
y_rep[n] = student_t_rng(nu, mu[n] + gp[t_idx[n]], phi);
}
}


and the associated out-of-sample draws:

// hierarchical model with non-centered parameterization of mu
data {
int<lower=1> N; //Number of observations
int<lower=0> J; //Number of groups
int<lower=0> K; //Number of group predictors
int<lower=0> O; //Number of observation predictor
int<lower=1> T; //Number of time indexes.
int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
real response[N]; // The outcome of the data generation process
matrix[N,O] observation_data; //covariate data of the observations
matrix[J,K] group_data; //covariate data of the groups
int<lower=0> max_group_idx; //Highest group index fit in model
int<lower=0> max_t_idx; //Highest time index fit in model

int N_samples; // number of samples drawn from posterior distributions
real nu[N_samples]; // nu parameter from student t likelihood
real phi[N_samples]; // standard deviation parameter from student t likelihood
matrix[N_samples, O] beta; // covariate matrix draws for observations
matrix[N_samples, K] gamma; // covariate matrix draws for groups
matrix[N_samples, max_group_idx] group_mu; //posterior draws of the group mu parameter
real sigma_group_mu[N_samples]; // posterior draws of the hierarchal variance parameter.

// GP posterior samples
// Noise component
matrix[N_samples, max_t_idx] t_noise;
matrix[N_samples, max_t_idx] gp_z;
real sigma_noise[N_samples];

// Auto-regressive component
real sigma_gp_ar[N_samples];
real gp_len_ar[N_samples];

real sigma_gp_trend_short[N_samples];
real gp_len_trend_short[N_samples];

real sigma_gp_trend_long[N_samples];
real gp_len_trend_long[N_samples];
}
transformed data {
real t_gp_vec[T];
real jitter = 1e-12;
for (t in 1:T)
t_gp_vec[t] = t;
}
parameters {
}
model {
}
generated quantities {
matrix[N_samples, N] y_pred;
matrix[N_samples, T] gp;

// generate the gp vector from the posterior samples.
for(s in 1:N_samples) {
vector[T] f;

vector[T] gp_z_full;
vector[T] t_noise_full;

for(t in 1:max_t_idx){
t_noise_full[t] = t_noise[s, t];
gp_z_full[t] = gp_z[s, t];
}
for(t in max_t_idx+1:T){
t_noise_full[t] = normal_rng(0, sigma_noise[s]);
gp_z_full[t] = normal_rng(0, 1);
}

{
matrix[T, T] C;
matrix[T, T] L_C;

C = gp_matern32_cov(t_gp_vec, sigma_gp_ar[s], gp_len_ar[s]);
for (t in 1:T)
C[t, t] += jitter;
L_C = cholesky_decompose(C);
f = L_C * gp_z_full;
}

gp[s] = to_row_vector(t_noise_full + f);
}

for(n in 1:N) {
row_vector[N_samples] group_mu_rep;
row_vector[N_samples] mu;

if (group_idx[n] > max_group_idx) {
for (s in 1:N_samples) {
group_mu_rep[s] = gamma[s] * to_vector(group_data[group_idx[n]]) + sigma_group_mu[s] * student_t_rng(4, 0, 1);
}
}
else {
group_mu_rep = to_row_vector(group_mu[,group_idx[n]]);
}

mu = group_mu_rep + to_row_vector(beta * to_vector(observation_data[n])) + to_row_vector(gp[,t_idx[n]]);

y_pred[,n] = to_vector(student_t_rng(nu, mu, phi));

}
}


I think I can say that my approach in the training part is actually wrong.

The Gaussian Process gp with the above covariance matrices is distributed as so

\textbf{gp} \sim N (\textbf{0}, \Sigma_1 + \Sigma_2 + \Sigma_3) .

Note that if \Sigma_1 + \Sigma_2 + \Sigma_3 = L L^T for some matrix L, then we can say that

\textbf{gp} = \textbf{0} + L \textbf{Z}, \ Z_i \sim N(0, 1),

i.e. I should be taking the sum of the covariance matrices and then taking the cholesky decomposition of the sum and multiplying by the iid Gaussian RVs.

I am still slightly confused by the out-of-sample generation though.I think can repeat this approach with out-of-sample predictions by using the posterior estimates for length-scale and amplitude to estimate the covariance matrix sum, take the cholesky decomposition and just generate iid Gaussian RVs for both the training and out-of-sample time periods.

Is this correct?

1 Like

@anon79882417, do you have any experience with this topic of out-of-sample predictive distribution draws?

1 Like

I found some literature here: http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

Eq 2.22 gives the analytical form of the posterior predictive distribution. Following that made it obvious what the predictive step should be.

For posterity, here are the updated train/prediction steps:

train:

// hierarchical model with non-centered parameterization of mu
data {
int<lower=1> N; //Number of observations
int<lower=1> J; //Number of groups
int<lower=1> K; //Number of group predictors
int<lower=1> O; //Number of observation predictors
int<lower=1> T; //Number of time indexes.
int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
real response[N]; // The outcome of the data generation process
matrix[N,O] observation_data; //covariate data of the observations
matrix[J,K] group_data; //covariate data of the groups
}
transformed data {
real t_gp_vec[T];
real jitter = 1e-12;
for (t in 1:T)
t_gp_vec[t] = t;
}
parameters {
real<lower=2> nu;
real<lower=0> inv_phi;
vector[O] beta;
vector[K] gamma;
vector[J] group_mu_raw;
real<lower=0> sigma_group_mu;

// GP prior parameters
// Noise component
vector[T] t_noise_raw;
vector[T] gp_z;
real<lower=0> sigma_noise;

// Autoregressive GP Component
real<lower=0> gp_len_ar;
real<lower=0> sigma_gp_ar;

// Long-term Trend GP Component
real<lower=0> gp_len_trend_long;
real<lower=0> sigma_gp_trend_long;

// Short-term Trend GP Component
real<lower=0> gp_len_trend_short;
real<lower=0> sigma_gp_trend_short;
}
transformed parameters {
vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
vector[N] mu =
group_mu[group_idx] +
observation_data * beta;

real phi = 1 / inv_phi;

vector[T] f;
vector[T] gp;
vector[T] t_noise = sigma_noise * t_noise_raw;

{
matrix[T, T] C;
matrix[T, T] L_C;

C = gp_matern32_cov(t_gp_vec, sigma_gp_ar, gp_len_ar);
for (t in 1:T)
C[t,t] += jitter;
L_C = cholesky_decompose(C);
f = L_C * gp_z;
}

// gp is sum of monthly noise and the smoothly varying processes and the autoregressive process
gp = t_noise + f;
}
model {
nu ~ gamma(2, 0.1);
inv_phi ~ student_t(4, 0, 1);

beta ~ student_t(4, 0, 1);
gamma ~ student_t(4, 0, 1);

group_mu_raw ~ student_t(4, 0, 1);
sigma_group_mu  ~ student_t(4, 0, 1);

// GP priors
// Noise component
gp_z ~ normal(0, 1);
sigma_noise ~ normal(0, 1);
t_noise_raw ~ normal(0, 1);

// Autoregressive Component
gp_len_ar ~ inv_gamma(5, 5);
sigma_gp_ar ~ normal(0, 1);

// Short-term Trend Component
gp_len_trend_short ~ inv_gamma(5, 5);
sigma_gp_trend_short ~ normal(0, 1);

// Long-term Trend Component
gp_len_trend_long ~ inv_gamma(5, 5);
sigma_gp_trend_long ~ normal(0, 1);

// likelihood function
response ~ student_t(nu, mu + gp[t_idx], phi);

}
generated quantities {
real y_rep[N];
for (n in 1:N) {
y_rep[n] = student_t_rng(nu, mu[n] + gp[t_idx[n]], phi);
}
}


prediction:

// hierarchical model with non-centered parameterization of mu
data {
int<lower=1> N; //Number of observations
int<lower=0> J; //Number of groups
int<lower=0> K; //Number of group predictors
int<lower=0> O; //Number of observation predictor
int<lower=1> T; //Number of time indexes.
int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
matrix[N,O] observation_data; //covariate data of the observations
matrix[J,K] group_data; //covariate data of the groups
int<lower=0> max_group_idx; //Highest group index fit in model
int<lower=0> max_t_idx; //Highest time index fit in model
real response[N]; // The outcome of the data generation process
real response_by_t[max_t_idx]; //observed average for gaussian process.

int N_samples; // number of samples drawn from posterior distributions
real nu[N_samples]; // nu parameter from student t likelihood
real phi[N_samples]; // standard deviation parameter from student t likelihood
matrix[N_samples, O] beta; // covariate matrix draws for observations
matrix[N_samples, K] gamma; // covariate matrix draws for groups
matrix[N_samples, max_group_idx] group_mu; //posterior draws of the group mu parameter
real sigma_group_mu[N_samples]; // posterior draws of the hierarchal variance parameter.

// GP posterior samples
// Noise component
matrix[N_samples, max_t_idx] t_noise;
matrix[N_samples, max_t_idx] gp_z;
real sigma_noise[N_samples];

// Auto-regressive component
real sigma_gp_ar[N_samples];
real gp_len_ar[N_samples];

real sigma_gp_trend_short[N_samples];
real gp_len_trend_short[N_samples];

real sigma_gp_trend_long[N_samples];
real gp_len_trend_long[N_samples];
}
transformed data {
real t1_vec[max_t_idx];
real t2_vec[T - max_t_idx];
real jitter = 1e-12;
for (t in 1:max_t_idx)
t1_vec[t] = t;
for (t in 1:T - max_t_idx)
t2_vec[t] = t + max_t_idx;
}
parameters {
}
model {
}
generated quantities {
matrix[N_samples, N] y_pred;
matrix[N_samples, T] gp;

// generate the gp vector from the posterior samples.
for(s in 1:N_samples) {
vector[max_t_idx] f1;
vector[T - max_t_idx] f2;

{
matrix[max_t_idx, max_t_idx] C_t1;
matrix[max_t_idx, T - max_t_idx] C_t1_t2;
matrix[T - max_t_idx, max_t_idx] C_t2_t1;
matrix[T - max_t_idx, T - max_t_idx] C_t2;

matrix[T - max_t_idx, max_t_idx] A;

C_t1 = gp_matern32_cov(t1_vec, sigma_gp_ar[s], gp_len_ar[s]);

C_t1_t2 = gp_matern32_cov(t1_vec, t2_vec, sigma_gp_ar[s], gp_len_ar[s]);
C_t1_t2 += gp_exp_quad_cov(t1_vec, t2_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);

C_t2_t1 = gp_matern32_cov(t2_vec, t1_vec, sigma_gp_ar[s], gp_len_ar[s]);
C_t2_t1 += gp_exp_quad_cov(t2_vec, t1_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
C_t2_t1 += gp_exp_quad_cov(t2_vec, t1_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

C_t2 = gp_matern32_cov(t2_vec, sigma_gp_ar[s], gp_len_ar[s]);

if (size(t2_vec) > 0) {
A = C_t2_t1 * inverse(C_t1 + diag_matrix(rep_vector(square(sigma_noise[s]), max_t_idx)));
f2 = multi_normal_rng(A * to_vector(response_by_t), C_t2 - A * C_t1_t2);
}

f1 = multi_normal_rng(rep_vector(0, max_t_idx), C_t1) + normal_rng(0, sigma_noise[s]);

gp[s] = to_row_vector(append_row(f1, f2));
}
}

for(n in 1:N) {
row_vector[N_samples] group_mu_rep;
row_vector[N_samples] mu;

if (group_idx[n] > max_group_idx) {
for (s in 1:N_samples) {
group_mu_rep[s] = gamma[s] * to_vector(group_data[group_idx[n]]) + sigma_group_mu[s] * student_t_rng(4, 0, 1);
}
}
else {
group_mu_rep = to_row_vector(group_mu[,group_idx[n]]);
}

mu = group_mu_rep + to_row_vector(beta * to_vector(observation_data[n])) + to_row_vector(gp[,t_idx[n]]);

y_pred[,n] = to_vector(student_t_rng(nu, mu, phi));

}
}

2 Likes

Great that you were able to code the predictions based on GPML book.

Stan User’s Guide section 10.3 subsubsection " Analytical Form of Joint Predictive Inference" has also the equations and Stan code https://mc-stan.org/docs/2_23/stan-users-guide/fit-gp-section.html#predictive-inference-with-a-gaussian-process. The code shown in Stan User’s Guide should be numerically more stable as mdivide_left_tri_low and mdivide_right_tri_low are used instead of inverse.

2 Likes

@avehtari

Am I correct in using the mean of the response by each week as data to use for the observed values of the Gaussian Process?

Meaning, each time index t is associated to more than one observation. The Gaussian Process would actually be operating on the mean of these observations during training and is what I should be using for response_by_t in the predictive distribution, right?

Yeah, you can sum covariance matrices. There’s an analytic (but not computational) example in BDA3 in the GP section on the birthday problem. If you’re not able to verify with simulations, I can send by some code and simulated data as an example. It’s always recommended to recover parameter or functions from simulated data, too.

This would equate to having one of your $K_i$’s above being a linear kernel (gp_dot_prod_cov) which I’ve yet to write docs for. Thanks for the reminder.

Hope this works out!

Hello @anon79882417, I’m confused by your statement regarding gp_dot_prod_cov.

I was following along with this tutorial: http://htmlpreview.github.io/?https://github.com/lauken13/Beginners_Bayes_Workshop/blob/master/stancon2019-intro.html.

Near the end, they build up the model to be a hierarchical model with gaussian process for the time component. Each time index has multiple observations associated to it. The model is found here:

Is this modeled correctly and if so, does the GP act on the mean of the observations for the time index?

We have to be careful with our language here. Carefully specifying what language we use, particularly using the language of mathematics, permits more effective communication. A statement like “act on” is not mathematically well defined.

I’ll set a reminder later this week to respond but I’ll focus on answering your questions in the initial post.

These seem to be: combining kernels, proper implementation in Stan and generating from the posterior predictive of a GP with a non-Gaussian likelihood.

I’ll provide R code and simulated data.

1 Like

More precisely, I have data shaped like the below:

observation_number group_index time_index unit_level_covariate_1 unit_level_covariate_2 group_level_covariate_1 response
1 1 1 -0.1 0.2 0.8 y_1
2 1 1 0.4 -0.6 0.8 y_2
3 2 1 0.34 -0.45 0.5 y_3
4 2 1 -0.25 0.3 0.5 y_4
5 2 1 0.24 5 0.5 y_5
6 3 1 -0.64 5 0.2 y_6
7 1 2 0.2 0.2 0.8 y_7
8 1 2 0.6 0.35 0.8 y_8
9 1 2 0.8 0.54 0.8 y_9
10 2 2 0.9 0.6 0.5 y_{10}
11 2 2 0.2 0.5 0.5 y_{11}
12 4 2 0.1 0.5 0.4 y_{12}
13 4 2 0.6 -0.3 0.4 y_{13}
14 5 2 0.7 0.9 0.1 y_{14}

What the unit-level and group-level covariates are and their values are immaterial; it suffices to know that they have been standardized to be on unit scale. The same goes for the response variable y_i.

It is very easy for me to describe a hierarchical random effects model for this data with a centered parameterization. Let N_u be the number of unit-level covariates and N_g be the number of group-level covariates. Then the data-generating model can be written as follows:

\begin{align} y_i &\sim t(\nu, \mu, \phi) \\ \nu &\sim \Gamma(2, 0.1) \\ \phi &= \frac{1} {\theta}, \ \theta \sim t(4, 0, 1) \\ \mu &= \mu_{g} + \sum_{j=1} ^ {N_u} \beta_j \cdot \text{unit_level_covariate}_j, \ \beta_j \sim t(4, 0, 1) \\ \mu_{g} &\sim t\left(4, \sum_{j=1} ^ {N_g} \gamma_j \cdot \text{group_level_covariate}_j, \sigma_g \right), \gamma_j \sim t(4, 0, 1) \\ \sigma_g &\sim \text{half-}t(4, 0, 1) \end{align}

The accompanying stan code would look like this:

// hierarchical model with non-centered parameterization of mu
data {
int<lower=1> N; //Number of observations
int<lower=1> J; //Number of groups
int<lower=1> K; //Number of group predictors
int<lower=1> O; //Number of observation predictors
int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
real response[N]; // The outcome of the data generation process
matrix[N,O] observation_data; //covariate data of the observations
matrix[J,K] group_data; //covariate data of the groups
}
parameters {
real<lower=2> nu;
real<lower=0> inv_phi;
vector[O] beta;
vector[K] gamma;
vector[J] group_mu_raw;
real<lower=0> sigma_group_mu;

}
transformed parameters {
vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
vector[N] mu =
group_mu[group_idx] +
observation_data * beta;

real phi = 1 / inv_phi;
}
model {
nu ~ gamma(2, 0.1);
inv_phi ~ student_t(4, 0, 1);

beta ~ student_t(4, 0, 1);
gamma ~ student_t(4, 0, 1);

group_mu_raw ~ student_t(4, 0, 1);
sigma_group_mu  ~ student_t(4, 0, 1);

// likelihood function
response ~ student_t(nu, mu, phi);

}
generated quantities {
real y_rep[N];
for (n in 1:N) {
y_rep[n] = student_t_rng(nu, mu[n], phi);
}
}


Now, following along with the stancon tutorial, I saw that we can incorporate time varying effects by adding a linear term to the mean of the likelihood. Note that there are multiple observations and consequently groups for each time index t. If we say that the time-varying effect prior is a Gaussian Process, the data-generating model becomes:

\begin{align} y_i &\sim t(\nu, \mu, \phi) \\ \nu &\sim \Gamma(2, 0.1) \\ \phi &= \frac{1} {\theta}, \ \theta \sim t(4, 0, 1) \\ \mu &= w_t + \mu_{g} + \sum_{j=1} ^ {N_u} \beta_j \cdot \text{unit_level_covariate}_j, \ \beta_j \sim t(4, 0, 1) \\ w_t &= f + \epsilon, \ f \sim GP(0, K) \\ \epsilon &\sim N(0, \sigma_t), \ \sigma_t \sim \text{half-}N(0, 1) \\ \mu_{g} &\sim t\left(4, \sum_{j=1} ^ {N_g} \gamma_j \cdot \text{group_level_covariate}_j, \sigma_g \right), \gamma_j \sim t(4, 0, 1) \\ \sigma_g &\sim \text{half-}t(4, 0, 1) \end{align}

As we’ve been discussing, we can specify the priors for the covariance kernel to be any stationary kernel we wish and we can put appropriate priors on the the length-scale and amplitude parameters of that kernel.

The stan-code to generate this model will look like the below. Since the sum of covariance kernels is a covariance kernel, I use two squared exponential kernels (one for short trends, one for long trends) and a Matern \nu= 3/2 kernel for an auto-regressive component:

// hierarchical model with non-centered parameterization of mu
data {
int<lower=1> N; //Number of observations
int<lower=1> J; //Number of groups
int<lower=1> K; //Number of group predictors
int<lower=1> O; //Number of observation predictors
int<lower=1> T; //Number of time indexes.
int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
real response[N]; // The outcome of the data generation process
matrix[N,O] observation_data; //covariate data of the observations
matrix[J,K] group_data; //covariate data of the groups
}
transformed data {
real t_gp_vec[T];
real jitter = 1e-12;
for (t in 1:T)
t_gp_vec[t] = t;
}
parameters {
real<lower=2> nu;
real<lower=0> inv_phi;
vector[O] beta;
vector[K] gamma;
vector[J] group_mu_raw;
real<lower=0> sigma_group_mu;

// GP prior parameters
// Noise component
vector[T] t_noise_raw;
vector[T] gp_z;
real<lower=0> sigma_noise;

// Autoregressive GP Component
real<lower=0> gp_len_ar;
real<lower=0> sigma_gp_ar;

// Long-term Trend GP Component
real<lower=0> gp_len_trend_long;
real<lower=0> sigma_gp_trend_long;

// Short-term Trend GP Component
real<lower=0> gp_len_trend_short;
real<lower=0> sigma_gp_trend_short;
}
transformed parameters {
vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
vector[N] mu =
group_mu[group_idx] +
observation_data * beta;

real phi = 1 / inv_phi;

vector[T] f;
vector[T] gp;
vector[T] t_noise = sigma_noise * t_noise_raw;

{
matrix[T, T] C;
matrix[T, T] L_C;

C = gp_matern32_cov(t_gp_vec, sigma_gp_ar, gp_len_ar);
for (t in 1:T)
C[t,t] += jitter;
L_C = cholesky_decompose(C);
f = L_C * gp_z;
}

// gp is sum of monthly noise and the smoothly varying processes and the autoregressive process
gp = t_noise + f;
}
model {
nu ~ gamma(2, 0.1);
inv_phi ~ student_t(4, 0, 1);

beta ~ student_t(4, 0, 1);
gamma ~ student_t(4, 0, 1);

group_mu_raw ~ student_t(4, 0, 1);
sigma_group_mu  ~ student_t(4, 0, 1);

// GP priors
// Noise component
gp_z ~ normal(0, 1);
sigma_noise ~ normal(0, 1);
t_noise_raw ~ normal(0, 1);

// Autoregressive Component
gp_len_ar ~ inv_gamma(7, 20);
sigma_gp_ar ~ normal(0, 1);

// Short-term Trend Component
gp_len_trend_short ~ inv_gamma(10, 20);
sigma_gp_trend_short ~ normal(0, 1);

// Long-term Trend Component
gp_len_trend_long ~ inv_gamma(7, 20);
sigma_gp_trend_long ~ normal(0, 1);

// likelihood function
response ~ student_t(nu, mu + gp[t_idx], phi);

}
generated quantities {
real y_rep[N];
for (n in 1:N) {
y_rep[n] = student_t_rng(nu, mu[n] + gp[t_idx[n]], phi);
}
}


The reason I was asking about “what the gaussian process acts on” is because I need to feed in observed values for the process f into the posterior-predictive distribution for the Gaussian Process and in turn for the posterior predictive distribution for the observed response values y_i. That is, I need to know what the observed values \textbf{y} are in eq 2.22 found here: http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

From my limited understanding, for the data above, I believe \textbf{y} = \left<\frac{y_1 + y_2 + y_3 + y_4 + y_5 + y_6 }{6}, \frac{y_7 + y_8 + y_9 + y_{10} + y_{11} + y_{12} + y_{13} + y_{14}}{8}\right>, i.e. the average response for each time index t.

The reason I think this is the observed value vector is that the Gaussian Process regresses on the time index t and so the Gaussian Process would fit to the average value of time index t.

If this is incorrect, or if my use of the analytical form of the posterior predictive distribution for Gaussian processes of a Gaussian observation (the time trend), I apologize.

Clearing up any misconceptions on my part, if any, would be greatly appreciated.

If all of the above assumptions are correct, then I believe this code can be used to obtain the generating process for the posterior-predictive distribution.

// hierarchical model with non-centered parameterization of mu
data {
int<lower=1> N; //Number of observations
int<lower=0> J; //Number of groups
int<lower=0> K; //Number of group predictors
int<lower=0> O; //Number of observation predictor
int<lower=1> T; //Number of time indexes.
int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
matrix[N,O] observation_data; //covariate data of the observations
matrix[J,K] group_data; //covariate data of the groups
int<lower=0> max_group_idx; //Highest group index fit in model
int<lower=0> max_t_idx; //Highest time index fit in model
real response[N]; // The outcome of the data generation process
real response_by_t[max_t_idx]; //observed average for gaussian process.

int N_samples; // number of samples drawn from posterior distributions
real nu[N_samples]; // nu parameter from student t likelihood
real phi[N_samples]; // standard deviation parameter from student t likelihood
matrix[N_samples, O] beta; // covariate matrix draws for observations
matrix[N_samples, K] gamma; // covariate matrix draws for groups
matrix[N_samples, max_group_idx] group_mu; //posterior draws of the group mu parameter
real sigma_group_mu[N_samples]; // posterior draws of the hierarchal variance parameter.

// GP posterior samples
// Noise component
matrix[N_samples, max_t_idx] t_noise;
matrix[N_samples, max_t_idx] gp_z;
real sigma_noise[N_samples];

// Auto-regressive component
real sigma_gp_ar[N_samples];
real gp_len_ar[N_samples];

real sigma_gp_trend_short[N_samples];
real gp_len_trend_short[N_samples];

real sigma_gp_trend_long[N_samples];
real gp_len_trend_long[N_samples];
}
transformed data {
real t1_vec[max_t_idx];
real t2_vec[T - max_t_idx];
real jitter = 1e-12;
for (t in 1:max_t_idx)
t1_vec[t] = t;
for (t in 1:T - max_t_idx)
t2_vec[t] = t + max_t_idx;
}
parameters {
}
model {
}
generated quantities {
matrix[N_samples, N] y_pred;
matrix[N_samples, T] gp;

// generate the gp vector from the posterior samples.
for(s in 1:N_samples) {
vector[max_t_idx] f1;
vector[T - max_t_idx] f2;

{
matrix[max_t_idx, max_t_idx] L_C_t1;

matrix[max_t_idx, max_t_idx] C_t1;
matrix[max_t_idx, T - max_t_idx] C_t1_t2;
matrix[T - max_t_idx, max_t_idx] C_t2_t1;
matrix[T - max_t_idx, T - max_t_idx] C_t2;

matrix[T - max_t_idx, max_t_idx] A;

C_t1 = gp_matern32_cov(t1_vec, sigma_gp_ar[s], gp_len_ar[s]);

C_t1_t2 = gp_matern32_cov(t1_vec, t2_vec, sigma_gp_ar[s], gp_len_ar[s]);
C_t1_t2 += gp_exp_quad_cov(t1_vec, t2_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);

C_t2_t1 = gp_matern32_cov(t2_vec, t1_vec, sigma_gp_ar[s], gp_len_ar[s]);
C_t2_t1 += gp_exp_quad_cov(t2_vec, t1_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
C_t2_t1 += gp_exp_quad_cov(t2_vec, t1_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

C_t2 = gp_matern32_cov(t2_vec, sigma_gp_ar[s], gp_len_ar[s]);

if (size(t2_vec) > 0) {
A = C_t2_t1 * inverse(C_t1 + diag_matrix(rep_vector(square(sigma_noise[s]), max_t_idx)));
f2 = multi_normal_rng(A * to_vector(response_by_t), C_t2 - A * C_t1_t2);
}

for (t in 1:max_t_idx) {
C_t1[t, t] += jitter;
}

// Reuse samples from noise and multivariate normal for f1.
L_C_t1 = cholesky_decompose(C_t1);
f1 = L_C_t1 * to_vector(gp_z[s]) + to_vector(t_noise[s]);

gp[s] = to_row_vector(append_row(f1, f2));
}
}

for(n in 1:N) {
row_vector[N_samples] group_mu_rep;
row_vector[N_samples] mu;

if (group_idx[n] > max_group_idx) {
for (s in 1:N_samples) {
group_mu_rep[s] = gamma[s] * to_vector(group_data[group_idx[n]]) + sigma_group_mu[s] * student_t_rng(4, 0, 1);
}
}
else {
group_mu_rep = to_row_vector(group_mu[,group_idx[n]]);
}

mu = group_mu_rep + to_row_vector(beta * to_vector(observation_data[n])) + to_row_vector(gp[,t_idx[n]]);

y_pred[,n] = to_vector(student_t_rng(nu, mu, phi));

}
}


Thanks for making this so clear and easy to read. I now see why you’re asking this question.

Your understanding is mostly fine, my intent wasn’t to be condescending. I appreciate your questions and I realize that our documentation could be more clear.

Here are the main things I think you’re missing:

1. This \textbf{y} would just be the time series of responses.
2. You have to implement the code for the predictive equations.
3. Start simple.

First, I’m providing a simple example on simulated data below. Hopefully understanding this will enable us to see how to incorporate this method into a larger model. This is a simplified version of example 21.2 in BDA3.

Here’s the function I’m simulating (r-code below, I’ve mean centered and standardized in some areas, but you can re-scale after fitting the model):

y_t(x) = f_1(x) + f_2(x) + \epsilon_t

For:

f_1(x) = 2 x (a simple linear function)
f_2(x) = sin(\frac{x}{2\pi}) (a sine wave)

Which looks like this:

And I’m going to try to recover f_1 and f_2. So I use the following model:

f_1(x) \sim GP(0, k_1), k_1(x, x') = \sigma_1^2exp(\frac{d(x,x')^2}{l_1^2})
f_2(x) \sim GP(0, k_2), k_2(x, x') = \sigma_2^2 (x, x')

For (x, x') an inner product, and d(x, x') the l-2 norm/euclidean distance.

A caveat is, for combining kernels, you have to specify what trend you’d like to make predictions for. So if you want to make predictions of the time series of just the observed linear trend, you’d need to implement the posterior predictive accordingly. So we’ve constructed the following covariance matrix:

k_{total} = k_1 + k_2

If we want to make predictions for k_{1,(\tilde{x}, x)}, the linear function, we can use the equations you’ve found in GPML, but for the cross covariance matrix (the left most one for the posterior predictive mean), you use the covariance structure you’d like to make predictions for. So, for the linear kernel:

\tilde{f}_1 = k_{1,(\tilde{x}, x)} (k_{total} + I \sigma_n^2)^{-1}\textbf{y}

Where k_1 is a cross-covariance matrix computed from the linear kernel, using data we’ve trained our model on and data we’d like to make predictions for (I just did seq(0,100, length.out = 1000) in R). The Stan code for the posterior predictive mean of a GP, which you need to implement in the functions block, being:

  vector gp_pred_dot_prod_rng(vector[] x2,
vector y, vector[] x1,
real m, real l,
real sig,
real sigma) {
int N1 = rows(y);
int N2 = size(x2);
vector[N2] f;
{
gp_dot_prod_cov(x1, sig), sigma);
matrix[N1, N1] L_K = cholesky_decompose(K);
vector[N1] L_K_div_y = mdivide_left_tri_low(L_K, y);
vector[N1] K_div_y = mdivide_right_tri_low(L_K_div_y', L_K)';
matrix[N1, N2] k_x1_x2 = gp_dot_prod_cov(x1, x2, sig);
f = (k_x1_x2' * K_div_y);
}
return f;
}


And plotting the estimated \tilde{f}_1 against the simulated f_1:

And then, analogously, for the squared-exponential:

\tilde{f}_2 = k_{2,(\tilde{x}, x)} (k_{total} + I_N \sigma^2)^{-1}\textbf{y}

\tilde{f}_{total} = k_{total,(\tilde{x}, x)} (k_{total} + I_N \sigma^2)^{-1}\textbf{y}

It’s not perfect, but I hope that it gets the point across. There’s lots of design choices the modeler needs to make. For example, I should have used the periodic covariance function (gp_periodic_cov) instead of the squared exponential, and fixed the period if this information was known. I could have modeled the noise more explicitly for each function. I should have put more effort into thinking about priors, I just used normal(0, 1) to be quick.

I’ve attached all the code (R and Stan code). I understand that I’ve used my own notation and language, and this doesn’t at all unify with the notation in the Stan-Con case study you’ve been using, so I’m happy if you have any questions, and I’ll return to discuss how to do this with the Stan-con example at a later time.

gp_discourse_sim.R (1.9 KB)

4 Likes

Thank you very much for the detailed example with the simulated data. The example makes sense and luckily I believe matches the code I’ve shared above.

However, from the data that I have, there are multiple observations for each time period and I wish to make a prediction of a single observation. With that in mind, am I forming the time series \textbf{y} correctly by taking the mean of the response grouped by each historical time index?

If you are modelling the replicates per timepoint as normal deviations from a mean function (the result of computing f1(x)+f2(x) in @anon79882417’s example), then the best prediction for an out-of-sample point is going to be the prediction from the mean function.

Hi Mike,

Does that still apply when used as I have done in the following model?

Particular emphasis on each time index having multiple observations associated to it, like in the table here:

If you are not modelling an interaction between the covariate and the GP, then yes, my statement applies. Take a look at the code here for a walk-through of GP regression with replicated observations.

So, what do I use for the observed vector \textbf{y} for out-of-sample predictions? The historical weekly average values, which are at a higher level of aggregation than the observations in the likelihood?

That is the “observed” vector of the latent GP.

Sorry, I chimed in without thoroughly digesting the full thread. when you say out-of-sample, do you mean new timepoints that are outside the domain of the original data, new timepoints that are in the domain of the original data but fall between the sampled-timepoints in the original data, or simply new predictions for the timepoints from the original data?

I mean the former, so new timepoints that are outside the domain of the original data.

Ah, darn, then I might have to retract and bow-out; I have no experience with prediction outside the domain of the original data. Indeed, by looking at everything you’ve done here I suspect you might be more expert than I am in GP stuff in general too!

I will say that so far as I understand, the fitting code from your prior post yields samples for the parameter gp at each timepoint, reflecting the latent mean GP function, and I think it is these values that you need to use as the “observed” values in your subsequent posterior prediction run (rather than attempting to compute an average on the actual observed data per timepoint).

1 Like

That makes perfect sense! I’ll give that a shot. Thank you.