Gaussian Process out-of-sample predictive distribution

matthew · July 22, 2020, 9:44pm

I’m trying to expand upon a model I built previously by using Gaussian Processes.

The data that I have is t-distributed that can be grouped hierarchically and modeled with random effects. In addition, the observations of the population can also be grouped together by week.

I’m trying to take my random effects model that makes use of unit-level covariates and group-level covariates and add in a linear term to the mean function of the likelihood that models the time trend.

I was able to write a stan program that I believe achieves this through Gaussian processes. The function gp keyed by week is the sum of a long term trend, a short term trend, an autoregressive component, and gaussian white noise. I believe this program models what I want to achieve correctly. The function gp is fed into the mean likelihood distribution along with the unit and group coefficients.

I’m not entirely sure I’m drawing from the multivariate distribution correctly, i.e. I might not be combining the kernels correctly. Is anyone able to verify?

The program is shown below:

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=1> J; //Number of groups
  int<lower=1> K; //Number of group predictors
  int<lower=1> O; //Number of observation predictors
  int<lower=1> T; //Number of time indexes.
  int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
  int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
  real response[N]; // The outcome of the data generation process
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
}
transformed data {
  real t_gp_vec[T];
  real jitter = 1e-12;
  for (t in 1:T)
    t_gp_vec[t] = t;
}
parameters {
  real<lower=2> nu;
  real<lower=0> inv_phi;
  vector[O] beta;
  vector[K] gamma;
  vector[J] group_mu_raw;
  real<lower=0> sigma_group_mu;

  // GP prior parameters
  // Noise component
  vector[T] t_noise_raw;
  real<lower=0> sigma_noise;

  // Autoregressive GP Component
  vector[T] gp_raw_ar;
  real<lower=0> gp_len_ar;
  real<lower=0> sigma_gp_ar;

  // Long-term Trend GP Component
  vector[T] gp_raw_trend_long;
  real<lower=0> gp_len_trend_long;
  real<lower=0> sigma_gp_trend_long;

  // Short-term Trend GP Component
  vector[T] gp_raw_trend_short;
  real<lower=0> gp_len_trend_short;
  real<lower=0> sigma_gp_trend_short;
}
transformed parameters {
  vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
  vector[N] mu =
    group_mu[group_idx] +
    observation_data * beta;

  real phi = 1 / inv_phi;

  vector[T] gp_ar;
  vector[T] gp_trend_short;
  vector[T] gp_trend_long;
  vector[T] gp;
  vector[T] t_noise = sigma_noise * t_noise_raw;

  {
    matrix[T, T] C = gp_matern32_cov(t_gp_vec, sigma_gp_ar, gp_len_ar);
    matrix[T, T] L_C;
    for (t in 1:T)
      C[t,t] += jitter;
    L_C = cholesky_decompose(C);
    gp_ar = L_C * gp_raw_ar;
  }

  {
    matrix[T, T] C = gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_short, gp_len_trend_short);
    matrix[T, T] L_C;
    for (t in 1:T)
      C[t,t] += jitter;
    L_C = cholesky_decompose(C);
    gp_trend_short = L_C * gp_raw_trend_short;
  }

  {
    matrix[T, T] C = gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_long, gp_len_trend_long);
    matrix[T, T] L_C;
    for (t in 1:T)
      C[t,t] += jitter;
    L_C = cholesky_decompose(C);
    gp_trend_long = L_C * gp_raw_trend_long;
  }

  // gp is sum of monthly noise and the smoothly varying processes and the autoregressive process
  gp = t_noise + gp_ar + gp_trend_short + gp_trend_long;
}
model {
  nu ~ gamma(2, 0.1);
  inv_phi ~ student_t(4, 0, 1);

  beta ~ student_t(4, 0, 1);
  gamma ~ student_t(4, 0, 1);

  group_mu_raw ~ student_t(4, 0, 1);
  sigma_group_mu  ~ student_t(4, 0, 1);

  // GP priors
  // Noise component
  sigma_noise ~ normal(0, 1);
  t_noise_raw ~ normal(0, 1);

  // Autoregressive Component
  gp_raw_ar ~ normal(0, 1);
  gp_len_ar ~ inv_gamma(5, 5);
  sigma_gp_ar ~ normal(0, 1);

  // Short-term Trend Component
  gp_raw_trend_short ~ normal(0, 1);
  gp_len_trend_short ~ inv_gamma(5, 5);
  sigma_gp_trend_short ~ normal(0, 1);

  // Long-term Trend Component
  gp_raw_trend_long ~ normal(0, 1);
  gp_len_trend_long ~ inv_gamma(5, 5);
  sigma_gp_trend_long ~ normal(0, 1);

  // likelihood function
  response ~ student_t(nu, mu + gp[t_idx], phi);

}
generated quantities {
  real y_rep[N];
  for (n in 1:N) {
    y_rep[n] = student_t_rng(nu, mu[n] + gp[t_idx[n]], phi);
  }
}

However, I am now trying to generate out-of-sample predictions with the model and am running into some difficulty. I was wondering if the below code is the right way to generate those predictions. In particular, I’m not sure I’m handling the GP part correctly.

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=0> J; //Number of groups
  int<lower=0> K; //Number of group predictors
  int<lower=0> O; //Number of observation predictor
  int<lower=1> T; //Number of time indexes.
  int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
  int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
  real response[N]; // The outcome of the data generation process
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
  int<lower=0> max_group_idx; //Highest group index fit in model
  int<lower=0> max_t_idx; //Highest time index fit in model

  int N_samples; // number of samples drawn from posterior distributions
  real nu[N_samples]; // nu parameter from student t likelihood
  real phi[N_samples]; // standard deviation parameter from student t likelihood
  matrix[N_samples, O] beta; // covariate matrix draws for observations
  matrix[N_samples, K] gamma; // covariate matrix draws for groups
  matrix[N_samples, max_group_idx] group_mu; //posterior draws of the group mu parameter
  real sigma_group_mu[N_samples]; // posterior draws of the hierarchal variance parameter.

  // GP posterior samples
  // Noise component
  matrix[N_samples, max_t_idx] t_noise;
  real sigma_noise[N_samples];

  // Auto-regressive component
  real sigma_gp_ar[N_samples];
  real gp_len_ar[N_samples];
  matrix[N_samples, max_t_idx] gp_raw_ar;

  real sigma_gp_trend_short[N_samples];
  real gp_len_trend_short[N_samples];
  matrix[N_samples, max_t_idx] gp_raw_trend_short;

  real sigma_gp_trend_long[N_samples];
  real gp_len_trend_long[N_samples];
  matrix[N_samples, max_t_idx] gp_raw_trend_long;
}
transformed data {
  real t_gp_vec[T];
  real jitter = 1e-12;
  for (t in 1:T)
    t_gp_vec[t] = t;
}
parameters {
}
model {
}
generated quantities {
  matrix[N_samples, N] y_pred;
  matrix[N_samples, T] gp;

  // generate the gp vector from the posterior samples.
  for(s in 1:N_samples) {
    vector[T] gp_ar;
    vector[T] gp_trend_short;
    vector[T] gp_trend_long;

    vector[T] gp_raw_ar_full;
    vector[T] gp_raw_trend_short_full;
    vector[T] gp_raw_trend_long_full;
    vector[T] t_noise_full;

    for(t in 1:max_t_idx){
      gp_raw_ar_full[t] = gp_raw_ar[s, t];
      gp_raw_trend_short_full[t] = gp_raw_trend_short[s, t];
      gp_raw_trend_long_full[t] = gp_raw_trend_long[s, t];
      t_noise_full[t] = t_noise[s, t];
    }
    for(t in max_t_idx+1:T){
      gp_raw_ar_full[t] = normal_rng(0, 1);
      gp_raw_trend_short_full[t] = normal_rng(0, 1);
      gp_raw_trend_long_full[t] = normal_rng(0, 1);
      t_noise_full[t] = normal_rng(0, sigma_noise[s]);
    }

    {
       matrix[T, T] C = gp_matern32_cov(t_gp_vec, sigma_gp_ar[s], gp_len_ar[s]);
       matrix[T, T] L_C;
       for (t in 1:T)
         C[t, t] += jitter;
       L_C = cholesky_decompose(C);
       gp_ar = L_C * gp_raw_ar_full;
    }

    {
       matrix[T, T] C = gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       matrix[T, T] L_C;
       for (t in 1:T)
         C[t, t] += jitter;
       L_C = cholesky_decompose(C);
       gp_trend_short = L_C * gp_raw_trend_short_full;
    }

    {
       matrix[T, T] C = gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);
       matrix[T, T] L_C;
       for (t in 1:T)
         C[t, t] += jitter;
       L_C = cholesky_decompose(C);
       gp_trend_long = L_C * gp_raw_trend_long_full;
    }

    gp[s] = to_row_vector(t_noise_full + gp_ar + gp_trend_short + gp_trend_long);
  }

  for(n in 1:N) {
    row_vector[N_samples] group_mu_rep;
    row_vector[N_samples] mu;

    if (group_idx[n] > max_group_idx) {
      for (s in 1:N_samples) {
         group_mu_rep[s] = gamma[s] * to_vector(group_data[group_idx[n]]) + sigma_group_mu[s] * student_t_rng(4, 0, 1);
      }
    }
    else {
      group_mu_rep = to_row_vector(group_mu[,group_idx[n]]);
    }

    mu = group_mu_rep + to_row_vector(beta * to_vector(observation_data[n])) + to_row_vector(gp[,t_idx[n]]);

    y_pred[,n] = to_vector(student_t_rng(nu, mu, phi));

  }
}

Is there any literature on combining multiple kernels to generate out-of-sample data, especially when the observation process is not normally distributed?

UPDATE: here is what I believe to be the correct process now:

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=1> J; //Number of groups
  int<lower=1> K; //Number of group predictors
  int<lower=1> O; //Number of observation predictors
  int<lower=1> T; //Number of time indexes.
  int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
  int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
  real response[N]; // The outcome of the data generation process
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
}
transformed data {
  real t_gp_vec[T];
  real jitter = 1e-12;
  for (t in 1:T)
    t_gp_vec[t] = t;
}
parameters {
  real<lower=2> nu;
  real<lower=0> inv_phi;
  vector[O] beta;
  vector[K] gamma;
  vector[J] group_mu_raw;
  real<lower=0> sigma_group_mu;

  // GP prior parameters
  // Noise component
  vector[T] t_noise_raw;
  vector[T] gp_z;
  real<lower=0> sigma_noise;

  // Autoregressive GP Component
  real<lower=0> gp_len_ar;
  real<lower=0> sigma_gp_ar;

  // Long-term Trend GP Component
  real<lower=0> gp_len_trend_long;
  real<lower=0> sigma_gp_trend_long;

  // Short-term Trend GP Component
  real<lower=0> gp_len_trend_short;
  real<lower=0> sigma_gp_trend_short;
}
transformed parameters {
  vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
  vector[N] mu =
    group_mu[group_idx] +
    observation_data * beta;

  real phi = 1 / inv_phi;

  vector[T] f;
  vector[T] gp;
  vector[T] t_noise = sigma_noise * t_noise_raw;

  {
    matrix[T, T] C;
    matrix[T, T] L_C;

    C = gp_matern32_cov(t_gp_vec, sigma_gp_ar, gp_len_ar);
    C += gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_short, gp_len_trend_short);
    C += gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_long, gp_len_trend_long);
    for (t in 1:T)
      C[t,t] += jitter;
    L_C = cholesky_decompose(C);
    f = L_C * gp_z;
  }

  // gp is sum of monthly noise and the smoothly varying processes and the autoregressive process
  gp = t_noise + f;
}
model {
  nu ~ gamma(2, 0.1);
  inv_phi ~ student_t(4, 0, 1);

  beta ~ student_t(4, 0, 1);
  gamma ~ student_t(4, 0, 1);

  group_mu_raw ~ student_t(4, 0, 1);
  sigma_group_mu  ~ student_t(4, 0, 1);

  // GP priors
  // Noise component
  gp_z ~ normal(0, 1);
  sigma_noise ~ normal(0, 1);
  t_noise_raw ~ normal(0, 1);

  // Autoregressive Component
  gp_len_ar ~ inv_gamma(5, 5);
  sigma_gp_ar ~ normal(0, 1);

  // Short-term Trend Component
  gp_len_trend_short ~ inv_gamma(5, 5);
  sigma_gp_trend_short ~ normal(0, 1);

  // Long-term Trend Component
  gp_len_trend_long ~ inv_gamma(5, 5);
  sigma_gp_trend_long ~ normal(0, 1);

  // likelihood function
  response ~ student_t(nu, mu + gp[t_idx], phi);

}
generated quantities {
  real y_rep[N];
  for (n in 1:N) {
    y_rep[n] = student_t_rng(nu, mu[n] + gp[t_idx[n]], phi);
  }
}

and the associated out-of-sample draws:

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=0> J; //Number of groups
  int<lower=0> K; //Number of group predictors
  int<lower=0> O; //Number of observation predictor
  int<lower=1> T; //Number of time indexes.
  int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
  int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
  real response[N]; // The outcome of the data generation process
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
  int<lower=0> max_group_idx; //Highest group index fit in model
  int<lower=0> max_t_idx; //Highest time index fit in model

  int N_samples; // number of samples drawn from posterior distributions
  real nu[N_samples]; // nu parameter from student t likelihood
  real phi[N_samples]; // standard deviation parameter from student t likelihood
  matrix[N_samples, O] beta; // covariate matrix draws for observations
  matrix[N_samples, K] gamma; // covariate matrix draws for groups
  matrix[N_samples, max_group_idx] group_mu; //posterior draws of the group mu parameter
  real sigma_group_mu[N_samples]; // posterior draws of the hierarchal variance parameter.

  // GP posterior samples
  // Noise component
  matrix[N_samples, max_t_idx] t_noise;
  matrix[N_samples, max_t_idx] gp_z;
  real sigma_noise[N_samples];

  // Auto-regressive component
  real sigma_gp_ar[N_samples];
  real gp_len_ar[N_samples];

  real sigma_gp_trend_short[N_samples];
  real gp_len_trend_short[N_samples];

  real sigma_gp_trend_long[N_samples];
  real gp_len_trend_long[N_samples];
}
transformed data {
  real t_gp_vec[T];
  real jitter = 1e-12;
  for (t in 1:T)
    t_gp_vec[t] = t;
}
parameters {
}
model {
}
generated quantities {
  matrix[N_samples, N] y_pred;
  matrix[N_samples, T] gp;

  // generate the gp vector from the posterior samples.
  for(s in 1:N_samples) {
    vector[T] f;

    vector[T] gp_z_full;
    vector[T] t_noise_full;

    for(t in 1:max_t_idx){
      t_noise_full[t] = t_noise[s, t];
      gp_z_full[t] = gp_z[s, t];
    }
    for(t in max_t_idx+1:T){
      t_noise_full[t] = normal_rng(0, sigma_noise[s]);
      gp_z_full[t] = normal_rng(0, 1);
    }

    {
       matrix[T, T] C;
       matrix[T, T] L_C;

       C = gp_matern32_cov(t_gp_vec, sigma_gp_ar[s], gp_len_ar[s]);
       C += gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       C += gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);
       for (t in 1:T)
         C[t, t] += jitter;
       L_C = cholesky_decompose(C);
       f = L_C * gp_z_full;
    }

    gp[s] = to_row_vector(t_noise_full + f);
  }

  for(n in 1:N) {
    row_vector[N_samples] group_mu_rep;
    row_vector[N_samples] mu;

    if (group_idx[n] > max_group_idx) {
      for (s in 1:N_samples) {
         group_mu_rep[s] = gamma[s] * to_vector(group_data[group_idx[n]]) + sigma_group_mu[s] * student_t_rng(4, 0, 1);
      }
    }
    else {
      group_mu_rep = to_row_vector(group_mu[,group_idx[n]]);
    }

    mu = group_mu_rep + to_row_vector(beta * to_vector(observation_data[n])) + to_row_vector(gp[,t_idx[n]]);

    y_pred[,n] = to_vector(student_t_rng(nu, mu, phi));

  }
}

matthew · July 22, 2020, 10:00pm

I think I can say that my approach in the training part is actually wrong.

The Gaussian Process gp with the above covariance matrices is distributed as so

\textbf{gp} \sim N (\textbf{0}, \Sigma_1 + \Sigma_2 + \Sigma_3) .

Note that if \Sigma_1 + \Sigma_2 + \Sigma_3 = L L^T for some matrix L, then we can say that

\textbf{gp} = \textbf{0} + L \textbf{Z}, \ Z_i \sim N(0, 1),

i.e. I should be taking the sum of the covariance matrices and then taking the cholesky decomposition of the sum and multiplying by the iid Gaussian RVs.

I am still slightly confused by the out-of-sample generation though.I think can repeat this approach with out-of-sample predictions by using the posterior estimates for length-scale and amplitude to estimate the covariance matrix sum, take the cholesky decomposition and just generate iid Gaussian RVs for both the training and out-of-sample time periods.

Is this correct?

matthew · July 23, 2020, 2:54pm

@anon79882417, do you have any experience with this topic of out-of-sample predictive distribution draws?

matthew · July 23, 2020, 11:00pm

I found some literature here: http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

Eq 2.22 gives the analytical form of the posterior predictive distribution. Following that made it obvious what the predictive step should be.

For posterity, here are the updated train/prediction steps:

train:

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=1> J; //Number of groups
  int<lower=1> K; //Number of group predictors
  int<lower=1> O; //Number of observation predictors
  int<lower=1> T; //Number of time indexes.
  int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
  int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
  real response[N]; // The outcome of the data generation process
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
}
transformed data {
  real t_gp_vec[T];
  real jitter = 1e-12;
  for (t in 1:T)
    t_gp_vec[t] = t;
}
parameters {
  real<lower=2> nu;
  real<lower=0> inv_phi;
  vector[O] beta;
  vector[K] gamma;
  vector[J] group_mu_raw;
  real<lower=0> sigma_group_mu;

  // GP prior parameters
  // Noise component
  vector[T] t_noise_raw;
  vector[T] gp_z;
  real<lower=0> sigma_noise;

  // Autoregressive GP Component
  real<lower=0> gp_len_ar;
  real<lower=0> sigma_gp_ar;

  // Long-term Trend GP Component
  real<lower=0> gp_len_trend_long;
  real<lower=0> sigma_gp_trend_long;

  // Short-term Trend GP Component
  real<lower=0> gp_len_trend_short;
  real<lower=0> sigma_gp_trend_short;
}
transformed parameters {
  vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
  vector[N] mu =
    group_mu[group_idx] +
    observation_data * beta;

  real phi = 1 / inv_phi;

  vector[T] f;
  vector[T] gp;
  vector[T] t_noise = sigma_noise * t_noise_raw;

  {
    matrix[T, T] C;
    matrix[T, T] L_C;

    C = gp_matern32_cov(t_gp_vec, sigma_gp_ar, gp_len_ar);
    C += gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_short, gp_len_trend_short);
    C += gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_long, gp_len_trend_long);
    for (t in 1:T)
      C[t,t] += jitter;
    L_C = cholesky_decompose(C);
    f = L_C * gp_z;
  }

  // gp is sum of monthly noise and the smoothly varying processes and the autoregressive process
  gp = t_noise + f;
}
model {
  nu ~ gamma(2, 0.1);
  inv_phi ~ student_t(4, 0, 1);

  beta ~ student_t(4, 0, 1);
  gamma ~ student_t(4, 0, 1);

  group_mu_raw ~ student_t(4, 0, 1);
  sigma_group_mu  ~ student_t(4, 0, 1);

  // GP priors
  // Noise component
  gp_z ~ normal(0, 1);
  sigma_noise ~ normal(0, 1);
  t_noise_raw ~ normal(0, 1);

  // Autoregressive Component
  gp_len_ar ~ inv_gamma(5, 5);
  sigma_gp_ar ~ normal(0, 1);

  // Short-term Trend Component
  gp_len_trend_short ~ inv_gamma(5, 5);
  sigma_gp_trend_short ~ normal(0, 1);

  // Long-term Trend Component
  gp_len_trend_long ~ inv_gamma(5, 5);
  sigma_gp_trend_long ~ normal(0, 1);

  // likelihood function
  response ~ student_t(nu, mu + gp[t_idx], phi);

}
generated quantities {
  real y_rep[N];
  for (n in 1:N) {
    y_rep[n] = student_t_rng(nu, mu[n] + gp[t_idx[n]], phi);
  }
}

prediction:

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=0> J; //Number of groups
  int<lower=0> K; //Number of group predictors
  int<lower=0> O; //Number of observation predictor
  int<lower=1> T; //Number of time indexes.
  int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
  int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
  int<lower=0> max_group_idx; //Highest group index fit in model
  int<lower=0> max_t_idx; //Highest time index fit in model
  real response[N]; // The outcome of the data generation process
  real response_by_t[max_t_idx]; //observed average for gaussian process.

  int N_samples; // number of samples drawn from posterior distributions
  real nu[N_samples]; // nu parameter from student t likelihood
  real phi[N_samples]; // standard deviation parameter from student t likelihood
  matrix[N_samples, O] beta; // covariate matrix draws for observations
  matrix[N_samples, K] gamma; // covariate matrix draws for groups
  matrix[N_samples, max_group_idx] group_mu; //posterior draws of the group mu parameter
  real sigma_group_mu[N_samples]; // posterior draws of the hierarchal variance parameter.

  // GP posterior samples
  // Noise component
  matrix[N_samples, max_t_idx] t_noise;
  matrix[N_samples, max_t_idx] gp_z;
  real sigma_noise[N_samples];

  // Auto-regressive component
  real sigma_gp_ar[N_samples];
  real gp_len_ar[N_samples];

  real sigma_gp_trend_short[N_samples];
  real gp_len_trend_short[N_samples];

  real sigma_gp_trend_long[N_samples];
  real gp_len_trend_long[N_samples];
}
transformed data {
  real t1_vec[max_t_idx];
  real t2_vec[T - max_t_idx];
  real jitter = 1e-12;
  for (t in 1:max_t_idx)
    t1_vec[t] = t;
  for (t in 1:T - max_t_idx)
    t2_vec[t] = t + max_t_idx;
}
parameters {
}
model {
}
generated quantities {
  matrix[N_samples, N] y_pred;
  matrix[N_samples, T] gp;

  // generate the gp vector from the posterior samples.
  for(s in 1:N_samples) {
    vector[max_t_idx] f1;
    vector[T - max_t_idx] f2;

    {
       matrix[max_t_idx, max_t_idx] C_t1;
       matrix[max_t_idx, T - max_t_idx] C_t1_t2;
       matrix[T - max_t_idx, max_t_idx] C_t2_t1;
       matrix[T - max_t_idx, T - max_t_idx] C_t2;

       matrix[T - max_t_idx, max_t_idx] A;

       C_t1 = gp_matern32_cov(t1_vec, sigma_gp_ar[s], gp_len_ar[s]);
       C_t1 += gp_exp_quad_cov(t1_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       C_t1 += gp_exp_quad_cov(t1_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

       C_t1_t2 = gp_matern32_cov(t1_vec, t2_vec, sigma_gp_ar[s], gp_len_ar[s]);
       C_t1_t2 += gp_exp_quad_cov(t1_vec, t2_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       C_t1_t2+= gp_exp_quad_cov(t1_vec, t2_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

       C_t2_t1 = gp_matern32_cov(t2_vec, t1_vec, sigma_gp_ar[s], gp_len_ar[s]);
       C_t2_t1 += gp_exp_quad_cov(t2_vec, t1_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       C_t2_t1 += gp_exp_quad_cov(t2_vec, t1_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

       C_t2 = gp_matern32_cov(t2_vec, sigma_gp_ar[s], gp_len_ar[s]);
       C_t2 += gp_exp_quad_cov(t2_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       C_t2 += gp_exp_quad_cov(t2_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

       if (size(t2_vec) > 0) {
         A = C_t2_t1 * inverse(C_t1 + diag_matrix(rep_vector(square(sigma_noise[s]), max_t_idx)));
         f2 = multi_normal_rng(A * to_vector(response_by_t), C_t2 - A * C_t1_t2);
       }

       f1 = multi_normal_rng(rep_vector(0, max_t_idx), C_t1) + normal_rng(0, sigma_noise[s]);

       gp[s] = to_row_vector(append_row(f1, f2));
    }
  }

  for(n in 1:N) {
    row_vector[N_samples] group_mu_rep;
    row_vector[N_samples] mu;

    if (group_idx[n] > max_group_idx) {
      for (s in 1:N_samples) {
         group_mu_rep[s] = gamma[s] * to_vector(group_data[group_idx[n]]) + sigma_group_mu[s] * student_t_rng(4, 0, 1);
      }
    }
    else {
      group_mu_rep = to_row_vector(group_mu[,group_idx[n]]);
    }

    mu = group_mu_rep + to_row_vector(beta * to_vector(observation_data[n])) + to_row_vector(gp[,t_idx[n]]);

    y_pred[,n] = to_vector(student_t_rng(nu, mu, phi));

  }
}

avehtari · July 24, 2020, 8:26am

Great that you were able to code the predictions based on GPML book.

Stan User’s Guide section 10.3 subsubsection " Analytical Form of Joint Predictive Inference" has also the equations and Stan code https://mc-stan.org/docs/2_23/stan-users-guide/fit-gp-section.html#predictive-inference-with-a-gaussian-process. The code shown in Stan User’s Guide should be numerically more stable as mdivide_left_tri_low and mdivide_right_tri_low are used instead of inverse.

matthew · July 24, 2020, 3:20pm

@avehtari

Am I correct in using the mean of the response by each week as data to use for the observed values of the Gaussian Process?

Meaning, each time index t is associated to more than one observation. The Gaussian Process would actually be operating on the mean of these observations during training and is what I should be using for response_by_t in the predictive distribution, right?

anon79882417 · July 24, 2020, 11:17pm

Yeah, you can sum covariance matrices. There’s an analytic (but not computational) example in BDA3 in the GP section on the birthday problem. If you’re not able to verify with simulations, I can send by some code and simulated data as an example. It’s always recommended to recover parameter or functions from simulated data, too.

This would equate to having one of your $K_i$’s above being a linear kernel (gp_dot_prod_cov) which I’ve yet to write docs for. Thanks for the reminder.

Hope this works out!

matthew · July 25, 2020, 5:09am

Hello @anon79882417, I’m confused by your statement regarding gp_dot_prod_cov.

I was following along with this tutorial: http://htmlpreview.github.io/?https://github.com/lauken13/Beginners_Bayes_Workshop/blob/master/stancon2019-intro.html.

Near the end, they build up the model to be a hierarchical model with gaussian process for the time component. Each time index has multiple observations associated to it. The model is found here:

github.com

lauken13/Beginners_Bayes_Workshop/blob/master/stan_programs/hier_NB_regression_ncp_slopes_mod_mos_gp.stan

functions {
  int neg_binomial_2_log_safe_rng(real eta, real phi) {
    real phi_div_exp_eta;
    real gamma_rate;
    phi_div_exp_eta = phi/exp(eta);
    gamma_rate = gamma_rng(phi, phi_div_exp_eta);
    if (gamma_rate >= exp(20.79))
      return -9;
    return poisson_rng(gamma_rate);
  }
}
data {
  int<lower=1> N;
  int<lower=1> M;
  int<lower=1> K;
  int complaints[N];
  vector[N] traps;
  int<lower=1> J;
  int<lower=1, upper=J> building_idx[N];
  matrix[J,K] building_data;

This file has been truncated. show original

Is this modeled correctly and if so, does the GP act on the mean of the observations for the time index?

anon79882417 · July 25, 2020, 8:49pm

We have to be careful with our language here. Carefully specifying what language we use, particularly using the language of mathematics, permits more effective communication. A statement like “act on” is not mathematically well defined.

I’ll set a reminder later this week to respond but I’ll focus on answering your questions in the initial post.

These seem to be: combining kernels, proper implementation in Stan and generating from the posterior predictive of a GP with a non-Gaussian likelihood.

I’ll provide R code and simulated data.

matthew · July 25, 2020, 11:47pm

@anon79882417,

More precisely, I have data shaped like the below:

observation_number	group_index	time_index	unit_level_covariate_1	unit_level_covariate_2	group_level_covariate_1	response
1	1	1	-0.1	0.2	0.8	y_1
2	1	1	0.4	-0.6	0.8	y_2
3	2	1	0.34	-0.45	0.5	y_3
4	2	1	-0.25	0.3	0.5	y_4
5	2	1	0.24	5	0.5	y_5
6	3	1	-0.64	5	0.2	y_6
7	1	2	0.2	0.2	0.8	y_7
8	1	2	0.6	0.35	0.8	y_8
9	1	2	0.8	0.54	0.8	y_9
10	2	2	0.9	0.6	0.5	y_{10}
11	2	2	0.2	0.5	0.5	y_{11}
12	4	2	0.1	0.5	0.4	y_{12}
13	4	2	0.6	-0.3	0.4	y_{13}
14	5	2	0.7	0.9	0.1	y_{14}

What the unit-level and group-level covariates are and their values are immaterial; it suffices to know that they have been standardized to be on unit scale. The same goes for the response variable y_i.

It is very easy for me to describe a hierarchical random effects model for this data with a centered parameterization. Let N_u be the number of unit-level covariates and N_g be the number of group-level covariates. Then the data-generating model can be written as follows:

\begin{align} y_i &\sim t(\nu, \mu, \phi) \\ \nu &\sim \Gamma(2, 0.1) \\ \phi &= \frac{1} {\theta}, \ \theta \sim t(4, 0, 1) \\ \mu &= \mu_{g} + \sum_{j=1} ^ {N_u} \beta_j \cdot \text{unit_level_covariate}_j, \ \beta_j \sim t(4, 0, 1) \\ \mu_{g} &\sim t\left(4, \sum_{j=1} ^ {N_g} \gamma_j \cdot \text{group_level_covariate}_j, \sigma_g \right), \gamma_j \sim t(4, 0, 1) \\ \sigma_g &\sim \text{half-}t(4, 0, 1) \end{align}

The accompanying stan code would look like this:

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=1> J; //Number of groups
  int<lower=1> K; //Number of group predictors
  int<lower=1> O; //Number of observation predictors
  int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
  real response[N]; // The outcome of the data generation process
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
}
parameters {
  real<lower=2> nu;
  real<lower=0> inv_phi;
  vector[O] beta;
  vector[K] gamma;
  vector[J] group_mu_raw;
  real<lower=0> sigma_group_mu;

}
transformed parameters {
  vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
  vector[N] mu =
    group_mu[group_idx] +
    observation_data * beta;

  real phi = 1 / inv_phi;
}
model {
  nu ~ gamma(2, 0.1);
  inv_phi ~ student_t(4, 0, 1);

  beta ~ student_t(4, 0, 1);
  gamma ~ student_t(4, 0, 1);

  group_mu_raw ~ student_t(4, 0, 1);
  sigma_group_mu  ~ student_t(4, 0, 1);

  // likelihood function
  response ~ student_t(nu, mu, phi);

}
generated quantities {
  real y_rep[N];
  for (n in 1:N) {
    y_rep[n] = student_t_rng(nu, mu[n], phi);
  }
}

Now, following along with the stancon tutorial, I saw that we can incorporate time varying effects by adding a linear term to the mean of the likelihood. Note that there are multiple observations and consequently groups for each time index t. If we say that the time-varying effect prior is a Gaussian Process, the data-generating model becomes:

\begin{align} y_i &\sim t(\nu, \mu, \phi) \\ \nu &\sim \Gamma(2, 0.1) \\ \phi &= \frac{1} {\theta}, \ \theta \sim t(4, 0, 1) \\ \mu &= w_t + \mu_{g} + \sum_{j=1} ^ {N_u} \beta_j \cdot \text{unit_level_covariate}_j, \ \beta_j \sim t(4, 0, 1) \\ w_t &= f + \epsilon, \ f \sim GP(0, K) \\ \epsilon &\sim N(0, \sigma_t), \ \sigma_t \sim \text{half-}N(0, 1) \\ \mu_{g} &\sim t\left(4, \sum_{j=1} ^ {N_g} \gamma_j \cdot \text{group_level_covariate}_j, \sigma_g \right), \gamma_j \sim t(4, 0, 1) \\ \sigma_g &\sim \text{half-}t(4, 0, 1) \end{align}

As we’ve been discussing, we can specify the priors for the covariance kernel to be any stationary kernel we wish and we can put appropriate priors on the the length-scale and amplitude parameters of that kernel.

The stan-code to generate this model will look like the below. Since the sum of covariance kernels is a covariance kernel, I use two squared exponential kernels (one for short trends, one for long trends) and a Matern \nu= 3/2 kernel for an auto-regressive component:

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=1> J; //Number of groups
  int<lower=1> K; //Number of group predictors
  int<lower=1> O; //Number of observation predictors
  int<lower=1> T; //Number of time indexes.
  int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
  int<lower=1, upper=J> group_idx[N]; // Maps each observation to each group
  real response[N]; // The outcome of the data generation process
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
}
transformed data {
  real t_gp_vec[T];
  real jitter = 1e-12;
  for (t in 1:T)
    t_gp_vec[t] = t;
}
parameters {
  real<lower=2> nu;
  real<lower=0> inv_phi;
  vector[O] beta;
  vector[K] gamma;
  vector[J] group_mu_raw;
  real<lower=0> sigma_group_mu;

  // GP prior parameters
  // Noise component
  vector[T] t_noise_raw;
  vector[T] gp_z;
  real<lower=0> sigma_noise;

  // Autoregressive GP Component
  real<lower=0> gp_len_ar;
  real<lower=0> sigma_gp_ar;

  // Long-term Trend GP Component
  real<lower=0> gp_len_trend_long;
  real<lower=0> sigma_gp_trend_long;

  // Short-term Trend GP Component
  real<lower=0> gp_len_trend_short;
  real<lower=0> sigma_gp_trend_short;
}
transformed parameters {
  vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
  vector[N] mu =
    group_mu[group_idx] +
    observation_data * beta;

  real phi = 1 / inv_phi;

  vector[T] f;
  vector[T] gp;
  vector[T] t_noise = sigma_noise * t_noise_raw;

  {
    matrix[T, T] C;
    matrix[T, T] L_C;

    C = gp_matern32_cov(t_gp_vec, sigma_gp_ar, gp_len_ar);
    C += gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_short, gp_len_trend_short);
    C += gp_exp_quad_cov(t_gp_vec, sigma_gp_trend_long, gp_len_trend_long);
    for (t in 1:T)
      C[t,t] += jitter;
    L_C = cholesky_decompose(C);
    f = L_C * gp_z;
  }

  // gp is sum of monthly noise and the smoothly varying processes and the autoregressive process
  gp = t_noise + f;
}
model {
  nu ~ gamma(2, 0.1);
  inv_phi ~ student_t(4, 0, 1);

  beta ~ student_t(4, 0, 1);
  gamma ~ student_t(4, 0, 1);

  group_mu_raw ~ student_t(4, 0, 1);
  sigma_group_mu  ~ student_t(4, 0, 1);

  // GP priors
  // Noise component
  gp_z ~ normal(0, 1);
  sigma_noise ~ normal(0, 1);
  t_noise_raw ~ normal(0, 1);

  // Autoregressive Component
  gp_len_ar ~ inv_gamma(7, 20);
  sigma_gp_ar ~ normal(0, 1);

  // Short-term Trend Component
  gp_len_trend_short ~ inv_gamma(10, 20);
  sigma_gp_trend_short ~ normal(0, 1);

  // Long-term Trend Component
  gp_len_trend_long ~ inv_gamma(7, 20);
  sigma_gp_trend_long ~ normal(0, 1);

  // likelihood function
  response ~ student_t(nu, mu + gp[t_idx], phi);

}
generated quantities {
  real y_rep[N];
  for (n in 1:N) {
    y_rep[n] = student_t_rng(nu, mu[n] + gp[t_idx[n]], phi);
  }
}

The reason I was asking about “what the gaussian process acts on” is because I need to feed in observed values for the process f into the posterior-predictive distribution for the Gaussian Process and in turn for the posterior predictive distribution for the observed response values y_i. That is, I need to know what the observed values \textbf{y} are in eq 2.22 found here: http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

From my limited understanding, for the data above, I believe \textbf{y} = \left<\frac{y_1 + y_2 + y_3 + y_4 + y_5 + y_6 }{6}, \frac{y_7 + y_8 + y_9 + y_{10} + y_{11} + y_{12} + y_{13} + y_{14}}{8}\right>, i.e. the average response for each time index t.

The reason I think this is the observed value vector is that the Gaussian Process regresses on the time index t and so the Gaussian Process would fit to the average value of time index t.

If this is incorrect, or if my use of the analytical form of the posterior predictive distribution for Gaussian processes of a Gaussian observation (the time trend), I apologize.

Clearing up any misconceptions on my part, if any, would be greatly appreciated.

If all of the above assumptions are correct, then I believe this code can be used to obtain the generating process for the posterior-predictive distribution.

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=0> J; //Number of groups
  int<lower=0> K; //Number of group predictors
  int<lower=0> O; //Number of observation predictor
  int<lower=1> T; //Number of time indexes.
  int<lower=1, upper=T> t_idx[N]; // Maps each observation to each time period
  int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
  int<lower=0> max_group_idx; //Highest group index fit in model
  int<lower=0> max_t_idx; //Highest time index fit in model
  real response[N]; // The outcome of the data generation process
  real response_by_t[max_t_idx]; //observed average for gaussian process.

  int N_samples; // number of samples drawn from posterior distributions
  real nu[N_samples]; // nu parameter from student t likelihood
  real phi[N_samples]; // standard deviation parameter from student t likelihood
  matrix[N_samples, O] beta; // covariate matrix draws for observations
  matrix[N_samples, K] gamma; // covariate matrix draws for groups
  matrix[N_samples, max_group_idx] group_mu; //posterior draws of the group mu parameter
  real sigma_group_mu[N_samples]; // posterior draws of the hierarchal variance parameter.

  // GP posterior samples
  // Noise component
  matrix[N_samples, max_t_idx] t_noise;
  matrix[N_samples, max_t_idx] gp_z;
  real sigma_noise[N_samples];

  // Auto-regressive component
  real sigma_gp_ar[N_samples];
  real gp_len_ar[N_samples];

  real sigma_gp_trend_short[N_samples];
  real gp_len_trend_short[N_samples];

  real sigma_gp_trend_long[N_samples];
  real gp_len_trend_long[N_samples];
}
transformed data {
  real t1_vec[max_t_idx];
  real t2_vec[T - max_t_idx];
  real jitter = 1e-12;
  for (t in 1:max_t_idx)
    t1_vec[t] = t;
  for (t in 1:T - max_t_idx)
    t2_vec[t] = t + max_t_idx;
}
parameters {
}
model {
}
generated quantities {
  matrix[N_samples, N] y_pred;
  matrix[N_samples, T] gp;

  // generate the gp vector from the posterior samples.
  for(s in 1:N_samples) {
    vector[max_t_idx] f1;
    vector[T - max_t_idx] f2;

    {
       matrix[max_t_idx, max_t_idx] L_C_t1;

       matrix[max_t_idx, max_t_idx] C_t1;
       matrix[max_t_idx, T - max_t_idx] C_t1_t2;
       matrix[T - max_t_idx, max_t_idx] C_t2_t1;
       matrix[T - max_t_idx, T - max_t_idx] C_t2;

       matrix[T - max_t_idx, max_t_idx] A;

       C_t1 = gp_matern32_cov(t1_vec, sigma_gp_ar[s], gp_len_ar[s]);
       C_t1 += gp_exp_quad_cov(t1_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       C_t1 += gp_exp_quad_cov(t1_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

       C_t1_t2 = gp_matern32_cov(t1_vec, t2_vec, sigma_gp_ar[s], gp_len_ar[s]);
       C_t1_t2 += gp_exp_quad_cov(t1_vec, t2_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       C_t1_t2+= gp_exp_quad_cov(t1_vec, t2_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

       C_t2_t1 = gp_matern32_cov(t2_vec, t1_vec, sigma_gp_ar[s], gp_len_ar[s]);
       C_t2_t1 += gp_exp_quad_cov(t2_vec, t1_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       C_t2_t1 += gp_exp_quad_cov(t2_vec, t1_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

       C_t2 = gp_matern32_cov(t2_vec, sigma_gp_ar[s], gp_len_ar[s]);
       C_t2 += gp_exp_quad_cov(t2_vec, sigma_gp_trend_short[s], gp_len_trend_short[s]);
       C_t2 += gp_exp_quad_cov(t2_vec, sigma_gp_trend_long[s], gp_len_trend_long[s]);

       if (size(t2_vec) > 0) {
         A = C_t2_t1 * inverse(C_t1 + diag_matrix(rep_vector(square(sigma_noise[s]), max_t_idx)));
         f2 = multi_normal_rng(A * to_vector(response_by_t), C_t2 - A * C_t1_t2);
       }

       for (t in 1:max_t_idx) {
         C_t1[t, t] += jitter;
       }

       // Reuse samples from noise and multivariate normal for f1.
       L_C_t1 = cholesky_decompose(C_t1);
       f1 = L_C_t1 * to_vector(gp_z[s]) + to_vector(t_noise[s]);

       gp[s] = to_row_vector(append_row(f1, f2));
    }
  }

  for(n in 1:N) {
    row_vector[N_samples] group_mu_rep;
    row_vector[N_samples] mu;

    if (group_idx[n] > max_group_idx) {
      for (s in 1:N_samples) {
         group_mu_rep[s] = gamma[s] * to_vector(group_data[group_idx[n]]) + sigma_group_mu[s] * student_t_rng(4, 0, 1);
      }
    }
    else {
      group_mu_rep = to_row_vector(group_mu[,group_idx[n]]);
    }

    mu = group_mu_rep + to_row_vector(beta * to_vector(observation_data[n])) + to_row_vector(gp[,t_idx[n]]);

    y_pred[,n] = to_vector(student_t_rng(nu, mu, phi));

  }
}

anon79882417 · July 27, 2020, 10:34pm

Thanks for making this so clear and easy to read. I now see why you’re asking this question.

Your understanding is mostly fine, my intent wasn’t to be condescending. I appreciate your questions and I realize that our documentation could be more clear.

Here are the main things I think you’re missing:

This \textbf{y} would just be the time series of responses.
You have to implement the code for the predictive equations.
Start simple.

First, I’m providing a simple example on simulated data below. Hopefully understanding this will enable us to see how to incorporate this method into a larger model. This is a simplified version of example 21.2 in BDA3.

Here’s the function I’m simulating (r-code below, I’ve mean centered and standardized in some areas, but you can re-scale after fitting the model):

y_t(x) = f_1(x) + f_2(x) + \epsilon_t

For:

f_1(x) = 2 x (a simple linear function)
f_2(x) = sin(\frac{x}{2\pi}) (a sine wave)

Which looks like this:

And I’m going to try to recover f_1 and f_2. So I use the following model:

f_1(x) \sim GP(0, k_1), k_1(x, x') = \sigma_1^2exp(\frac{d(x,x')^2}{l_1^2})
f_2(x) \sim GP(0, k_2), k_2(x, x') = \sigma_2^2 (x, x')

For (x, x') an inner product, and d(x, x') the l-2 norm/euclidean distance.

A caveat is, for combining kernels, you have to specify what trend you’d like to make predictions for. So if you want to make predictions of the time series of just the observed linear trend, you’d need to implement the posterior predictive accordingly. So we’ve constructed the following covariance matrix:

k_{total} = k_1 + k_2

If we want to make predictions for k_{1,(\tilde{x}, x)}, the linear function, we can use the equations you’ve found in GPML, but for the cross covariance matrix (the left most one for the posterior predictive mean), you use the covariance structure you’d like to make predictions for. So, for the linear kernel:

\tilde{f}_1 = k_{1,(\tilde{x}, x)} (k_{total} + I \sigma_n^2)^{-1}\textbf{y}

Where k_1 is a cross-covariance matrix computed from the linear kernel, using data we’ve trained our model on and data we’d like to make predictions for (I just did seq(0,100, length.out = 1000) in R). The Stan code for the posterior predictive mean of a GP, which you need to implement in the functions block, being:

  vector gp_pred_dot_prod_rng(vector[] x2,
                              vector y, vector[] x1,
                              real m, real l,
                              real sig,
                              real sigma) {
    int N1 = rows(y);
    int N2 = size(x2);
    vector[N2] f;
    {
      matrix[N1, N1] K = add_diag(gp_exp_quad_cov(x1, m, l) +
                                  gp_dot_prod_cov(x1, sig), sigma);
      matrix[N1, N1] L_K = cholesky_decompose(K);
      vector[N1] L_K_div_y = mdivide_left_tri_low(L_K, y);
      vector[N1] K_div_y = mdivide_right_tri_low(L_K_div_y', L_K)';
      matrix[N1, N2] k_x1_x2 = gp_dot_prod_cov(x1, x2, sig);
      f = (k_x1_x2' * K_div_y);
    }
    return f;
  }

And plotting the estimated \tilde{f}_1 against the simulated f_1:

And then, analogously, for the squared-exponential:

\tilde{f}_2 = k_{2,(\tilde{x}, x)} (k_{total} + I_N \sigma^2)^{-1}\textbf{y}

And finally, the additive model:

\tilde{f}_{total} = k_{total,(\tilde{x}, x)} (k_{total} + I_N \sigma^2)^{-1}\textbf{y}

It’s not perfect, but I hope that it gets the point across. There’s lots of design choices the modeler needs to make. For example, I should have used the periodic covariance function (gp_periodic_cov) instead of the squared exponential, and fixed the period if this information was known. I could have modeled the noise more explicitly for each function. I should have put more effort into thinking about priors, I just used normal(0, 1) to be quick.

I’ve attached all the code (R and Stan code). I understand that I’ve used my own notation and language, and this doesn’t at all unify with the notation in the Stan-Con case study you’ve been using, so I’m happy if you have any questions, and I’ll return to discuss how to do this with the Stan-con example at a later time.

gp_discourse_sim.R (1.9 KB)
linear_exp_quad_gp.stan (3.1 KB)

matthew · July 27, 2020, 10:47pm

Thank you very much for the detailed example with the simulated data. The example makes sense and luckily I believe matches the code I’ve shared above.

However, from the data that I have, there are multiple observations for each time period and I wish to make a prediction of a single observation. With that in mind, am I forming the time series \textbf{y} correctly by taking the mean of the response grouped by each historical time index?

mike-lawrence · July 29, 2020, 1:15pm

If you are modelling the replicates per timepoint as normal deviations from a mean function (the result of computing f1(x)+f2(x) in @anon79882417’s example), then the best prediction for an out-of-sample point is going to be the prediction from the mean function.

matthew · July 29, 2020, 1:31pm

Hi Mike,

Does that still apply when used as I have done in the following model?

Particular emphasis on each time index having multiple observations associated to it, like in the table here:

matthew:

observation_number group_index time_index unit_level_covariate_1 unit_level_covariate_2 group_level_covariate_1 response

1 1 1 -0.1 0.2 0.8 y1y_1

2 1 1 0.4 -0.6 0.8 y2y_2

3 2 1 0.34 -0.45 0.5 y3y_3

4 2 1 -0.25 0.3 0.5 y4y_4

5 2 1 0.24 5 0.5 y5y_5

6 3 1 -0.64 5 0.2 y6y_6

7 1 2 0.2 0.2 0.8 y7y_7

8 1 2 0.6 0.35 0.8 y8y_8

9 1 2 0.8 0.54 0.8 y9y_9

10 2 2 0.9 0.6 0.5 y10y_{10}

11 2 2 0.2 0.5 0.5 y11y_{11}

12 4 2 0.1 0.5 0.4 y12y_{12}

13 4 2 0.6 -0.3 0.4 y13y_{13}

14 5 2 0.7 0.9 0.1 y14y_{14}

mike-lawrence · July 29, 2020, 1:35pm

If you are not modelling an interaction between the covariate and the GP, then yes, my statement applies. Take a look at the code here for a walk-through of GP regression with replicated observations.

matthew · July 29, 2020, 1:38pm

So, what do I use for the observed vector \textbf{y} for out-of-sample predictions? The historical weekly average values, which are at a higher level of aggregation than the observations in the likelihood?

That is the “observed” vector of the latent GP.

mike-lawrence · July 29, 2020, 3:51pm

Sorry, I chimed in without thoroughly digesting the full thread. when you say out-of-sample, do you mean new timepoints that are outside the domain of the original data, new timepoints that are in the domain of the original data but fall between the sampled-timepoints in the original data, or simply new predictions for the timepoints from the original data?

matthew · July 29, 2020, 3:55pm

I mean the former, so new timepoints that are outside the domain of the original data.

mike-lawrence · July 29, 2020, 4:13pm

Ah, darn, then I might have to retract and bow-out; I have no experience with prediction outside the domain of the original data. Indeed, by looking at everything you’ve done here I suspect you might be more expert than I am in GP stuff in general too!

I will say that so far as I understand, the fitting code from your prior post yields samples for the parameter gp at each timepoint, reflecting the latent mean GP function, and I think it is these values that you need to use as the “observed” values in your subsequent posterior prediction run (rather than attempting to compute an average on the actual observed data per timepoint).

matthew · July 29, 2020, 4:14pm

That makes perfect sense! I’ll give that a shot. Thank you.

Topic		Replies	Views
Generate replicated data for out-of-sample data for gaussian process Modeling gaussian-process	4	760	July 29, 2020
Modeling time dynamics with GP Modeling specification , gaussian-process	12	650	April 29, 2020
GP Docs - Predictive Inference for Gaussian Processes Proposals Developers gaussian-process , docu	6	849	February 5, 2024
Gaussian Process Regression with Student likelihood General gaussian-process	7	1018	February 8, 2019
Time series and Gaussian process Modeling gaussian-process	62	4556	December 20, 2019

observation_number	group_index	time_index	unit_level_covariate_1	unit_level_covariate_2	group_level_covariate_1	response
1	1	1	-0.1	0.2	0.8	y1y_1
2	1	1	0.4	-0.6	0.8	y2y_2
3	2	1	0.34	-0.45	0.5	y3y_3
4	2	1	-0.25	0.3	0.5	y4y_4
5	2	1	0.24	5	0.5	y5y_5
6	3	1	-0.64	5	0.2	y6y_6
7	1	2	0.2	0.2	0.8	y7y_7
8	1	2	0.6	0.35	0.8	y8y_8
9	1	2	0.8	0.54	0.8	y9y_9
10	2	2	0.9	0.6	0.5	y10y_{10}
11	2	2	0.2	0.5	0.5	y11y_{11}
12	4	2	0.1	0.5	0.4	y12y_{12}
13	4	2	0.6	-0.3	0.4	y13y_{13}
14	5	2	0.7	0.9	0.1	y14y_{14}

Gaussian Process out-of-sample predictive distribution

Related topics