Gaussian process and QR reparametrization

Dear all

I am new to Stan and am trying to fit the following model for georeferenced data: Y_i \sim \text{Bin}(n_i, p_i) with link \text{logit}(p_i) = \alpha + x_i\beta + w_i, where (w_1, \ldots, w_N) \sim \text{MVN}(0, \Sigma) with \Sigma_{ij} = \sigma_s^2 \exp(-\phi \, d_{ij}) for i \neq j and \Sigma_{ii} = \sigma_{ns}^2 + \sigma_s^2. This would give me the following model:

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  int<lower = 0> y[N];
  int<lower = 0> n[N];
  matrix[N, N] dist;
}

parameters {
  real alpha;
  vector[K] beta;
  vector[N] w;
  real<lower = 0>  sigma_ns2;
  real<lower = 0>  sigma_s2;
  real<lower = 0>  phi;
}

transformed parameters{
 vector[N] mu = rep_vector(0, N);
}

model {
  matrix[N,N] Sigma;
  matrix[N,N] L;
  for(i in 1:(N-1)){
   for(j in (i+1):N){
     Sigma[i,j] = sigma_s2*exp((-1)*phi*dist[i,j]);
     Sigma[j,i] = Sigma[i,j];
   }
  }
  for(i in 1:N) Sigma[i,i] = sigma_ns2 + sigma_s2;
  L = cholesky_decompose(Sigma);
  
  alpha ~ normal(0, 100);
  for(i in 1:K){
    beta[i] ~ normal(0, 100);
  }

  sigma_ns2 ~ inv_gamma(0.01, 0.01);
  sigma_s2 ~ inv_gamma(0.01, 0.01);
  phi ~ uniform(0.005, 0.1);
  
  w ~ multi_normal_cholesky(mu, L);
  y ~ binomial_logit(n, alpha + x * beta + w);
}

Now the Stan User’s Guide (1.2 The QR Reparameterization | Stan User’s Guide) says one should use a QR factorization for regressions. Does this also apply here? I tried to modify my model to:

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  int<lower = 0> y[N];
  int<lower = 0> n[N];
  matrix[N, N] dist;
}

parameters {
  real alpha;
  vector[K] theta;
  vector[N] w;
  real<lower = 0>  sigma_ns2;
  real<lower = 0>  sigma_s2;
  real<lower = 0>  phi;
}

transformed parameters{
 vector[N] mu = rep_vector(0, N);
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // thin and scale the QR decomposition
  Q_ast = qr_Q(x)[, 1:K] * sqrt(N - 1);
  R_ast = qr_R(x)[1:K, ] / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}

model {
  matrix[N,N] Sigma; // maybe changed to cov_matrix[N]
  matrix[N,N] L; //maybe changed to cholesky_factor_cov[N]
  for(i in 1:(N-1)){
   for(j in (i+1):N){
     Sigma[i,j] = sigma_s2*exp((-1)*phi*dist[i,j]);
     Sigma[j,i] = Sigma[i,j];
   }
  }
  for(i in 1:N) Sigma[i,i] = sigma_ns2 + sigma_s2;
  L = cholesky_decompose(Sigma);
  
  alpha ~ normal(0, 100);
  for(i in 1:K){
    theta[i] ~ normal(0, 100);
  }

  sigma_ns2 ~ inv_gamma(0.01, 0.01);
  sigma_s2 ~ inv_gamma(0.01, 0.01);
  phi ~ uniform(0.005, 0.1);
  
  w ~ multi_normal_cholesky(mu, L);
  y ~ binomial_logit(n, alpha + Q_ast * theta + w);

}

generated quantities {
  vector[K] beta;
  beta = R_ast_inverse * theta; // coefficients on x
}

When I run both models with a low number of iterations, the QR model takes longer to run. The effective sample sizes are hard to compare because I have only run the models for a few iterations so far, since they take rather long to run anyway. My questions are the following:

  1. Is it always appropriate to use the QR model as it says in the documentation?
  2. Is there a generic covariance function for my model? (which would probably speed up the estimation)
  3. Are there other obvious optimization steps that I miss?
  4. Does it improve anything if I use cov_matrix instead of matrix?

Thank you in advance for any help. Also, if you know of examples of exactly this case, I would be grateful.
Dan

The more the covariates are intercorrelated, the more it makes sense. There’s a very slight compute cost to computing beta in the generated quantities section, so the QR trick might be skipped in cases where the covariates are explicitly manipulated in experimental settings (vs. observational settings) to have zero correlations.
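For reference, the identity behind the trick (just restating the User’s Guide construction): with the thin QR decomposition X = QR, define Q^\ast = Q\sqrt{N-1} and R^\ast = R/\sqrt{N-1}, so that X\beta = Q^\ast R^\ast \beta = Q^\ast\theta with \theta = R^\ast\beta, and hence \beta = (R^\ast)^{-1}\theta, which is exactly what the generated quantities block recovers. Because the columns of Q^\ast are orthogonal, \theta tends to be much less correlated in the posterior than \beta whenever the columns of X are correlated.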

You’re currently pre-computing the distance matrix, but last I checked (n.b. a pretty long time ago) it’s just as fast to use the built-in cov_exp_quad.
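For illustration, here’s a rough sketch of what that could look like, showing only the GP piece (priors and the binomial likelihood omitted) and assuming the raw 2-D site coordinates are passed as data instead of dist. Note that cov_exp_quad implements the squared-exponential kernel sigma_s2 * exp(-d_ij^2 / (2 * rho^2)), not the exponential kernel sigma_s2 * exp(-phi * d_ij) in your model, so the length-scale parameter has a different meaning:

data {
  int<lower=1> N;
  vector[2] coords[N];      // site coordinates in place of the distance matrix
}
parameters {
  real<lower=0> sigma_s2;   // spatial (partial-sill) variance
  real<lower=0> sigma_ns2;  // nugget variance
  real<lower=0> rho;        // length-scale of the squared-exponential kernel
  vector[N] w_eta;          // standard-normal innovations for the non-centered GP
}
model {
  matrix[N, N] Sigma;
  vector[N] w;
  // cov_exp_quad's second argument is a standard deviation, hence sqrt(sigma_s2)
  Sigma = cov_exp_quad(coords, sqrt(sigma_s2), rho);
  for (i in 1:N)
    Sigma[i, i] = Sigma[i, i] + sigma_ns2;  // add the nugget to the diagonal
  w = cholesky_decompose(Sigma) * w_eta;    // w then enters the linear predictor as before
  w_eta ~ std_normal();
}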

Since all your matrices are quantities derived from parameters, no. If you had a matrix parameter, using cov_matrix would tell the sampler to only sample positive-definite matrices.
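A minimal sketch of the distinction (Omega here is a hypothetical parameter, just to show where the constrained type matters):

data {
  int<lower=1> N;
  vector[N] z;
}
parameters {
  cov_matrix[N] Omega;  // parameter: the cov_matrix constraint makes the sampler work
                        // on an unconstrained scale and map back to a symmetric
                        // positive-definite matrix
}
model {
  matrix[N, N] L;                  // local quantity derived from a parameter: a plain
  L = cholesky_decompose(Omega);   // matrix is fine, no constraint transform is applied
  z ~ multi_normal_cholesky(rep_vector(0, N), L);
}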

Yes, here’s your second model with a collection of improvements (lines marked with //)

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  int<lower = 0> y[N];
  int<lower = 0> n[N];
  matrix[N, N] dist;
}
transformed data {
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // thin and scale the QR decomposition
  Q_ast = qr_Q(x)[, 1:K] * sqrt(N - 1);
  R_ast = qr_R(x)[1:K, ] / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}

parameters {
  real alpha;
  vector[K] theta;
  vector[N] w_eta; // standard-normal innovations for the non-centered GP below
  real<lower = 0>  sigma_ns2; // I'm not used to seeing this term as an estimated quantity in GPs; usually it's a data input and set as very small simply to ensure positive definiteness of the covariance. See: https://mc-stan.org/docs/2_27/stan-users-guide/fit-gp-section.html
  real<lower = 0>  sigma_s2;
  real<lower = 0.005, upper = 0.1>  phi; // bounds should match the uniform prior below
}


model {
  matrix[N,N] Sigma; 
  for(i in 1:(N-1)){
   for(j in (i+1):N){
     Sigma[i,j] = sigma_s2*exp((-1)*phi*dist[i,j]);
     Sigma[j,i] = Sigma[i,j];
   }
  }
  for(i in 1:N) Sigma[i,i] = sigma_ns2 + sigma_s2;

  
  alpha ~ normal(0, 100);
  theta ~ normal(0, 100); // vectorized

  sigma_ns2 ~ inv_gamma(0.01, 0.01);
  sigma_s2 ~ inv_gamma(0.01, 0.01);
  phi ~ uniform(0.005, 0.1);
  
  w_eta ~ std_normal() ; // usually makes sense to non-center latent GPs

  y ~ binomial_logit(
    n
    , alpha +
      cholesky_decompose(Sigma) * w_eta +
      Q_ast * theta
  ); // Not sure if this is already fastest, or if you should try:
  // y_as_01 ~ bernoulli_logit_glm(
  //   Q_ast
  //   , alpha + cholesky_decompose(Sigma) * w_eta
  //   , theta
  // );
  // or maybe:
  // y_as_01 ~ bernoulli_logit_glm(
  //   cholesky_decompose(Sigma)
  //   , alpha + Q_ast * theta
  //   , w_eta
  // );
  // (Depending on N and K, one or the other of these alternatives might be faster,
  // but both might still be slower than the binomial_logit above. Note that
  // bernoulli_logit_glm needs the outcome disaggregated to 0/1 trials, hence y_as_01.)
}

generated quantities {
  vector[K] beta;
  beta = R_ast_inverse * theta; // coefficients on x
}

Thank you very much for your fast and thorough response; that helps a lot. I’ll check out your suggestions for speeding up my model estimation.


I looked through your answer again and came across your comment on my sigma_ns2.

This is not supposed to be the \delta which ensures numerical stability. Rather, I meant it to be the so-called nugget, an i.i.d. random effect which should capture measurement error or microscale spatial effects; see Banerjee et al., Hierarchical Modeling and Analysis for Spatial Data (2nd ed.), pp. 26-27. I guess that if I include it, numerical stability of the Cholesky factorization comes for free, without adding a fixed \delta.
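Concretely, the covariance of the latent effects with the nugget included is \text{Cov}(w_i, w_j) = \sigma_s^2 \exp(-\phi \, d_{ij}) + \sigma_{ns}^2 \, 1\{i = j\}, so the diagonal is \sigma_{ns}^2 + \sigma_s^2 as in the code above, and the \sigma_{ns}^2 term ends up playing the same jitter role as \delta as a side effect.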

Very cool. I’ll have to remember to play with using that in my own GPs.

Be sure to report back on your results if you work out which of the three versions of the likelihood samples better/faster.

Just to clarify, the question of whether or not to apply a QR reparameterization to the linear term x * beta is independent of the Gaussian process contribution. In other words, adding the Gaussian process doesn’t change the correlation of the covariates, which is what motivates the QR reparameterization.

That said, adding the Gaussian process contribution can potentially change how correlated the beta parameters end up being in the posterior distribution, which can affect the performance of the QR reparameterization, but that will depend on the intimate details of the observed data.

The nugget arises when convolving a latent Gaussian process with additional normal variation at each observed point. Because the Gaussian process being included in the model here is not directly observed, there’s no measurement variability to account for, but the term could be used to capture local influences that corrupt the relationship between the latent Gaussian process and the regression mean.

Thanks for the clarifications; this does make sense to me.