Non-centred parametrization of covariance matrix causes many divergent transitions and slows down sampling

I am fitting a model with a large covariance matrix (N_{obs} \times N_{obs}) combining two Gaussian processes and a varying intercept for a grouping variable. With the centred parametrization, the model runs well on subsets of the data, producing only a handful of divergent transitions at the beginning due to a non-positive-definite covariance matrix. The centred model looks as follows:

functions{
  // Ornstein-Uhlenbeck + exponentiated quadratic covariance function
  matrix cov_GP( matrix P, matrix S, matrix C, real eta_p, real rho_p, real eta_s, real rho_s, real sigma_c, real sigma_n ) {
    real sq_eta_p = square( eta_p );
    real sq_eta_s = square( eta_s );
    real sq_sigma_c = square( sigma_c );
    real sq_sigma_n = square( sigma_n );
    int N = dims( P )[1];
    matrix[N, N] K;
    for (i in 1:(N - 1)) {
      K[i, i] = sq_eta_p + sq_eta_s + sq_sigma_c + sq_sigma_n;
      for (j in (i + 1):N) {
        K[i, j] = sq_eta_p * exp( -P[i, j] / rho_p ) + // phylogenetic Ornstein-Uhlenbeck covariance function
                  sq_eta_s * exp( -0.5 * square( S[i, j] / rho_s ) ) + // spatial exponentiated quadratic covariance function
                  sq_sigma_c * C[i, j]; // collector ID varying intercept, C[i, j] is 1 when both observations were collected by the same person, 0 otherwise
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_eta_p + sq_eta_s + sq_sigma_c + sq_sigma_n;
    return K;
  }
}

data {
  int<lower=1> N;
  vector[N] y_var;
  vector[N] x_var;
  matrix[N, N] P_mat;
  matrix[N, N] S_mat;
  matrix[N, N] C_mat;
}

parameters {
  real a;
  real b;
  real<lower=0> eta_p;
  real<lower=0> rho_p;
  real<lower=0> eta_s;
  real<lower=0> rho_s;
  real<lower=0> sigma_c;
  real<lower=0> sigma_n;
}

model {
  matrix[N, N] K = cov_GP( P_mat, S_mat, C_mat, eta_p, rho_p, eta_s, rho_s, sigma_c, sigma_n );
  
  vector[N] mu;
  
  a ~ normal( 0 , 1 );
  b ~ normal( 0 , 0.5 );
  eta_p ~ normal( 0 , 1 );
  rho_p ~ inv_gamma( 1.5 , 0.057 );
  eta_s ~ normal( 0 , 1 );
  rho_s ~ inv_gamma( 1.5 , 0.057 );
  sigma_c ~ normal( 0 , 1 );
  sigma_n ~ normal( 0 , 1 );
  
  mu =  a + b * x_var;
  
  y_var ~ multi_normal( mu , K );
}

But sampling takes a lot of time as the number of observations increases, so I have tried a non-centred parametrization, which should be more efficient according to the documentation (it reduces dependencies between parameters and avoids inverting the covariance matrix). The non-centred model looks as follows:

functions{
  // Ornstein-Uhlenbeck + exponentiated quadratic covariance function
  matrix cov_GP( matrix P, matrix S, matrix C, real eta_p, real rho_p, real eta_s, real rho_s, real sigma_c, real jitter ) {
    real sq_eta_p = square( eta_p );
    real sq_eta_s = square( eta_s );
    real sq_sigma_c = square( sigma_c );
    int N = dims( P )[1];
    matrix[N, N] K;
    for (i in 1:(N - 1)) {
      K[i, i] = sq_eta_p + sq_eta_s + sq_sigma_c + jitter;
      for (j in (i + 1):N) {
        K[i, j] = sq_eta_p * exp( -P[i, j] / rho_p ) + // phylogenetic Ornstein-Uhlenbeck covariance function
                  sq_eta_s * exp( -0.5 * square( S[i, j] / rho_s ) ) + // spatial exponentiated quadratic covariance function
                  sq_sigma_c * C[i, j]; // collector ID varying intercept, C[i, j] is 1 when both observations were collected by the same person, 0 otherwise
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_eta_p + sq_eta_s + sq_sigma_c + jitter;
    return K;
  }
}

data {
  int<lower=1> N;
  vector[N] y_var;
  vector[N] x_var;
  matrix[N, N] P_mat;
  matrix[N, N] S_mat;
  matrix[N, N] C_mat;
}

parameters {
  real a;
  real b;
  real<lower=0> eta_p;
  real<lower=0> rho_p;
  real<lower=0> eta_s;
  real<lower=0> rho_s;
  real<lower=0> sigma_c;
  real<lower=0> sigma_n;
  vector[N] z;
}

model {
  matrix[N, N] K = cov_GP( P_mat, S_mat, C_mat, eta_p, rho_p, eta_s, rho_s, sigma_c, 1e-3 );
  matrix[N, N] L = cholesky_decompose( K );
  
  vector[N] mu;
  
  a ~ normal( 0 , 1 );
  b ~ normal( 0 , 0.5 );
  eta_p ~ normal( 0 , 1 );
  rho_p ~ inv_gamma( 1.5 , 0.057 );
  eta_s ~ normal( 0 , 1 );
  rho_s ~ inv_gamma( 1.5 , 0.057 );
  sigma_c ~ normal( 0 , 1 );
  sigma_n ~ normal( 0 , 1 );
  z ~ normal( 0 , 1 );
  
  mu = a + b * x_var + L * z;
  
  y_var ~ normal( mu , sigma_n );
}

However, the non-centred parametrization dramatically increases the number of divergent transitions due to a non-positive-definite covariance matrix. With jitter below 0.001 the model terminates unexpectedly, and with jitter of 0.001 about 99% of transitions are divergent and the model runs much longer than the centred parametrization.

What could be the problem here? In most of the tutorials I have seen, a jitter between 1e-9 and 1e-12 is used. Why does only a relatively large jitter work in my case, and why do even those high values produce tons of divergent transitions? Is my non-centred model somehow misspecified?

I would also greatly appreciate any other suggestions on how to make the model as efficient as possible. Thank you.

If I recall correctly, the amount of jitter required depends on the scale of the other parameters (it has to be large enough relative to the diagonal of K to matter numerically).

Otherwise I’m not seeing anything obviously wrong in your implementation, but I do have a few performance suggestions:

Since GP covariances can be summed, and since your spatial component is a standard kernel, you might compute that one with cov_exp_quad, which in my experience has internal optimizations that make it faster than building it by hand. Also, summing the three covariance matrices in a single vectorized operation would probably be faster, again thanks to low-level optimizations, than the looped sum inside your function.
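To illustrate, here is a minimal sketch of the cov_exp_quad route, assuming (hypothetically) that the raw spatial coordinates were available as a `coords` array rather than only a distance matrix:

data {
  // hypothetical: per-observation 2D coordinates instead of a distance matrix
  vector[2] coords[N] ;
}
...
model {
  // exponentiated quadratic kernel computed by the built-in
  matrix[N, N] cov_S = cov_exp_quad( coords, eta_s, rho_s ) ;
  ...
}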

For the covariance matrix associated with C & sigma_c, I feel like there’s an indexing way to create it rather than a multiplication. Oh! Yes, in R, do:

which_C = which(as.vector(C)==1) # vector of ints indicating which column-major-flattened elements of C are 1; pass it (and its length n_C) to Stan as data

Then in Stan:

data{
    ...
    int n_C ; // number of 1s in C
    int which_C[n_C] ; // indices computed in R above
}
transformed data{
    int square_N = N * N ; // square() returns a real, so compute the int product directly
    vector[square_N] zeroes = zeros_vector(square_N) ;
}
...
model{
   vector[square_N] C_flat = zeroes ;
   C_flat[which_C] += square(sigma_c) ;
   matrix[N, N] C_cov = to_matrix(C_flat, N, N) ; // to_matrix needs explicit dimensions here
...
}

Finally, if you’re looking to eke out some final performance efficiencies, don’t declare the variables K, L and mu, but instead nest all the computation in the final call to normal(). Apparently declaring new variables in the model/TP blocks slows down autodiff. If you want those variables in the posterior, recompute them in the GQ block (or wait until sampling completes and run a standalone GQ block).
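For instance, a sketch of what that nesting might look like with your non-centred model (same names as above):

model {
  ...
  z ~ normal( 0 , 1 ) ;
  // everything inlined: no K, L or mu declared
  y_var ~ normal(
    a + b * x_var
      + cholesky_decompose( cov_GP( P_mat, S_mat, C_mat, eta_p, rho_p, eta_s, rho_s, sigma_c, 1e-3 ) ) * z ,
    sigma_n
  ) ;
}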

(Not sure how to do the C_cov-by-index trick without declaring, though. I wonder if the “don’t declare new variables” rule applies to variables declared inside user-defined functions too? If not, you could put the trick inside a function.)
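Something like this hypothetical wrapper, assuming the rule indeed doesn’t apply inside functions:

functions{
  // builds the collector-ID covariance from the flattened indices
  matrix cov_C_from_index( int N, int[] which_C, real sigma_c ) {
    vector[N * N] C_flat = zeros_vector(N * N) ;
    C_flat[which_C] += square(sigma_c) ;
    return to_matrix(C_flat, N, N) ;
  }
}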

Thank you very much for your suggestions.

Is this what you mean by summing the three covariance matrices?

functions{
  matrix cov_GP( int N, matrix P, matrix S, int[] which_C, real eta_p, real rho_p, real eta_s, real rho_s, real sigma_c, real sigma_n ) {
    real sq_eta_p = square( eta_p ) ;
    real sq_eta_s = square( eta_s ) ;
    real sq_sigma_c = square( sigma_c ) ;
    real sq_sigma_n = square( sigma_n ) ;

    // cov_P (initialized to zero so the diagonal is defined)
    matrix[N, N] cov_P = rep_matrix( 0, N, N ) ;
    for (i in 1:(N - 1)) {
      for (j in (i + 1):N) {
        cov_P[i, j] = sq_eta_p * exp( -P[i, j] / rho_p ) ;
        cov_P[j, i] = cov_P[i, j] ;
      }
    }

    // cov_S (initialized to zero so the diagonal is defined)
    matrix[N, N] cov_S = rep_matrix( 0, N, N ) ;
    for (i in 1:(N - 1)) {
      for (j in (i + 1):N) {
        cov_S[i, j] = sq_eta_s * exp( -0.5 * square( S[i, j] / rho_s ) ) ;
        cov_S[j, i] = cov_S[i, j] ;
      }
    }

    // cov_C (which_C includes the diagonal indices, contributing sq_sigma_c there)
    int square_N = N * N ;
    vector[square_N] C_flat = zeros_vector(square_N) ;
    C_flat[which_C] += sq_sigma_c ;
    matrix[N, N] cov_C = to_matrix(C_flat, N, N) ;

    // final covariance matrix
    matrix[N, N] K = cov_P + cov_S + cov_C ;
    K = add_diag( K, sq_eta_p + sq_eta_s + sq_sigma_n ) ;
    return K ;
  }
}

Doesn’t it take three times more memory to build three separate covariance matrices?

But as far as I know, cov_exp_quad takes a variable (a vector of length N) and calculates the differences between the observations. But I have a matrix (P) of phylogenetic distances between (the species belonging to) the observations, calculated from a phylogenetic tree. I guess there is no way to send it to cov_exp_quad, or is there?

Good point, though unless you hit a RAM limit with my suggestion, I don’t think off the bat that it’ll be slower, and I suspect it will be faster.

Correct, so cov_exp_quad will be for the S covariance only.

OK, I will definitely give it a try, thanks!

Sorry, my mistake. But it applies to the spatial distances too, because I have a matrix S containing spatial distances (calculated in GIS based on GPS coords) between observations, and I do not think I can transform it into a variable with a single datum for each observation to send to cov_exp_quad. The observations are distributed on the Earth’s surface, so there is probably no way to linearize this variable.

I somewhat forgot about my main question. The centred model estimated the (medians of the) parameters as follows: eta_p = 0.57, rho_p = 0.13, eta_s = 0.24, rho_s = 0.03, sigma_c = 0.37, and sigma_n = 0.63. Is it then possible that 0.001 is still too small a jitter?

Hm, no, it actually looks like it’d be too big. I suggest trying a series of sub-models, each with just one GP, to see if you can discern which might be causing issues.
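For instance, a minimal sketch of a phylogeny-only sub-model (keeping the non-centred form, the jitter, and the variable names from the models above):

model {
  // phylogenetic OU kernel only, with the same 1e-3 jitter on the diagonal
  matrix[N, N] K ;
  for (i in 1:(N - 1)) {
    K[i, i] = square(eta_p) + 1e-3 ;
    for (j in (i + 1):N) {
      K[i, j] = square(eta_p) * exp( -P_mat[i, j] / rho_p ) ;
      K[j, i] = K[i, j] ;
    }
  }
  K[N, N] = square(eta_p) + 1e-3 ;
  matrix[N, N] L = cholesky_decompose( K ) ;

  a ~ normal( 0 , 1 ) ;
  b ~ normal( 0 , 0.5 ) ;
  eta_p ~ normal( 0 , 1 ) ;
  rho_p ~ inv_gamma( 1.5 , 0.057 ) ;
  sigma_n ~ normal( 0 , 1 ) ;
  z ~ normal( 0 , 1 ) ;

  y_var ~ normal( a + b * x_var + L * z , sigma_n ) ;
}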

I tested it with two GP covariance matrices (i.e. excluding cov_C) and, unfortunately, preparing two matrices followed by their summation was the slower method (783 s vs. 620 s). Perhaps the need to traverse two matrices offset the potential time savings.