I am fitting a model with a large covariance matrix ($N_{obs} \times N_{obs}$) that combines two Gaussian processes and a varying intercept for a grouping variable. With the centred parametrization, the model runs well on subsets of the data, producing only a handful of divergent transitions at the beginning due to a non-positive-definite covariance matrix. The centred model looks as follows (I summarise the covariance in equation form right after the code):
functions{
  // Ornstein-Uhlenbeck + exponentiated quadratic covariance function
  matrix cov_GP( matrix P, matrix S, matrix C, real eta_p, real rho_p, real eta_s, real rho_s, real sigma_c, real sigma_n ) {
    real sq_eta_p = square( eta_p );
    real sq_eta_s = square( eta_s );
    real sq_sigma_c = square( sigma_c );
    real sq_sigma_n = square( sigma_n );
    int N = dims( P )[1];
    matrix[N, N] K;
    for (i in 1:(N - 1)) {
      K[i, i] = sq_eta_p + sq_eta_s + sq_sigma_c + sq_sigma_n;
      for (j in (i + 1):N) {
        K[i, j] = sq_eta_p * exp( -P[i, j] / rho_p ) +                 // phylogenetic Ornstein-Uhlenbeck covariance function
                  sq_eta_s * exp( -0.5 * square( S[i, j] / rho_s ) ) + // spatial exponentiated quadratic covariance function
                  sq_sigma_c * C[i, j];                                // collector ID varying intercept, C[i, j] is 1 when both observations were collected by the same person, 0 otherwise
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_eta_p + sq_eta_s + sq_sigma_c + sq_sigma_n;
    return K;
  }
}
data {
  int<lower=1> N;
  vector[N] y_var;
  vector[N] x_var;
  matrix[N, N] P_mat;
  matrix[N, N] S_mat;
  matrix[N, N] C_mat;
}
parameters {
  real a;
  real b;
  real<lower=0> eta_p;
  real<lower=0> rho_p;
  real<lower=0> eta_s;
  real<lower=0> rho_s;
  real<lower=0> sigma_c;
  real<lower=0> sigma_n;
}
model {
  matrix[N, N] K = cov_GP( P_mat, S_mat, C_mat, eta_p, rho_p, eta_s, rho_s, sigma_c, sigma_n );
  vector[N] mu;
  a ~ normal( 0 , 1 );
  b ~ normal( 0 , 0.5 );
  eta_p ~ normal( 0 , 1 );
  rho_p ~ inv_gamma( 1.5 , 0.057 );
  eta_s ~ normal( 0 , 1 );
  rho_s ~ inv_gamma( 1.5 , 0.057 );
  sigma_c ~ normal( 0 , 1 );
  sigma_n ~ normal( 0 , 1 );
  mu = a + b * x_var;
  y_var ~ multi_normal( mu , K );
}
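To summarise, the covariance I intend cov_GP to build is

$$
K_{ij} = \eta_p^2 \exp\!\left(-\frac{P_{ij}}{\rho_p}\right) + \eta_s^2 \exp\!\left(-\frac{S_{ij}^2}{2\rho_s^2}\right) + \sigma_c^2\, C_{ij} + \sigma_n^2\, \delta_{ij},
$$

where $P$ holds phylogenetic distances, $S$ spatial distances, $C$ is the same-collector indicator (implicitly 1 on the diagonal), and $\delta_{ij}$ adds the noise term on the diagonal only.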
But sampling takes a lot of time as the number of observations increases, so I have tried a non-centred parametrization, which should be more efficient according to the documentation (it reduces dependencies between the parameters and avoids inverting the covariance matrix). The non-centred model looks as follows (again with the intended formulation written out after the code):
functions{
  // Ornstein-Uhlenbeck + exponentiated quadratic covariance function
  matrix cov_GP( matrix P, matrix S, matrix C, real eta_p, real rho_p, real eta_s, real rho_s, real sigma_c, real jitter ) {
    real sq_eta_p = square( eta_p );
    real sq_eta_s = square( eta_s );
    real sq_sigma_c = square( sigma_c );
    int N = dims( P )[1];
    matrix[N, N] K;
    for (i in 1:(N - 1)) {
      K[i, i] = sq_eta_p + sq_eta_s + sq_sigma_c + jitter;
      for (j in (i + 1):N) {
        K[i, j] = sq_eta_p * exp( -P[i, j] / rho_p ) +                 // phylogenetic Ornstein-Uhlenbeck covariance function
                  sq_eta_s * exp( -0.5 * square( S[i, j] / rho_s ) ) + // spatial exponentiated quadratic covariance function
                  sq_sigma_c * C[i, j];                                // collector ID varying intercept, C[i, j] is 1 when both observations were collected by the same person, 0 otherwise
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_eta_p + sq_eta_s + sq_sigma_c + jitter;
    return K;
  }
}
data {
  int<lower=1> N;
  vector[N] y_var;
  vector[N] x_var;
  matrix[N, N] P_mat;
  matrix[N, N] S_mat;
  matrix[N, N] C_mat;
}
parameters {
  real a;
  real b;
  real<lower=0> eta_p;
  real<lower=0> rho_p;
  real<lower=0> eta_s;
  real<lower=0> rho_s;
  real<lower=0> sigma_c;
  real<lower=0> sigma_n;
  vector[N] z;
}
model {
  matrix[N, N] K = cov_GP( P_mat, S_mat, C_mat, eta_p, rho_p, eta_s, rho_s, sigma_c, 1e-3 );
  matrix[N, N] L = cholesky_decompose( K );
  vector[N] mu;
  a ~ normal( 0 , 1 );
  b ~ normal( 0 , 0.5 );
  eta_p ~ normal( 0 , 1 );
  rho_p ~ inv_gamma( 1.5 , 0.057 );
  eta_s ~ normal( 0 , 1 );
  rho_s ~ inv_gamma( 1.5 , 0.057 );
  sigma_c ~ normal( 0 , 1 );
  sigma_n ~ normal( 0 , 1 );
  z ~ normal( 0 , 1 );
  mu = a + b * x_var + L * z;
  y_var ~ normal( mu , sigma_n );
}
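In other words, my understanding of the non-centred version is

$$
y \sim \mathcal{N}\!\left(a + b\,x + L z,\ \sigma_n^2 I\right), \qquad z \sim \mathcal{N}(0, I), \qquad L L^\top = K + \text{jitter} \cdot I,
$$

so the latent GP values enter the mean as $Lz$ and the observation noise $\sigma_n$ moves out of $K$ into the likelihood, with only the small jitter left on the diagonal.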
However, the non-centred parametrization drastically increases the number of divergent transitions due to a non-positive-definite covariance matrix. With jitter below 0.001 the model finishes unexpectedly, and with jitter of 0.001 roughly 99% of the transitions are divergent and the model runs much longer than the centred parametrization.
What could be the problem here? In most of the tutorials I have seen, a jitter between 1e-12 and 1e-9 is used. Why does only a relatively large jitter work in my case, and why do even these large values produce so many divergent transitions? Is my non-centred model somehow misspecified?
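(For reference, my understanding of the usual pattern is roughly the sketch below, using Stan's add_diag instead of passing the jitter into cov_GP as I do above; the two should be equivalent:)

  // sketch of the usual jitter pattern: add a tiny constant to the diagonal before factorising
  matrix[N, N] K = cov_GP( P_mat, S_mat, C_mat, eta_p, rho_p, eta_s, rho_s, sigma_c, 0 );
  matrix[N, N] L = cholesky_decompose( add_diag( K, 1e-9 ) ); // tutorial-sized jitter; in my case this fails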
I would also greatly appreciate any other suggestions on how to make the model as efficient as possible. Thank you.