I am building a phylogenetic model based on multiple observations per species. In such cases, standard approach stemming from quantitative genetics is to fit a multi-level model with separate varying-intercept (‘random’) effects for phylogeny and species identity estimating phylogenetic and non-heritable species-specific components, respectively (e.g. chapter 7 in Garamszegi’s Modern Phylogenetic Comparative Methods, 2014). In brms notation, the model would be:
y_var ~ x_var + (1 | gr(species_id)) + (1 | gr(phylo), cov = A_mat)
where A_mat is is phylogenetic variance-covariance matrix based on Brownian motion model and species_id and phylo are vectors, both containing species names.
Instead of Brownian motion or Pagel’s lambda model, I would like to fit Ornstein-Uhlenbeck model as a Gaussian process with a following kernel function:
K_{i,j} = \eta^2 exp(-\frac{D_{i,j}}{\rho}) + \delta_{i,j} \sigma_{k}^2
where D_{i,j} is patristic (phylogenetic) distance between species i and species j provided in a separate matrix. This is not possible in brms, so I am trying to fit the model directly in Stan.
I have several questions:
- Is it a good idea to fit species_id as a varying-intercept effect alongside with the Gaussian OU process for phylogeny similarly to the first model above? My attempt in Stan looks as follows:
functions{
matrix cov_GP_OU(matrix x, real eta, real rho, real sigma_k) {
int N = dims(x)[1];
matrix[N, N] K;
for (i in 1:(N - 1)) {
K[i, i] = square( eta) + square( sigma_k );
for (j in (i + 1):N) {
K[i, j] = square( eta ) * exp( -x[i, j] / rho );
K[j, i] = K[i,j];
}
}
K[N, N] = square( eta) + square( sigma_k );
return K;
}
}
data {
int<lower=1> N;
int<lower=1> N_spp;
vector[N] y_var;
vector[N] x_var;
int<lower=1> spec_id[N];
matrix[N_spp, N_spp] Amat;
}
parameters {
vector[N_spp] z_S;
vector[N_spp] z_a_spec;
real a_bar;
real b1;
real<lower=0> sigma_a_spec;
real<lower=0> eta;
real<lower=0> rho;
real<lower=0> sigma_k;
real<lower=0> res_sigma;
}
model {
matrix[N_spp,N_spp] SIGMA = cov_GP_OU( Amat , eta , rho , sigma_k );
matrix[N_spp,N_spp] L_SIGMA = cholesky_decompose( SIGMA );
vector[N_spp] a_phylo = L_SIGMA * z_S;
vector[N] mu;
res_sigma ~ exponential( 1 );
rho ~ inv_gamma(4.6, 22.1);
eta ~ normal( 1 , 1 );
sigma_k ~ exponential( 1 );
sigma_a_spec ~ exponential( 1 );
b1 ~ normal( 0 , 2 );
a_bar ~ normal( 12 , 5 );
z_a_spec ~ normal( 0 , 1 );
z_S ~ normal( 0 , 1 );
for ( i in 1:N ) {
mu[i] = a_bar + z_a_spec[spec_id[i]] * sigma_a_spec + a_phylo[spec_id[i]] + b1 * x_var[i];
}
y_var ~ normal( mu , res_sigma );
}
-
Is it possible to partition variance explained by Gaussian process and by species_id? Which parameter gives the variance explained by GP? Is it \sigma_{k}^2? Or is it \eta^2 + \sigma_{k}^2, which gives the values in the diagonal of the K variance-covariance matrix (i.e. where phylogenetic distance equals zero)? What is actually the interpretation of \eta in the diagonal of the K matrix?
-
The same questions relate to the extended model, in which I plan to add another GP with spatial distance matrix and quadratic exponential kernel. The distances are calculated between the centres of quadrats of a spatial grid. Could grid_id be added alongside with this GP, similarly to the simultaneous fit of species_id and the phylogenetic OU GP presented above?
Any help or comments would be greatly appreciated.