Covariance matrix phylogenetic models

Hi!

I recently noticed that the vignette for phylogenetic models now advises to use the covariance matrix, whereas an older version of the vignette used the correlation matrix to fit the phylogenetic models. In the specific example used, it doesn’t really matter if we would use the covariance matrix or the correlation matrix, because they are almost identical in this specific tree, as demonstrated if you run the code below. I think this is related to the branch lengths.

# Load example tree
phylo <- read.nexus("https://raw.githubusercontent.com/MPCMEvolution/MPCMArchive/master/data_files/ch11/phylo.nex")

a1 <- ape::vcv(phylo, corr = TRUE) # vcv matrix
a2 <- ape::vcv(phylo, corr = FALSE) # correlation matrix

ggplot() +
  geom_abline(slope = 1, color = "blue") +
  geom_point(aes(a1, a2))

From my experience with other trees (where the branch lengths are higher and the covariance matrix is quite different from the correlation matrix), I tend to get a very low phylogenetic signal when I am using the covariance matrix, whereas I get substantially higher phylogenetic signals when using the correlation matrix.

My question is: it is justifiable to use the correlation matrix or to rescale the tree to maximum branch lengths of one? This seems like a crucial point that may impact conclusions of anyone fitting phylogenetic models, but I don’t quite understand why the differences are so big.

Thanks for any thoughts on this.

Below, a plot showing an example of the differences in phylogenetic signal when fitting a model with the covariance matrix of the unscaled tree (red), the covariance matrix of a scaled tree (blue), and the correlation matrix (yellow).
image

Below the code to reproduce this example.

##### Fish phylogenetics example #####

# Load packages
library(tidyverse)
library(fishtree)
library(rfishbase)
library(janitor)
library(brms)

# Get species list in the family Labridae 
sp <- species_list(Family = "Labridae")

# Get data on maximum size of fishes
length <- species(sp, "Length") %>%
  mutate(species = sp, phylo = gsub(" ","_", sp)) 

# Get phylogenetic tree
tree <- fishtree_phylogeny(sp)

# filter and clean data for missing species in tree
lengths <- length %>%
  dplyr::filter(species %in% gsub("_", " ", tree$tip.label)) %>%
  janitor::clean_names() %>%
  drop_na()

# get covariance matrix
tree_vcv <- ape::vcv(tree, corr = FALSE)

# get scaled covariance matrix
scaled_tree <- tree
scaled_tree$edge.length <- scaled_tree$edge.length / (max(tree$edge.length))
tree_vcv_scaled <- ape::vcv(scaled_tree, corr = FALSE)

# get correlation matrix
tree_cor <- ape::vcv(tree, corr = TRUE)

# fit models
fit1 <- brms::brm(length ~  (1 | gr(phylo, cov = tree_vcv)) , 
                  data = lengths, family = gaussian(),
                  data2 = list(tree_vcv = tree_vcv), 
                  chains = 4, cores = 4, iter = 4000,
                  warmup = 2000, control = list(adapt_delta = 0.99,
                                                max_treedepth = 12))

fit2 <- update(fit1, data2 = list(tree_vcv = tree_vcv_scaled), 
               newdata = lengths)

fit3 <- update(fit1, data2 = list(tree_vcv = tree_cor), 
               newdata = lengths)

# phylogenetic signals
hyp <- "sd_phylo__Intercept^2 / (sd_phylo__Intercept^2 + sigma^2) = 0"
h1 <- hypothesis(fit1, hyp, class = NULL)
h2 <- hypothesis(fit2, hyp, class = NULL)
h3 <- hypothesis(fit3, hyp, class = NULL)

#plot
ggplot() +
  geom_histogram(aes(x = h1$samples$H1, fill = "non_scaled"), 
                 fill = "red", alpha = 0.5) +
  geom_histogram(aes(x = h2$samples$H1, fill = "scaled"), alpha = 0.5) +
  geom_histogram(aes(x = h3$samples$H1, fill = "cor"), alpha = 0.5) +
  theme_bw() +
  scale_fill_manual(name = "", values = c(
                                          non_scaled = "red", 
                                          scaled = "blue", 
                                          cor = "yellow")) +
  labs(x = "Phylogenetic signal")

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
brms_2.13.5

3 Likes

This is really interesting (and a little concerning?). Thanks for sharing it!

I dug back into Chapter 11 of Modern Phylogenetic Comparative Methods, which I believe is one of @paul.buerkner 's sources for the phylogenetic models in brms. In section 11.1.2 the authors discuss the differences between the Phylogenetic GLS and the Phylogenetic GLM. The main difference is the lack of an error term in the pgls model, which is made up for by introducing a phylogenetic signal parameter. PGLM’s obviously contain this error parameter (sigma in brms). They state that:

The model described in Eqs. 11.1 and 11.2 is equivalent to Pagel’s lambda model of phylogenetic signal inference (Freckleton et al. 2002; Housworth et al. 2004), given that the matrix R is a correlation matrix (i.e. diagonal elements are equal to 1, Hansen and Orzack 2005; Hadfield and Nakagawa 2010). Indeed, very much alike the heritability for QG analysis, we can define Lynch’s phylogenetic heritability as lambda = sd_phylo__Intercept^2 / (sd_phylo__Intercept^2 + sigma^2) as a measure of the phylogenetic signal.

I changed the equation in the text to brms code.

So here they are saying that the correlation matrix must be used to correctly estimate the lambda value, not the covariance matrix . So it looks like your fit2 and fit3 results are the correct estimates of your phylogenetic signal.

(Side note: this confuses me a bit more because I remember always using an inverse covariance matrix in MCMCglmm, the program discussed by the author).

But should the correlation matrix be used for all phylogenetic models in brms? Dunno. I ran your models and it looks like the estimates of the species’ means and the sigma values are basically identical across models, which is very encouraging. Really the only difference is the scale of the phylo_sd_intercept. Seeing that the sigma values of the length variable are so high, it kinda makes intuitive sense that the phylo_sd_intercept would also be a large value.

I hope others can chime in!

3 Likes

You want to use correlation matrix here not covariance matrix as you want to estimate the phylogenetic variance. There are many different ways to scale a covariance matrix into a correlation matrix (which may match different model of evolution you are assuming). If you use a correlation matrix and you get a phylogenetic variance, which you want to use to get Pagel’s lamda or phylogenetic heritability. You may want to have a look at our recent work - https://ecoevorxiv.org/su4zv/ - see the section “Including and Testing the Phylogenetic Effect” in Discussion

3 Likes

Thank you very much @itchyshin!

I compared the phylogenetic signal calculated through brms using the correlation matrix with MCMCglmm, and we get similar results indeed. code below.
@paul.buerkner do you concur with this?
image

##### Fish phylogenetics example #####

# Load packages
library(tidyverse)
library(fishtree)
library(rfishbase)
library(janitor)
library(brms)
library(ape)
library(MCMCglmm)


# Get species list in the family Labridae 
sp <- species_list(Family = "Labridae")

# Get data on maximum size of fishes
length <- species(sp, "Length") %>%
  mutate(species = sp, phylo = gsub(" ","_", sp)) 

# Get phylogenetic tree
tree <- fishtree_phylogeny(sp)

# filter and clean data for missing species in tree
lengths <- length %>%
  dplyr::filter(species %in% gsub("_", " ", tree$tip.label)) %>%
  janitor::clean_names() %>%
  drop_na() 

# get correlation matrix
tree_cor <- ape::vcv(tree, corr = TRUE)

fit3 <- brms::brm(length ~  (1 | gr(phylo, cov = tree_vcv)) , 
                  data = lengths, family = gaussian(),
                  data2 = list(tree_vcv = tree_cor), 
                  chains = 4, cores = 4, iter = 4000,
                  warmup = 2000, control = list(adapt_delta = 0.999,
                                                max_treedepth = 20))

# phylogenetic signal
hyp <- "sd_phylo__Intercept^2 / (sd_phylo__Intercept^2 + sigma^2) = 0"

h3 <- hypothesis(fit3, hyp, class = NULL)

##### MCMCglmm #####

# get relatedness matrix 
tree2 <- tree %>% ape::chronoMPL() 
inv.phylo <- MCMCglmm::inverseA(tree2, nodes = "TIPS", scale = TRUE)

# convert data to dataframe (MCMCglmm is not compatible with tibbles as input)
lengths <- as.data.frame(lengths)

fit_glmm <- MCMCglmm(
  fixed = length ~ 1,
  random = ~ phylo,
  family = "gaussian",
  ginverse = list(phylo = inv.phylo$Ainv),
  prior = prior,
  data = lengths,
  nitt = 5000000,
  burnin = 1000,
  thin = 500)

# phylo signal
lambda <- fit_glmm$VCV[,'phylo']/
  (fit_glmm$VCV[,'phylo']+fit_glmm$VCV[,'units'])
summary(lambda)

ggplot() +
  geom_histogram(aes(x = h3$samples$H1, fill = "brms_cor"), alpha = 0.5) +
  geom_histogram(aes(x = lambda, fill = "MCMCglmm"), alpha = 0.3) +
  theme_bw() +
  scale_fill_manual(name = "", values = c(
                                          MCMCglmm = "blue", 
                                          brms_cor = "yellow")) +
  labs(x = "Phylogenetic signal")
1 Like