Brms: Phylogenetic model with individual level phylogeny and group and individual level variables?

Hi all,

I have a data set with a response variable H measured for each individual (“Ind”; i.e., a plant), as well as an ultrametric phylogeny with Inds as tips. Each Ind, however, belongs to one of about 40 populations ("Pop"s; which are monophyletic), and many of our predictors of interest are measured at the Pop level (environmental variables).

We also have several predictors measured at the Ind level (plant height, etc.), and to make matters even more complicated, these are missing for a variable number of Inds per Pop (missing at random). Note that the response variable H has no missing data.

Using brms (or other suitable software), I’d like to model H as a function of both Ind- and Pop-level variables while accounting for phylogenetic covariance and without pseudoreplicating the environmental variables.

One option I’ve tried is to take the average for each Pop but incorporate error using se() and me() for the response and predictor variables, respectively (the latter only for the predictors measured at the Ind level), using a phylogeny collapsed to the Pop level. However, this seems like it might be a waste of a good phylogeny (as there might be phylogenetic structure within and not just between Pops), but I also get the feeling that measurement “error” could be more elegantly incorporated into the model, especially given the variable sampling of Ind-level variables within each Pop.

I think this is similar to the example given in the brms phylogenetic models vignette under “A Phylogenetic Model with Repeated Measurements”, except that we have a phylogeny at the Ind level rather than the species level. A bit like having subspecies…

Here’s a snippet of the data.
example_data_mc_stan.csv (11.8 KB)

Does this formulation seem reasonable?

Ci<-vcv.phylo(Ind_tree,corr=TRUE)
Cp<-vcv.phylo(Pop_tree,corr=TRUE)
bf(H ~ mi(IndVar) + PopVar + (1 | gr(Ind,cov=Ci) ) + (1 | gr(Pop,cov=Cp) ) ) +
   bf(IndVar | mi() ~ 1) +
   set_rescor(F)

Thoughts, opinions, questions all welcome!

Cheers,
S

Hi,

If your data are in long format where each individual is a row, it should be straightforward to have a column for each covariate where the values of covariates that are shared across individuals (species-level) and species (population-level) are simply repeated. You can then still had phylogenetic information in the form of a Gaussian process or something. As for the missing values, I have yet to deal with this in Stan, but I think brms has capacity to do multiple fits with missing values imputed by the mice package, or perhaps even better, you could specify submodels for the missing data in Stan. For instance, you could assume your predictors come from a normal distribution, estimate the mu and sigma, and then generate random draws from that distribution for the missing values. Have I understood your queries correctly?

Hi,

Can you say a bit more about your Ci and Cp, and why you are trying to include them both? Are you mostly interested in the fixed effects, or also in the relative contribution of those terms? Wouldn’t a “phylogeny” (I think perhaps better called a pedigree?) of all your individuals already include the relationships between your populations? In other words, what information is present in Cp that is not present in Ci?

Hi both,

Thanks for your replies.

@ mhollanders, yes the data are in long format and values are simply repeated in columns when they’re Pop-level. This is shown in the CSV. Regarding missing values, you say “you could specify submodels for the missing data in Stan”; is this not what the second bf() element bf(IndVar | mi() ~ 1) does? I’ve also looked into imputing missing values using the phylogeny with Rphylopars. I don’t suppose there’s an equivalent way to do this with Stan?

@ Ax3man, I’m by no means hellbent on including both Cp and Ci in the model. My reasoning for including both the individual and population phylogenies/pedigrees is that some variables are measured at the individual level and others are measured at the population level. But I’m almost certainly revealing my lack of knowledge here. My intuition is that I need to somehow specify that variation between individuals within a population is structured by the relationships between individuals within that population, while variation between populations is structured by the phylogenetic relationships between the populations.

In the brms vignette (Estimating Phylogenetic Multilevel Models with brms) the example with repeated measures only shows how to account for non-independence of observations within species when only the species-level phylogeny is available — but if you knew how closely related the individuals were within each species, wouldn’t you want to include that information in the model?

I hope that helps clarify my thinking.

Cheers,
Seth

Not necessarily, I think. If we consider the branch lengths of the tree across all individuals to measure their shared evolutionary history, then the between-species branches are typically so much longer than the within-species branches that they absolutely dominate the signal. Species may trace back a 1,000,000 generations, while individuals trace back only a fraction of that. So the co-variance due to shared evolutionary history will be almost completely determined by the between-species differences. To what extent this is the case can be seen from the relative branch lengths, if the within-population branch lengths are very small, then the model becomes the indistinguishable from one with just the species tree.

Fitting separate effects for your phylogeny and pedigrees comes with two issues then. 1) The conceptual issue is whether there is a meaningful distinction in what the phylogeny and pedigree are representing. Do they have different mechanisms in how they affect your trait of interest? 2) A methodological issue is how you are going to represent these two trees. If the phylogeny is simply a collapsed version of the pedigree, then the covariance matrices are likely to strongly co-vary, and it will be very hard to distinguish between them.

I would start by figuring out what your generative model of the trait of interest is. What factors may be important, and in what way? For example, if your populations have different ecologies, which may not trace their evolutionary history, you may want to include a random intercept per population, without covariance structure.

I think whether your explanatory variables are measured on the individual or population level is perhaps not very relevant as to what approach you should use for the tree. As Matthijs says, you just repeat the population level variable for each individual in that population.

I’ve never tried this, but perhaps one can use bf(IndVar | mi() ~ 1 + (1 | gr(species, cov = phylo))? (It might be very slow.)

1 Like

Hi,

Just to chime in here:

@Ax3man 's suggestion of imputing missing data using the phylogeny is spot on. I’ve done this many times, and it works great! Don’t bother trying to impute data in Rphylopars. I’ve also done many simulation tests of these methods in my own work just to make sure they are coming up with accurate distributions for the missing values that still reflect a reasonable amount of intraspecific error. Typically, depending on the shape of your tree, the missing value estimates are far better when including the phylogeny as a predictor. If you have more predictors that are not missing data, then use those too. For example, if you have complete data for some height and width variables, then you can do this: bf(IndVar | mi() ~ 1 + height + width + (1 | gr(species, cov = phylo)) . This will improve the estimates even more. The one catch is when you have some distantly related outgroups in your phylogeny. In that case, the missing value estimates can have a ton of uncertainty. In one example from my work, I used a mammal phylogeny that included a platypus along with a bunch of placental mammals. The missing data estimates for the platypus had a tremendous amount of uncertainty just because they are so distant from other extant mammals. But adding other predictor variables to the imputation equation helped this. Also, it can be a little slow, but using cmdstanr speeds things up a lot. The whole process is orders of magnitude faster than it was a few years ago.

Back to your data: Since the Ind’s are monophyletic and nested in the Pop’s then I don’t think you need to use two phylogenies here. @Ax3man 's comment above is correct in that the phy correlation matrix will include information about the whole structure. I would avoid taking means or anything like that. One option, if you want to model the influence of the Pop variable as it’s own group level effect, you could include a name/factor for each pop, and add that. I’ve done this where I labeled larger clades to try to identify whether processes were happening at deeper nodes, but without the structure. I hope this addresses your question:

but if you knew how closely related the individuals were within each species, wouldn’t you want to include that information in the model?

Let me know if you have other questions! These types of missing data / phylo/ multiple taxonomic level analyses have their own idiosyncrasies. Cheers

1 Like

Thanks all for your detailed and very helpul responses! @jonnations I unfortunately don’t have any morphological data measured for all species. I plan to now formulate the model like so:

Ci<-vcv.phylo(Ind_tree,corr=TRUE)
Cp<-vcv.phylo(Pop_tree,corr=TRUE)
bf(H ~ mi(IndVar) + PopVar + (1 | gr(Pop,cov=Cp) ) ) +
   bf(IndVar | mi() ~ 1 + (1 | gr(Ind,cov=Ci) ) ) +
   set_rescor(F)

That is, accounting for population level phylogenetic covariance when modeling H as a function of the environmental and morphological covariates, and using the individual-level phylogenetic covariance to help impute missing morphological data (there is strong phylogenetic signal there).

Is there anything else I should be worrying about before proceeding? (Sorry, very vague question but I figured why not…)

Cheers,
Seth

I think you’d want to add a (1 | Pop) to your first formula to model population differences that don’t follow the phylogeny. This is equivalent to the case in the phylogenetics vignette of having multiple observations per species.

@Ax3man So this is an interesting question that has been presented to me by both colleagues and reviewers. I’ve always meant to ask about it on this forum. Thanks for bringing it up in this thread! I think it’s relevant here, so I kept it within this original post.

I will use the same variables from the brms phylogenetics vignette so that folks can follow along.

In the phylogenetics vignette, the point of including species in the
phen ~ spec_mean_cf + (1 | species) + (1| gr(phylo), cov = A)
model is to quantify the influence of the variation in the species (intraspecific variation) versus shared covariance due to the structure of the phylogenetic correlation matrix. From the model above, you can compare the sd(species) and sd(phylo) outputs and make a statement about the influence each has on the model. This is a common focus of quantitative genetics studies, like the book chapter that Paul got the data from for use in the vignette. Those authors typically look at quite closely related species or populations.

However, many users of these methods work on species that are much more distantly related to one another, and our questions don’t (or can’t) focus on such small levels of variation. If you are not interested in this specific question of comparing the influences of these 2 sources of variation, and instead are only concerned with making sure that the shared ancestry is considered in the model (using the (1| gr(phylo), cov = A) term), then wouldn’t the “intraspecific” variation in the data (that is, the variation between multiple cofactor's that have the same phylo value) be considered in the regular phen ~ cofactor + (1| gr(phylo), cov = A) model? It’s my understanding that for each phylo value, a distribution in calculated, so wouldn’t it just be that the phylo distribution is calculated using multiple values rather than a single one?

I had one reviewer really push that perspective on me. It makes sense, but I’m still not sure which is better practice, or if it depends strictly on the question. I hope this makes sense.

As a side note, in my own work with many different datasets and species groups, I’ve observed that using phen ~ cofactor + (1| gr(phylo), cov = A) and phen ~ cofactor + (1|species) + (1| gr(phylo), cov = A) result in basically identical population-level parameter estimates. The only differences between the 2 models are the sd(phylo) values, which differ depending on whether the second group-level term is used or not.

Do you @Ax3man or anyone else have thoughts on this? Thanks!

In my experience this approach is common when evolutionary biologists perform phylogenetic comparative analyses with multiple observations per species, regardless of whether they are closely related. Interpreting the variances can definitely be of interest, but another common argument is that the (1 | species) effect is necessary to avoid pseudo-replication. Observations of the same species are not independent, and not just due to the phylogeny.

Imagine, for example, a trait which evolves very rapidly compared to the long branches in the tree. For such a trait, the phylogenetic signal will be low. Yet, the within-species variance may still be much smaller than the between species variance. If we only include a phylogenetic effect for species, this hierarchical structure in the data is not accounted for. As the variance captured by the phylogenetic term will be small, we would effectively treat all our observations as independent.

Or look the question at hand, where we care about differences between populations. Populations are separated in space, and experience different environments. These differences in environment may affect the trait of interest, and do not follow the phylogeny (although they may be correlated with the phylogeny). I would want to take these population differences into account, at least with random intercepts for population, and possibly random slopes for IndVar.

My understanding is that this is not a general result, but I’d be keen to hear why this would be the case. It would make sense when the phylogenetic heritability (Pagel’s lambda) is close to 1, since in that case the (1 | species) term will capture much less of the variance than the (1| gr(phylo), cov = A) term, and removing it will not matter that much. But if lambda is closer to 0 then I don’t expect this to be generally true.

2 Likes

Thanks for the great response @Ax3man! This is a very helpful way to think of it (and to explain it to reviewers :)

My understanding is that this is not a general result, but I’d be keen to hear why this would be the case. It would make sense when the phylogenetic heritability (Pagel’s lambda) is close to 1, since in that case the (1 | species) term will capture much less of the variance than the (1| gr(phylo), cov = A) term, and removing it will not matter that much. But if lambda is closer to 0 then I don’t expect this to be generally true.

I think you hit on it right there. Often I am working with relatively deep nodes of quite divergent taxa, rather than a pedigree within a single species or 2, and I’ve rarely, if ever, had a heritability/lambda value close to 0. It’s almost always over 0.5. So maybe that’s why the population-level estimates don’t differ. I will say that the pop-level estimates differ depending on whether the phylogeny is included or not. Using the phylogeny and the intraspecific variation, or just the phylogeny, the population-level estimates (not the group-level) are the same, but running a model just the intraspecific variation and not the phylogeny will result in different estimates, which makes sense to me, especially with high heritability scores.

I’ve also noticed that, the closer the heritability (Pagel’s lambda) score is to 1, the more trouble the model has parsing between the intraspecific variation sd(Species) and the sd(Phylo) values. Often the trace plots bounce between 2 similar values for each, like there are 2 values and the model doesn’t know which one each parameter should be. I read somewhere that mcmcGlmm uses an algorithm that’s better at determining the differences between these two parameters, but brms often seems mixed up when using both (but again, this problem never really affects pop-level estimates, and is typically kept in check with fairly regularizing priors).

I agree with much of what has been said, but I think there’s some useful additional perspective to contribute here. Remember that the brms approach to estimating a phylogenetic model is to assume that we can infer a covariance structure directly from the phylogeny. One of the commonest ways to do this is via a brownian motion assumption for trait evolution, and this is what is implemented by default in ape::vcv.phylo as used in the brms phylogenetic modeling vignettes.

If such an assumption is met, the intuition that

is not valid. “The pace of evolution” is just a question of the units in which the trait values are discussed. The covariances will be determined by how much of the branch length is terminal rather than internal, irrespective of the pace of trait evolution. The above intuition, rather, comes from our intuitive understanding that the brownian motion model is wrong. For example, in the case of stabilizing selection or a biophysically bounded range of plausible values, it is impossible for things to evolve rapidly via a brownian motion model over the long term.

One way to frame the whole question is whether and how to relax the assumption of trait evolution via brownian motion. If the trait values are stochastic and include some variance that is not under genetic control, we might want to add a random effect of individual (i.e. a residual). This is often blindingly apparent when we have multiple observations per species, because we can see between-individual variances that are completely inconsistent with the pace of evolution of the lineages’ mean values. If there’s some known environmental or geographic structure in the response, we might want to try to model that. The pagel’s lambda model, corresponding to fitting a random intercept per species, might be appropriate either if the pace of evolution has recently accellerated, or possibly if there is directional or stabilizing selection, such that the structure of the tree in deep time does not leave a long-lasting imprint on the trait covariances we observe today.

Just my 2 cents: all of these options are ways to hang extra variance components atop the brownian motion model. Your job as the modeler is to figure out what set of extra variance components transforms the unrealistic brownian model into something that does a close-enough job of capturing reality.

2 Likes