Phylogenetic time of event model

Dear stan/brms users,

I am a new modeler (having only been learning about brms for about a week), so please forgive my ignorance of many aspects of the question I’d like to ask. I work in the field of phylogenetic systematics, and I am trying to come up with a way to model the probability of a set of events through time, and assess the likely time interval in which the events occurred. The data are phylogenetically structured and a mix of contemporaneous and non-contemporaneous (i.e. not all of the tips of the tree line up at t=0).

The outcome response data is coded as binary 0/1 (the occurrence of an event, or the absence of evidence of an event), and are associated with each terminal in the phylogenetic tree, which has branch lengths in units of time. It appears that these “independent” (though phylogenetically structured) events are clustered in time, and I would like to assess that hypothesis quantitatively. I will refer to the event of interest as the “outcome event” because this is how I have been treating it, but this may be wrong (see below).

I reviewed the phylogenetic brms vignette Estimating Phylogenetic Multilevel Models with brms and this gives me some hope that brms/stan may provide a suitable way to explore these topics. I have been experimenting and going through some vignettes to become generally familiar with how brms works (though I am still very much a beginner!!). I have been able to get something akin to phylogenetic logistic regression (but bayesian, which is extremely interesting and potentially important its own right) to work, which I think looks something like this:

brm(uncex.merged ~ log.stem.ages_since + (1|gr(phylo, cov = A)), data2 = list(A = A), data = logisticreg.newdat.LHTs.bm, family = bernoulli(), cores=3, control = list(adapt_delta = 0.99))

In this simple model, I have coded the responses variable uncex.merged as 0/1 (as noted above). For the independent variable stem.ages_since I have coded the time elapsed since a particular geological event has occurred. This model allows me to assess the probability of the outcome event as a function of the distance (time) to a geological event of interest. In my case, the probability of the event increases toward the time of the geological event, which is pretty cool.

This is useful and interesting, but not exactly what I’m looking for, because it assumes knowledge of a particular geological event that may be related to the outcome event I’m trying to model, and it may be better to assess that hypothesis naively.

I am not sure if I should be modeling the outcome event as a function of time, or time as some function of the outcome event. Given that I want to assess WHEN the event(s) occurred (and associated uncertainty, given the input data), perhaps I am thinking about this backwards… I did find some information on performing survival analysis in BRMS, but I am not sure that is appropriate in this case (though I admit I am not familiar with that kind of analysis at all).

Either way, I would very much appreciate any guidance or insight into what questions I may need to answer in order to better articulate this hypothesis as a brms model. It would be great to be able to assess both 1) when the events of interest occurred, and 2) their probability of occurrence through time. Alternatively, because I am interested in the proximity of the events of interest to a known geological event, perhaps another way to ask this question would be, what is the posterior probability that the event of interest occurred within some proximity of the known geological event – but these seem like two different models. What I have outlined above is of primary interest, but it would also be great to be able to assess these hypotheses with respect to additional covariates I have not mentioned.

Thanks for your time,
Jacob Berv

2 Likes

Hi Jacob, welcome to discourse!

I’m having a bit of trouble understanding exactly what the modeling problem is here, and if you’re able to provide a bit more detail about the biology involved I bet it would help you get a useful answer. Here are a few thoughts:

  • There are two things that you’re calling “events” here. One is some geological event in deep time. The other, which you call an “outcome event”, is what exactly? Is it the presence of some discrete phenotype in a lineage? Is it an evolutionary event that you can actually localize to a particular time in the fossil record? I think the application of logistic phylogenetic regression to your problem is at least somewhat suspect in either of the cases that I mention above.
    In the former case, I’m skeptical that Brownian motion of the logit-probability of a discrete phenotype is a good evolutionary model. In particular, I’d expect the logit-probability of a phenotype to exhibit big jumps (essentially between negative infinity and infinity), corresponding to the synapomorphies on the tree.
    In the latter case, you are modeling the timings of discrete events that are localized on the evolutionary tree, and these might plausibly be modeled as independent. That is, you aren’t modeling the present-day phenotypes (for which you need a model for phylogenetic covariance); you are modeling the emergences of the synapomorphies themselves, which are assumed to arise independently in independently evolving lineages.
  • It sounds like your ultimate goal is to infer the times of the geological events that influence the “outcome events”. It seems to me that this might be extremely challenging. Suppose that the logit-probability of an outcome event is given by some linear function of time-since-event, as you’ve written in your model above. Then we can always assume the event occurred earlier or later and adjust the intercept accordingly. It seems to me that the only hope for inferring the time of the event is if there is some predictable non-linear change in probability associated with the event. For example, if you have some phenotypes measured in deep time from the fossil record, then you might observe that the rate of new synapomorphies increases substantially at some particular moment in geologic time, which could correspond to an event of interest. Note that detecting this requires that you have some (i.e. many) measurements of “outcome events” prior to the geological event of interest.

Cheers,
Jacob

Hi Jacob, thanks so much for your response!

This may be one way forward. Thinking of the rate of occurrence of the event of interest, and then trying to determine when that rate changes? The “event” I’m trying to model is not a phenotypic character (I was trying to use general language so to increase the likelihood that someone more familiar with modeling might be able to help). But, to be more specific, the “event” is actually coded to represent the detection of a shift in models of character evolution (as determined a priori). In order to model this as a logistic regression (ie, the presence or absence of a shift), I grafted negligible length terminals to each internal node in the tree – which allows me to account for the vcv and code “tip” data for internal nodes, as well as extant terminals (this works with phylolm in R). This effectively turns a tree with only contemporary terminals, into a tree with non-contemporaneous terminals as well (to which I can assign a state of 0 or 1). Some other folks I’ve spoken to who are knowledgeable about PCMs have indicated to me that this is at least in principle a valid way to do this.

Nevertheless, your idea of identifying when the rate of occurrence of the “event” (aka, model shift) changes sounds promising. In my case, I have measurements of the outcome event for every node (ie. internal tip of length zero) in the tree. Might you have a more specific idea of how to do that? I generally agree a brownian assumption may be insufficient for phenotypes – but even if I were modeling a phenotype, it would be cool to see if building a model like this was possible under that assumption too.

1 Like

This sounds really cool! Just to make sure I have this right, you are tagging each node with whether you infer (in a different, upstream modeling step) a change in rates of character evolution along the stem edge.

If you tag the internal nodes with binary states that indicate whether there is a shift in rates of character evolution, then modeling this with phylogenetic regression implies that you think that the rate of shifts in rates of character evolution is itself phylogenetically conserved. That is, given that a lineage displays a shift in rate, further shifts in rate are therefore seen as more likely in that lineage. You know a lot more about this stuff than I do, but is that really your a priori expectation? Naively (and I have no paleontological background) I would consider a Pagel’s lambda model here.

One subtlety that arises is that I imagine that it’s quite hard (or impossible) to distinguish a single shift in rate from multiple successive shifts in rate at closely spaced edges along the tree. If your estimation procedure for the rate shifts penalizes multiple successive shifts, so that rate-shifts are strongly favored to be sparse, then I think the following will not apply. But if not… any time that a single rate-shift is mis-categorized as multiple shifts in quick succession, that will give the spurious appearance of phylogenetic signal in the process, since you’ll get multiple closely related internal nodes getting tagged with shifts. Just something to think about.

A final thought is that presumably you have information not only about the presence of rate shifts, but also about their magnitudes and signs. I wonder if you might squeeze more information out of treating these rate shifts as continuous rather than binary?

Essentially yes – however the shifts I am detecting upstream are not necessarily shifts in the rate of character evolution – they are shifts in the model of character evolution – see this paper for more detail https://www.pnas.org/content/116/34/16921 – it is not exactly this, but based on this.

This is an interesting point – and no – I would not presume that further shifts in rate would be more or less likely in that lineage. But this is unknown to be honest.

Could you elaborate?

The upstream method (not published) does not seem to be affected by this – it uses information criteria to compare models with and without shifts at various points in the tree and then optimizes a best “configuration of shifts.” In my case, the shifts I’m identifying are occuring in different parts of the tree, but apparently coincident in time. I guess I’d like to now assess the probability that these shifts coinciding in time is a coincidence… – hence the idea of trying to model the probability of the shift events through time… I’m not sure if I’m articulating this properly, but clearly there may be many different angles to attack this. I can clearly see that these shifts are phylogenetically independent and coincident in time (within some margin of uncertainty), but this is qualitative. (there actually are some successive shifts because the model is polarized, but this is getting too in the weeds I think).

There are some parameters that are shared across the model shifts – however it wouldn’t be accurate to call them “rate” parameters in my particular case – but rather parameters that describe the equilibrium composition of the character matrix. I agree this may be interesting to try to look at. Nevertheless I still think it would be worth trying to see if there was some way to assess what I’ve outlined here (it is at least conceptually simpler at this point) – I’ve attached a bad sketch of what I had been originally imagiend trying to do with BRMS – maybe this is a bit clearer? 0s and 1s indicate the model shifts (hypothetically) – and then the sketch above is trying to illustrate the question of “when” the cluster of shift events has occurred. Perhaps one could then imagine estimating the 95% HPD interval describing the “coincidence” of shifts in time? (it also happens that these shifts are near an important geological event, and I’d like to know if that HPD interval spans the known event).

1 Like

Update – I (to the greatest astonishment of myself) think I have figured out how to do one aspect of this and it was much more straightforward than I thought. It would be helpful if anyone might be able to confirm that this does what I think it does:

So – If I do something like this:

time_test<-brm(stem.ages ~ uncex.merged + (1|gr(phylo, cov = A)), data2 = list(A = A), data=logisticreg.newdat.LHTs.bm.survival, family=gaussian(), control = list(adapt_delta = 0.99), cores=3, chains=3, iter=2000)

where stem.ages are the node ages
uncex.merged is the “outcome event” coded as 01 (but now as a factor)

This seems to generate two posterior probability distributions for each state, reflecting the average node ages associated with each state. So – this essentially seems like a bayesian phylogenetic ANOVA (super cool if true!)? In this case, I can examine the relative node ages associated with each state, to test one of the hypotheses I outline above.

This actually does seem to answer one of my primary questions. I am not sure if the gaussian() family assumption is most appropriate. Weibull also seems to work but I am not sure how to interpret phylogenetic signal from that model (at least, based on the phylogenetic vignette). Is there an easy way to perform model selection on different assumptions of the distribution family? (I suppose in bayes land maybe posterior predictive checks may be relevant…?)

Also – am I correct in assuming that accounting for phylogenetic signal in the data in this way is equivalent to applying a lambda model? Or does this assume a pure brownian motion model? Is it possible to apply alternative assumptions like OU/EB etc or would that require some new functions? I realize BRMS may not have some of this stuff baked in but this seems like a great bayesian alternative to some traditional PCM methods.

Thanks for your help!

J

I don’t think this can be what you want. The stem ages of nodes don’t evolve via Brownian motion on a phylogeny, or by anything approximately similar to Brownian motion on a phylogeny. In fact, since the stem ages are exactly specified by the phylogeny itself, if you really “control for phylogeny” there would be no residual variation in stem ages whatsoever.

1 Like

Yea-- I had this thought as well, and wondered if there may be some circularity. Nevertheless, as a general framework, is it correct that what I have outlined above would represent a bayesian phylogenetic ANOVA? (all else being equal).

In any case, this produces results that seem very plausible. A brownian assumption for the splitting times would imply that I think patterns of speciation proceed as a random walk, right?-- this seems like a reasonable null to me – no? When I run this with a gaussian() assumption I get an estimate of lambda ~0.5 range. Which implies that there is some phylogenetic signal in the distribution of splitting times associated with the state of interest (which also seems plausible). I suppose I can run this with and without the vcv information to see how that affects the output.

J

This is indeed a phylogenetic anova!

Either I am misunderstanding what you mean by stem.ages in your model, or else my answer is no. Stem ages here are just the stem ages of the clades whose last common ancestor is given by a focal node, right? It is literally impossible for these to evolve via brownian motion; a “brownian assumption” doesn’t make sense. By definition, these stem ages evolve by linear motion: positive in forward time and negative in backward time. The difference in stem ages between two nodes is obtained simply by starting from the base of one of those nodes, taking the (negative) sum all the branch length back to the LCA, and then adding the (positive) sum of the all the branch length out to the other node. The brownian motion assumption would imply that two extant tips on the tree should have less similar stem ages than either tip has with its last common ancestor.

If you apply the brownian motion model to these data, you should certainly see phylogenetic signal, since closely related internal nodes will tend to have similar stem ages. Note that if one node is adjacent to the other (i.e. an immediate descendant), then the difference between their stem ages will just be the length of the stem of the basal node. Since this length will tend to be small compared to the average difference in the stem ages of pairs of arbitrary nodes on the tree, you will certainly see some phylogenetic signal here. The problem is that the phylogenetic signal is actually absolutely perfect, and the fact that you’re not estimating it to be perfect reflects the non-correspondence between the Brownian model and the linear-in-forward-time model which is the true model here by definition.

Cool – and awesome that it’s this straightforward to do such a comparison in general. Definitely getting my more hyped about BRMS. I think the phylogenetic anova in phytools and geiger both assume brownian motion, so doing such an analysis in BRMS with lambda assumption may actually be more flexible than the off the shelf ML implementations. Though I suppose one could always do a pgls with different model assumptions.

Your explanation about stem ages makes sense I think. In my case, the “stem” ages are the ages of the parent node for a given focal node. I think that is what you described.

Cool. Is there a way to put a prior on phylogenetic signal to constrain it to lambda=1

1 Like

It’s certainly possible to estimate the brownian model without lambda, but in this application what you’d get out at the end is just a worse-fitting version of the (already severely misspecified) brownian model.

hi @jakeberv,

This is an interesting conversation! I quite curious about this problem you are trying to solve.

Like @jsocolar said, it is. This is one of my favorite tools! For me, this is an intuitive way to do this type of analysis - categorical predictor variables, a continuous response, and whatever additional predictors, group level effects, or correlations the data may have (including a vcv matrix). The easiest way to use these estimates in hypothesis testing is to calculate the contrasts or differences; subtract one distribution from another and see how great the differences are. It helps me to remove the intercept in the model so the estimates of the two categories are directly presented. Also, it is probably a good assumption to assume that each predictor has it’s own variance rather than the same variance across predictors. You could try:
time_test<-brm(bf(stem.ages ~ 0 + uncex.merged + (1|gr(phylo, cov = A), simga ~ 0 + uncex.merged)), data2 = list(A = A), data=logisticreg.newdat.LHTs.bm.survival, family=gaussian())
This will give you the event estimates and their sigmas directly rather than using one as a dummy variable / intercept.

Glad to see more PCM folks converting to brms :) It’s an absolutely phenomenal tool

According to Paul and the brms github, additional Gaussian Process models have been in the development queue for a while, including OU gp’s. We’re all hoping they get implemented soon! In the meantime you can run this same model with an OU gaussian process in McElreath’s rethinking package. Statistical Rethinking v2 has a large portion dedicated to phylogenetic covariance and how to implement a strict Brownian (not exactly what brms does, see here!) and OU models.

happy modeling!

Super neat @jonnations. I am excited about these possibilities in BRMS. We could definitely use more vignettes on how to do basic PCM analyses with BRMS – that would be a hugely valuable resource for the field. Once I get more familiar with this kind of modeling, I may try to put some stuff together… bayesian versions of anova and logistic regression are already fascinating to think about in this context (at my currently quite superficial level).

@jonnations

it looks like there is a misplaced comma or parentheses here (getting an error)-- could you double check this for me? not sure how to fix it (don’t really understand exactly what this is doing)

I can’t actually execute the code without the data, but there is a typo where simga is written instead of sigma. Pretty sure that’s the only typo.

I’m getting

> time_test<-brm(bf(stem.ages ~ 0 + uncex.merged + (1|gr(phylo, cov = A), sigma ~ 0 + uncex.merged)), data2 = list(A = A), data=logisticreg.newdat.LHTs.bm.survival, family=gaussian())
Error: unexpected ',' in "time_test<-brm(bf(stem.ages ~ 0 + uncex.merged + (1|gr(phylo, cov = A),"

oh yea, you need to put an extra close parenthesis after cov = A) before the comma, and then lose one of the close parentheses after 0 + uncex.merged

Oops , sorry for the spelling error and the lack of explanation.

First @jakeberv the reason you’re getting the error is that there should be an extra ) after cov = A).

Try this:
formula = bf(stem.ages ~ 0 + uncex.merged + (1|gr(phylo, cov = A)), sigma ~ 0 + uncex.merged)

time_test<-brm(formula = formula, data2 = list(A = A), data=logisticreg.newdat.LHTs.bm.survival, family=gaussian())

the bf command sets up a formula object with multiple formulas, you can read about it in the package docs.

These were just some suggestions to make the phylo Anova easier to interpret and process.

What this is doing is two things: first, it’s removing the intercept from your model (the 0 + part). You’ll notice in the summary that your uncex parameter will be listed as b_uncex_1 and b_uncex_2 or something similar rather than an intercept value and the b_uncex_2 value. This is easier for post processing as they all have the correct name. Second, the second part of the formula is saying estimate a separate sigma value for each value of uncex. I’m assuming that it doesn’t make sense to model the same sigma value for both of your uncex values. This is like an unequal variances t-test rather than an variances t-test. See this M. Vourre blog post Otherwise you are basically running the same model you did before.

definitely :)

1 Like

Many thanks to both @jonnations and @jsocolar – I am experimenting this this now. It seems to be running but unlinking the variances has slowed the mcmc way down, at least by a factor of 10-20. I’m sure using the default priors is not helping the runtime. Anyways-- thanks again and I’ll update here.

@jonnations Thanks for your advice above. I’ve been playing with this and I’m getting some fascinating results. I wanted to ask about the general format of the formula you proposed for bayesian phylogenetic anova with unequal variances. You noted:

formula = bf(stem.ages ~ 0 + uncex.merged + (1|gr(phylo, cov = A)), sigma ~ 0 + uncex.merged)

If I wanted to add additional co-variates , how would I modify this formula? would that look something like this:

formula = bf(Y ~ 0 + X + Z + (1|gr(phylo, cov = A)), sigma ~ 0 + X + Z)