Sampling over many phylogenetic trees in brms

Hi Forum,

I am working on a manuscript that includes some phylogenetic multilevel models. Some co-authors suggested that I sample over a number of phylogenetic trees from the posterior of an MCMC analysis (BEAST in this case, but other Bayesian programs do the same thing). As the trees are not all the same, this carries the uncertainty of the phylogenetic estimation into the subsequent model.

Ideally I would run brms over multiple trees and then merge the results (using thinning or not) into a brmsfit object so that other functions such as pp_check() and marginal_effects() will work.

I saw the previous forum post HERE on doing the same thing in Stan, but I would ideally like to stick to brms for now. Additionally, after using several GLMM programs that work with phylogenies, I find that brms is by far the easiest to use, has the best built-in functions, and has the best support. I think it will really catch on with the phylo-biologist crowd in the next year or two. As the multiple-tree method becomes more and more expected of authors, having a clear way to run this will benefit everyone.

I am a (novice) R user and not a programmer, so I wanted to ask for some advice on this. There are a couple of approaches that I think might work, but I’m sure there are much more elegant options.

-One way would be to loop over all the sampled trees. For 100 trees this would produce 100 brmsfit objects that would need to be merged (and probably thinned).

-A second thought would be to use one of R’s apply-family functions, like sapply(), to avoid the for loop and perhaps gain some speed?

-I suppose another option would be to refit the previous model with a different tree using update(), somehow continuously updating and adding to the previous model (rough sketch below)
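Here is a rough sketch of what I mean by the update() idea; this is untested, I’m not sure whether update() will accept a new cov_ranef, and trees_vcv is just a placeholder name for a list of per-tree covariance matrices:

# hypothetical: refit the previous model with the i-th tree's covariance matrix;
# update() passes extra arguments on to brm(), so this might avoid recompilation
fit_i <- update(fit, cov_ranef = list(phylo = trees_vcv[[i]]))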

Any suggestions on how to proceed with this would be very helpful!

@Zacco @maxbiostat @dpascall I am tagging you three in here as well as you all have commented on phylogenetic methods in stan/brms before, and I thought you would have some interest in this topic, and perhaps some suggestions.

Thanks!

I am fitting pretty standard brms phylo models, nothing too complicated:

fit <- brm(
  State ~ PHENO + (1|phylo) + (1|Species), data = data,
  family = cumulative(link = "probit"), cov_ranef = list(phylo = A),
  iter = 5000,
  save_all_pars = TRUE,
  prior = c(
    set_prior("normal(0,50)", class = "b"),
    set_prior("normal(0,50)", class = "Intercept"),
    set_prior("normal(0,1.5)", class = "sd", coef = "Intercept", group = "Species"),
    set_prior("normal(0,1.5)", class = "sd", coef = "Intercept", group = "phylo")
  ),
  control = list(adapt_delta = 0.99, max_treedepth = 15),
  cores = parallel::detectCores(),
  file = "20181005_tail_fit.Rds"
)
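For reference, here is roughly how I read in the trees and build the A matrix with ape (the file name is just an example):

library(ape)
# posterior sample of trees from the BEAST run (example file name)
trees <- read.nexus("beast_posterior_sample.trees")  # a multiPhylo list
# phylogenetic correlation matrix from a single tree, passed to cov_ranef above
A <- vcv.phylo(trees[[1]], corr = TRUE)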

A way to run this in parallel on a cluster could also save a lot of time (for 100 trees, I estimate a runtime of 25 hours on my machine).

Please also provide the following information in addition to your question:

  • Operating System: OSX
  • brms Version: 2.4.4

You can use the combine_models function of brms, which allows you to merge the posterior samples of multiple brmsfit objects. Maybe that’s helpful to you.
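A minimal example, assuming fit1 and fit2 are brmsfit objects fitted to the same data (only the tree differs):

# merge the posterior samples of several brmsfit objects into one
fit_comb <- combine_models(fit1, fit2)
# equivalently, pass a list of models via mlist
fit_comb <- combine_models(mlist = list(fit1, fit2))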

I very much agree that getting this into brms would be good. Implementing it manually for the manuscript I’m working on was a real hassle, and I don’t really see many people being willing to jump through the algorithmic hoops to do it and get it running fast enough to be usable, especially when they could just throw a summary tree into a standard package and get an answer in minutes.

Can you do this? I’m pretty sure (though I’d like someone who’s considerably better at probability than me to confirm or refute this!) that the posterior distribution conditioned on a series of parameters does not in general equal the average of the posterior distributions conditioned on each parameter individually.

In any case, you need to loop over all of your phylogenetic trees, fit one model per tree, and merge them afterwards. 25 hours does not seem like that much of a runtime, so that should be doable.

Can you elaborate on this a bit more? Not sure if I understand this statement.

My original sentence reads in a way that’s much more certain than I intended it to be! But what I was trying to get at, and I would like someone to confirm or deny, is whether:

p(parameters | data, all N phylogenies) = the concatenation of (the draws from) p(parameters | data, phylogeny[i]) for i from 1:N

i.e. is the total posterior distribution the average of the posterior distributions for each tree? I thought that it wasn’t, but it’s entirely likely I’m wrong about that. Either way, that’s the assumption you’re making when you run the model separately over each tree and then concatenate the results post hoc.

E: And even if it is, presumably it assumes each individual model is mixing at exactly the same rate, so that a given number of draws of the posterior distribution for phylogeny[i] gives the same amount of independent information as a given number of draws of the posterior distribution for phylogeny[i+1]?

If all models are assumed equally likely(!), then I think we may very well merge all chains of all models into one single object. This is also the approach to multiple imputations done in brm_multiple.

Of course, the accuracy of that joint model depends on the accuracy of individual models. And samples of the joint model may be biased if the information provided in the samples of the individual models differs drastically. I haven’t seen a formal analysis of this though.
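To make the assumption explicit, here is a sketch of the identity at stake, with $\theta$ the model parameters, $D$ the data, and the $N$ trees $T_i$ treated as equally weighted draws:

$$
p(\theta \mid D) = \sum_i p(\theta \mid D, T_i) \, p(T_i) \approx \frac{1}{N} \sum_{i=1}^{N} p(\theta \mid D, T_i)
$$

Pooling an equal number of draws from each per-tree posterior samples from exactly this mixture; the approximation step is where the “all trees equally likely” assumption enters.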

But what I was trying to get at, and I would like someone to confirm or deny, is whether:

p(parameters | data, all N phylogenies) = the concatenation of (the draws from) p(parameters | data, phylogeny[i]) for i from 1:N

Do you mean in general? Or what is the scope of this statement?

I guess the most general form of the question I’m trying to ask is this:

Assume we have some data with associated uncertainty that we can sample from, so we can get a distribution of likely values for that data. Is it always true that if you take each possible sample of the data, run a model on it separately, and then merge the chains of those models, you get the same final posterior distribution as if you had directly marginalised out the uncertainty by using all the samples in one model? (Let’s assume there isn’t anything weird going on with differential rates of mixing, and we’re treating each sampled value as equally likely.)

E: Thanks for de-confusing me, by the way. I’m not sure where I’d got it into my head that you couldn’t do that.

Thanks for the help and suggestions.

@paul.buerkner, do you think there is some way to use cov_ranef = list(phylo[1:100]) in brm_multiple to do this, or is that method restricted to multiple datasets passed via data =?

No way to do it in brm_multiple, but a simple for loop over models should suffice:

fits <- vector("list", 100)
for (i in seq_along(fits)) {
  # refit the same model, swapping in the i-th tree's covariance matrix
  fits[[i]] <- brm(..., cov_ranef = list(phylo = phylo[[i]]))
}
# merge the posterior samples of all fitted models into one brmsfit
fit_comb <- combine_models(mlist = fits)
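And since you asked about running this in parallel: a sketch using parallel::mclapply (untested; forking works on OSX/Linux, and cores = 1 inside brm keeps the models from competing for cores):

library(parallel)
# fit one model per tree in parallel; ... stands for the brm() arguments above
fits <- mclapply(seq_along(phylo), function(i) {
  brm(..., cov_ranef = list(phylo = phylo[[i]]), cores = 1)
}, mc.cores = 4)
fit_comb <- combine_models(mlist = fits)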

Of course, this approach critically depends on the assumption that all trees are equally likely.


Honestly, I don’t think that assumption is justified at all. If one had a set of weights representing the importance to be given to each tree, would it be possible for combine_models to accommodate that?

Not with combine_models, but posterior_average can achieve this.
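A sketch, assuming fit1 and fit2 are the per-tree fits and the weights are supplied as a numeric vector:

# returns a data frame of posterior draws, sampled from the models in
# proportion to the supplied weights (not a brmsfit object)
draws <- posterior_average(fit1, fit2, weights = c(0.7, 0.3))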

Presumably if you’re using a posterior distribution of trees from something like BEAST, each tree is “equally likely” in the important sense, given that they’re already sampled in proportion to their underlying probability? (Putting aside the issue that there are so many topologies that the majority will not have been sampled during the original analysis at all.)

I suppose so. The problem is that if you have a budget of 100 trees, it’ll be very hard for this sample to reflect the true posterior. You could still argue that your averaging procedure is sound because it reflects the phylogenetic uncertainty represented in that sample, I guess.

That’s definitely true, but I think it’s unavoidable. I think the phylogenetic uncertainty of the sample is the only reasonable solution unless you’re going to set up some crazy joint MCMC for the tree estimation and the downstream analysis (as you can do in RevBayes, as I understand it). But if you’ve got a distribution of trees and you want to stick that into Stan/brms, it’s not clear to me where you’d get the weights for any kind of weighted analysis. Even if you’re subsampling a larger posterior distribution of trees for computational reasons, I’m not sure how I’d generate a sensible weighting for those trees based on the whole posterior. Though, in fairness, that might just be because I’ve not thought about it much; maybe there’s an easy solution to weighting a subsample of the trees.

In order to get a consistent posterior, you have to weight each tree topology by the marginal likelihood conditioned on that topology.

Now for the bad news. To be polite, these marginal likelihoods are extremely challenging to compute using information from only MCMC, such as samples generated conditional on each individual tree topology, and to be more accurate I will claim that we do not currently have the technology to compute them accurately at all. On top of that, we also don’t have the technology to get a reasonably representative sample of the tree topologies themselves (no, neither RevBayes nor BEAST accurately samples tree topology posteriors). These problems are extremely difficult and we’re just not there yet with regard to the necessary tools.


There’s good news to go with the bad news. Most of the biologists I’ve dealt with are more interested in specific questions like “how many trees does this split occur in?” than in the entire tree, and interpretation tends to focus on specific splits that occur in “most” trees, so these outputs tend to get read as “better than ML/max. parsimony”.

IDK if there are models that do approximate summations to avoid dealing with the combinatorial explosion in tree space (which typically happens far away from the site where inference is wanted), but if not, it’s a good idea and somebody should do it.

…

Trying to “average” these trees though… that’s terrifying in terms of interpretability.

Kind of. See this paper for an analysis based on the principle that clades (“splits”; the difference is subtle but irrelevant here) are approximately independent given some regularity conditions. Others have shown this assumption to be dubious in certain settings.

While tracking subtrees is certainly a very popular way of mitigating the superexponential growth of the parameter space, it is not without its own problems. It can be shown that any two subtrees are either positively or negatively correlated (i.e. they can’t be uncorrelated), for instance. It is not clear to me how the partition structure of clades/subtrees/splits behaves in the posterior (there are results under some popular priors).

The main point of the comparative method, however, is to use the phylogenetic tree as the scaffold of the dependence structure in the data (“phylogenetic variance-covariance” matrix). I don’t know enough to see how one could do the same stuff considering only clades/subtrees/splits.


What do you think about the probabilistic path HMC stuff? It’s way over my head, but is that the kind of development you’re talking about?

I think that’s commonly what they’re used for, yeah. “Is this clade supported in the majority of the trees in the sample?” If yes, interpret; if no, shrug and say that that part of the tree could not be well resolved. In molecular epidemiology, for instance, we’re very interested in using sequence data to track outbreaks, and a poorly resolved part of the tree just tells us nothing about the epidemiology at that stage. Comparative methods are a bit different, though, in that the entire tree matters, because the implied variance-covariance matrix between the species depends on the full tree (i.e. the topology and the branch lengths), so you can’t just ignore the less well supported bits.

I’m not sure anyone’s talking about averaging trees, though. Well, apart from the mathematicians looking at Fréchet means of groups of trees in tree space in the theoretical literature, but I’ve never found a biologist who talks about averaging trees. Our tree summarisations are more for visualisation than interpretation, and, as you say, we’re generally looking at clade splits across the posterior. I think what we biologists are more interested in is carrying forward the uncertainty in the phylogenetic hypothesis into downstream analyses as far as possible, using the logic that any single summarisation of the phylogeny is in all likelihood wrong.

I wasn’t suggesting ignoring. For a parallel example: a Stan model of a mixture of normals doesn’t ignore the assignment of data points to components but it does sum them out (no discrete parameters). The assignments can be simulated from the posterior after fitting without loss of information.
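To spell out the analogy (a sketch, with $z_n$ the discrete component assignment being summed out and $\pi_k$ the mixing weights):

$$
p(y_n \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(y_n \mid \mu_k, \sigma_k)
$$

and $p(z_n = k \mid y_n, \theta) \propto \pi_k \, \mathcal{N}(y_n \mid \mu_k, \sigma_k)$ can be recovered after fitting, so nothing is lost by marginalising.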

Yeah, that’s a little different because they’re changing the model. I was thinking of approximating the math rather than the model. I’ll see if I have some time to be more specific… :)