Distribution of the rooted tree

Hi guys,

Is there any way to model the distribution of the rooted tree (graph topology view)? I have a known matrix N that is a multiplication of two unknown matrix U and B (N=U*B). I know each column of U is sum to one (simplex vector) and the B matrix can have only two integer value zero and one and is equivalent to the rooted tree. Any suggestion? Thank you in advance.

Stan doesn’t support integer parameters and the marginalization over the values in B would be intractable. If you do find something that says it fits this kind of thing, I’d be careful to examine the multimodality and the ability to recover parameters from fake data.

@betanalpha has been thinking a bit more about tree topologies, but they’re going to require specialized samplers.

And new math. We’re nowhere close to understanding how we might approach anything like this, let alone implementing it efficiently.


I’m not sure what your application is, but that sounds similar to the phylogeny inference problem. If that is what you are doing, I’m sure you are familiar with the relevant literature. However, if you are coming at it from another domain, you might want to check out something like BayesTraits.

All existing Bayesian phylogenetic algorithms are based on either heuristic discrete MCMC moves or heuristic random jump MCMC moves (ultimately equivalent) that yield only local exploration of the possible tree topologies. From what I and some of my colleagues can identify the resulting exploration is poor and the resulting chains end up exploring just the neighborhood around the initial point, akin to how Random Walk Metropolis behaves in high dimensions. In other words, you get something slightly better than a point estimate but nothing close to an accurate quantification of uncertainty in the phylogenetic space.

As I implied above this is an active research topic that multiple groups around the world on pursuing, but we’re missing some serious conceptual gaps in the mathematics that would allow us to build something generic enough to be applicable to applied problems.

For example, why does dinosaur ancestry keep changing every few years? Because the inferred phylogenies are based on fragile point estimates and the inclusion of any new data makes drastic changes to the inferred topology. Were we able to accurately quantify the entire posterior over phylogenies then the inferences would be much more stable across time as new measurements are included. Even if you don’t care about the science, at least care about destroying the favorite dinosaurs of young children. Won’t somebody think of the children?



I was wondering if you know any application of Stan for phylogenies?
Do you have any suggestion about how I can sample from a tree?

Thank you.

Unfortunately, no. Conditioned on the tree topology you just have a bunch of edge weights which can be fit with Stan reasonably well, but the actually tree topologies themselves are difficult to sample from.

Because trees are relatively simple graphs, however, one might be able to get away by assuming a fully connected tree topology and then try to infer all of the weights. Or the weights may have additional structure to match the nested structure of the tree – for example, the weight of a node is the product of the weights of its parents.

But at no point would you be able to infer that a given edge is “zero”. Instead of asking whether an edge exists you would instead have limit yourself to questions of how strong an edge is.

1 Like