Phylogenetic vignette and loo

I was just reading the updated phylogenetics vignette and I noticed the model comparisons were happening with loo(). I think this isn’t valid, as leave-one-out requires samples to be iid and phylogenetic data definitely does not qualify.

Yeah, indeed loo might not be the most appropriate choice here. CCing @avehtari

I guess you mean this vignette?

loo doesn’t require samples to be iid, but exchangeable. In that vignette single observations phen are exchangeable given cofactor and phylo. Having a prior with correlations between phylo groups doesn’t change that. See, e.g. chapter 5 in BDA3 for conditional exchangeability. If the prediction task is to predict phen for a new animal in one of the observed species given cofactor and phylo, loo is fine. If the prediction task is to predict phen for new animal of species without phen observations but known phylo covariance, then leave-one-species-out could be used instead.

In this blog post, it’s suggested that LOO can be problematic if samples aren’t iid:

Unless of course I’ve misinterpreted something, which is possible!

It seems I need to talk with @anon75146577. When Dan writes

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead.

leaving out the whole future instead is not necessary because of non-iid assumption, but because that is the usual prediction task for the time series as you don’t have the future available when making the predictions. There can be also time series where the future is available: consider for example an old vinyl record of with scratches making parts of time series presenting music missing. When making a model to fill the gaps (or remove snap, crackle and pops) we can use also the future. Naturally it would not be leave-one-out, but, e.g. leave-one-second-out or something. Also there are time series and models where loo is ok even when the future data is not usually available (Burman and Nolan, 1992).

Instead of iid, it would be better to use term exchangeability and I continue with that. In a hierarchical model observations are not assumed exchangeable, but they are assumed to be exchangeable in each group and groups are assumed to be exchangeable.leave-one-out corresponds to predicting a new observation in one of the existing groups and it’s ok if we assume exchangeability of the new observation and the old observation inside the group. leave-one-group-out corresponds to predicting a new observation in a new group and it’s ok if we assume exchangeability of of the new and old groups.

Time series is like a hierarchical model where often each time point happens to be it’s own group. Based on what I said about hierarchical models loo is fine if we want to predict a new observation for one of the existing groups, but usually we want to predict for a new group, that is a new time point. But for time series the groups (time points) are not exchangeable! For time series the order of the groups (time points) matter, and that’s why when we want to predict for a new time point we need to take into account the structure and the prediction task.

In the phylogenetic model, there is dependency structure for the species assuming similar parameter values for similar animals. Inside a species group observations are exchangeable and loo is fine. If we want to predict for a new species and we know the phylogenetic dependency of that new species with other species, and we assume that a priori species were assumed exchangeable (before building that dependency model), we can estimate the goodness of our phylogenetic dependency model by using leave-one-species-out cross-validation. If you want to know the predictive performance for a specific new species for which you have extra information breaking the exchangeability you need to work harder to make that estimate.

In my opinion iid is causing lot of confusion in Bayesian statistics and especially in case of cross-validation. Exchangeability is better term, and I repeat my suggestion to read at least the relevant part of the chapter 5 in BDA3 (BDA3 has more examples on exchangeability than previous editions).


I think I’ll disagree with you here. In a typical bifurcating phylogeny, even species within a related group will have different relationships with each other, as certain members within the group will be more or less closely related. For instance, if species A, B, and C are in a group, A and B might be more closely related to each other than either is to C.

Thus, in practice, every observation of a species tends to have a unique set of covariances with the other species in the tree, making observations not fully exchangeable. If the tree has a lot of polytomies/multifurcations, then it might be safer.

Again, assuming I’ve understood this correctly, here is an example, taken from

Of the 15 groups in this tree, only 8 (the pairs) would be exchangeable. It is of course possible to have even more extreme tree structures, where no species are fully exchangeable with each other because there are no sister species pairs.

I don’t see a reason for disagreement, so maybe you can elaborate?

I’m not an expert in phylogeny and I only commented for the models in that specific vignette and I also mentioned two different prediction tasks. I’m happy to learn more about phylogenic models, so can you tell whether you disagree with some of the models in that vignette and what would your prediction task be in those cases?

No disagreement here: As I wrote, if you have extra information which breaks the exchangeability for your prediction task then you need to take into account that structure.

No disagreement here. As I wrote, observations do not need to be fully exchangeable for loo being useful. Partial exchangeability just introduces differences in different prediction tasks, and depending on that task, you may need to take into account the structure of the exchangeability.

No disagreement here. loo is still ok for estimating the predictive performance inside species (unless there is additional information about exchangeability structure there, too). In the vignette, the information from these trees was coded as covariance matrix. If we are interested predicting observations for one species given the observations from other species and that covariance matrix, then leave-one-species-out can be useful, but it will be more difficult to estimate how well we can predict for species which are much different from others (as there is less information shared via the joint prior). If there are different ways of constructing the tree and the covariance matrix, then leave-one-species-out approach would be better than leave-one-observation-out for comparing different trees and covariance matrices. In the vignette the covariance matrix was given. Species are exchangeable conditional on that tree (or covariance matrix). The usual example for conditional exchangeability is that y_i are not exchangeable if know x_i, but pairs (y_1, x_i) are exchangeable. In this tree example, you can think of a new species not yet in tree. In order to make predictions for the new species you would like to add it to the tree to make conditional predictions given the other species.

Summary: loo is not invalidated just because there is dependencies. loo can be invalid depending on the prediction task and exchangeability structure. Before using loo, think what do you want to learn and what do you know about your data and model.

Thanks for your comments Aki.

I think all of the models in that vignette are solid and the flexibility of STAN is incredibly impressive. It really can do things that few other phylogenetic methods are capable of. I’ve been enjoying the heck out of using it!

My main point is simply that the vast majority of real-world phylogenetic datasets will break exchangeability, so selecting loo() as the model comparison device of choice in the vignette is likely not the best way to teach new users how to do model comparison. In the worst case, it could impede the ability of STAN/brms to make inroads into this community, which I’d prefer to avoid.

My general impression from your 2017 paper is that WAIC has fewer issues with exchangeability. Assuming I’m correct in this, I would suggest that using it instead of loo in the vignette, with perhaps a quick caveat about loo and tree structure and/or multiple observations/species, would be wise. Or I’m wrong in which case I’d love to have a better sense about what model comparison techniques are likely to be valid for things like phylogenetics or time series analysis, etc.

I would love to examine the specific phylogeny in the vignette directly but I have not been able to find it.

Thanks for good discussion, this is useful for me, too.


Dataset and model itself do not directly invalidate loo. You have to include the decision/prediction task! I wrote about two possible prediction tasks. In one of them I still claim loo is fine, and in another one I assume cross-validation is fine. Since I’m not an expert on phylogenetics I do not know what is the goal of the modelling. Could you help with that? It would be great to have an additional example, where the prediction task is such that loo is not applicable.

NO! LOO and WAIC have exactly the same exchangeability and prediction task assumptions. The difference is only computational. WAIC’s truncated Taylor series approximation makes it biased towards “training set performance”, which asymptotically (n goes to infty, and the number of parameters is fixed) is not a problem. In finite case, and especially in case of misspecified models WAIC can fail badly and it is difficult to diagnose well when it fails. Exact LOO is computationally costly. PSIS-LOO has similar computational cost as WAIC, but doesn’t fail as easily and when it fails, it is easier to detect.

It depends on the prediction task. For time series the prediction task is often (but not always) such that we don’t have any of the future observations available when making predictions, and then m-step-ahead-prediction approach is sensible (we’ll publish a vignette for that before StanCon 2018 Helsinki). For phylogenetics I need help from someone who knows what is the decison/prediction task. If you are interested and have time, we could make a vignette together? Or add an additional example(s) to the brms vignette?


Thanks for your thoughts.

I think there’s potential for a vignette, or adding to an existing one, though I don’t think I have enough time until August. Depending on the interest level, I might suggest that you, myself, and Paul write a STAN phylogenetics paper for the evolutionary crowd, where we perform several popular phylogenetic methods in a STAN framework.

In general, with a phylogenetics model there is only very rarely interest in predicting new data points (ie. unsampled species). This is not to say that predicting these things would not be interesting, just that in the traditions of this field it is not commonly done.

The primary concern is assessing which of a given set of models fit the data best. So, prediction is only ‘interesting’ insofar as it helps one assess and compare models. In practice this is usually because if you don’t have trait information for a species you probably also do not have the genetic/taxonomic information to accurately add it onto the tree.


Sure! Are you coming to StanCon in Helsinki?

I am in as well. Would love to talk to you in Helsinki, otherwise we could think of having a skype meeting or the like.

1 Like

Please do! I think the community would benefit a lot. Something I would like to see done/discussed in a paper like that is how to incorporate phylogenetic uncertainty (maybe something like this?) and how much it matters. This could be a case where predictive checks would also be interesting; looking at how entropic predictions get with/without phylogenetic uncertainty.

Regarding model adequacy/fit specifically, I suppose it’d be interesting to compare loo with what is usually employed in the comparative method literature, which are Bayes factor/marginal likelihood-oriented methods*.

*I believe, not an expert in the comparative method.

OK. I have other commitments this year (setting up my lab!) but I will definitely make attending a future STANCon a priority!

OK, if there is such interest, I can start putting together a notebook demonstrating how to convert several popular phylogenetic comparative methods to STAN/brms and we can go from there to a paper and vignette. Aki, Paul, I’ll contact you via email shortly.

It is unfortunately still rare for people for people to generate a distribution of possible trees prior to performing a comparative analysis, but that is of course the way things should go, and the best papers certainly do this. It would be awesome to have some easy options for a posterior distribution of VCV, though I shudder a bit about what that might do to runtimes! It might also be possible to do a mean + SE for each value in a VCV, based on the posterior distribution of trees. Either of these would be very nice additions to a “phyloSTAN” paper.

It’d be interesting to see whether mean + sd of VCV would be a good option compared to, say, picking the MCC tree or some other “central” tree and the two (or ten) most distant trees from it. Re. running times, yes, incorporating uncertainty as in the link I posted would involve use of log_sum_exp and that is usually not fast at all. But it seems to me that most comparative method analyses run on average-sized phylogenies (<100 taxa) so maybe that’d be feasible.

Good luck!

I wrote some test code at one point to run the method in that link (I never got particularly far with it as more urgent work came up), but at least for small phylogenies, it seemed perfectly feasible speed wise. But if you’re wanting to do any cophylogenetic analysis (such as the stuff in Rafferty and Ives or Hadfield and Nakagawa), the slowdown was too extreme for it to be practical (at least the way I coded it, which was just a modification of Diogo Melo’s animal model with the solution in that thread added).

I wrote more about when LOO and other cross-validation approaches are valid discussing also phylogenetic tree model.