Incorporating a species occurence matrix as predictor in brms

Dear Paul,

I have a small problem trying to fit a predictor in a brms model, and I was wondering if you could provide any help?

I’m trying to fit a simple multi-level model in brms using the following formula:

y~B1+(1|random), where
y= a response variable with count data, following a poisson distribution
B1: a species occurence matrix with 18 rows and 15 columns
random intercept: Site effect

My questions is: Is there any way to incorporate this species occurence matrix as a predictor of my response variable in brms?

A reproductible example of my dataset can be done using the following code:
#creating a species occurence matrix (variable B1)
df ← matrix(sample(c(0, 1), 180, replace = TRUE), 18)
colnames(df) ← letters[1:10]
#creating the y variable

#The type of models I intend to create should look like this
M1=brm(y~df+(1|rownumber(y)), family=poisson())

Many thanks!


Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
brms Version: 2.13.5
R version: 4.0.0

1 Like

How is the matrix meant to predict the response? For a given observation y[i], what is/are the relevant covariates (what postion(s) do they occupy in the matrix)?

What I had in mind was having a unique effect size corresponding to the general dissimilarity between communities amongst sites in my database, either by using a species-occurence matrix or a beta-diversity matrix to understand how does the presence of some species influence my response variable.

If your predictor is a dissimilarity between a pair of sites, then your response needs to be a property of pairs of sites, rather than of single sites. If the response is a property of a single site (e.g. net primary productivity), then it doesn’t make sense to use dissimilarities to predict the response–each site has N-1 dissimilarities associated with it, one for every other site. Which one do you use to predict the response? (That’s a rhetorical question; there’s no good choice).

Generalized dissimilarity models (see here deal with modeling dissimilarities between sites based on the distances or dissimilarities between sites in other covariates. For a maximum likelihood implementation, see R package gdm. Uncertainty quantification in these models is nontrivial because of the dependency structure that exists in the response variable. Current state of the art for Bayesian-style uncertainty quantitfication is maximum likelihood estimation coupled with the Bayesian bootstrap; see here:

1 Like

There’s also this:

I’ve recently implemented this type of model via brms and it worked reasonably well, but there were some problems in making the response distribution match the distribution of distances and convergence was a bit problematic to achieve, so cannot recommend 100%, although I find the approach theoretically appealing.

Happy to share a bit more detail if you want to try going this way.

1 Like

just add df as a matrix column to your data.frame and use it in the formula.

This just treats each column of the matrix as a covariate, right? This is in line with what the OP seems to say here:

but at odds with what the OP says here

Note that the simulated data contain a response vector of length 18, an observation-level random effect, and then a matrix predictor with 10 columns, so fitting each column as an independent covariate would probably require some really strong priors.

@martinmodrak I’m really interested in this! But I’m also immediately a bit confused. After we partial out the variation due to the covariates (so from here on I’ll assume a model with no covariates), it seems like the assumption here is that the variation can be partitioned into a set of independent residuals and then a set of additive terms along a one-dimensional latent gradient (where sites’ positions along the gradient are given by their site-specific random effects). And along this gradient, everything needs to be additive. That is, we assume that we can decompose the variation into a set of independent residuals and a latent gradient along which the distance between A and C is the sum of dist(A, B) and dist(B, C). Do I have this right (I assume that I do not…)? If so, I don’t see how this method captures the residual dependency structure in pairwise compositional dissimilarities, because the dependency structure generally won’t be one-dimensional.

Put another way, it seems that the claim here, translated to frequentist methods, is that we can replace mantel tests with site-specific intercepts. I’m pretty sure that’s incorrect, and so I suspect that I’m failing to grasp something important about the approach.

1 Like

First, sorry @Afilion for hijacking your thread a bit - if you are not interested in the details of the Dias et al. approach, we can definitely move the discussion elsewhere.

I think you are right, in that the decomposition without covaraites is 1-dimensional. My current thinking about this: if I were to implement something like a 1-dimensional PCoA/MDS as a linear model, it would be:

dist_{i,j} \simeq |\alpha_i - \alpha_j|

where \alpha are site-specific intercepts, which I could model as “random” i.e. having a hyper-prior. Such a model would in fact be quite annoying to fit in a Bayesian setting as it could easily be irreducibly multimodal.

However, the raw linear model Dias et al. propose would be different from that as they turn the difference into addition to get:

dist_{i,j} \simeq \beta_0 + \alpha_i + \alpha_j

I am not sure if this model actually resembles PCoA in a meaningful way, but it seems to still be some 1D decomposition.

However, I don’t think treating the residual as having just 1D structure is that bad: in many cases, the dissimilarity data can be neatly embedded in just 2-3 dimensions so if I can hope that my covariates somewhat “capture” (without any precise mathematical meaning) 1 - 2 dimensions, a 1D residual structure seems pretty OK.

The appealing part is that I can know use the rich available options for linear and hierarchical structures which would be difficult to take into account in PERMANOVA designs (I’ve used vegan::adonis a few times, but I have no experience with Mantel tests, so can’t comment on those). E.g. the killer feature for me was that in a longitudinal study I can now use the distance at baseline as a predictor for distances at later time points.

My single practical experience is that such a model can represent the distance matrix resonably well and the results are in-line with those obtained by different methods. The per-site varying intercepts should hopefully be enough to avoid the model being overconfident. Also my impression (though I am not very confident about this) was that the variance-explained values you get from vegan::adonis in fact are the same you would get by treating the distances as independent observations and running a normal ANOVA with differences in predictors as independent variables - only the p-values seem to actually reflect the matrix structure of the data.

1 Like

Thank you very much all for your swift and very helpfull suggestions!

After a few try, I decided to slightly modify Paul’s suggestion according to @jsocolar concern about the small size of the database (witch, without strong priors, leads to significant convergence problems). In the end, I randomly sampled 6 species from my community matrix (which seemed to be the maximum I could have without problems) and bootstrapped the random sampler 100 times, yielding a list of 100 community matrix of 18 sites and 6 species each. Afterward, I ran each database in the list as a predictor in my model, and used leave-one out cross validation to weight every model. Then, I used the loo_model_weigts function to stack every model, and selected all that were the best fit for my data. In my case, it was pretty obvious that the pattern of my response variable was driven by only two species, and I fully agree that this solution is probably not the most elegant nor the most practical.

An alternative would be, I believe, to follow the sugestion of @martinmodrak and use an RDA to work on environmentaly-constrained axis, and extract the residual variance for species from the unconstrained PC axis. It would then be possible to predict a global variance of species by sites. Again, I’m not very confident this represent a wise solution to this problem.

Nevertheless, I quite liked the proposition of both @jsocolar suggestion in using GDM’s in a bayesian framework and @martinmodrak hierarchical community modelling, and I plan to try and work with those in the future to make a more elegant model!

Feel free to let me know what you think (or let me know if I’ve made assumptions that should not be done)!



Jumping in here because I have a strong interest in finding an answer to this problem! For context in my case, we are working on an experiment where we want to model the disease severity of a tree (ranges from 0-1) as a function of the composition of its microbiome, and identify particular microbial species that may reduce severity. It’s possible that we will recover 10,000 microbial species. We know that the disease is caused by three particular bacterial species.

I’ve been thinking about various ways to model the effect of community composition on a given response variable over the past little while, because it’s definitely not straightforward.

I think the most obvious solution is to model the response variable using a linear model, with an individual coefficient for the presence or absence of each species. With 10,000 species though, that would require some kind of insane regularising prior for each coefficient, as @jsocolar says, and/or to model each of those coefficients using some kind of varying effect. And that’s not even getting in to the possibility of species interactions - including all pairwise species interactions would add an insane number of additional parameters. That said, this approach is similar to a GWAS (could call it an “MWAS”), so there may be approaches that simplify this, though the specifics of a microbial community versus a genome are obviously very different and have different constraints. GWAS also uses techniques like Mendelian randomisation to infer causality, which I am not sure can be directly translated outside of a genetic context.

The only other realistic option I’ve come up with is to go the other way, modelling community composition as a function of the response variable. Taking the approach of a package such as boral and implementing a GLLVM (but in Stan, of course!), one could look at the residual correlations (conditional on severity) among species to try and infer interactions among species, (in my case looking for negative interactions with the three disease-causing species). However, while this may be able to show some kind of interaction, it wouldn’t give a directionality (who suppresses who), and it also requires prior knowledge about species that cause disease, or your response variable of choice.

I don’t know if any of these thoughts are useful to you, @Afilion (I’m reasonably new to the world of Bayesian modelling), but if anyone has any more insight, I’d be very interested in finding an approach that allows as direct an inference of the effect of each individual species as possible.


I think @prototaxites is spot on.

Yes, but that also means GWAS-like sample sizes (AFAIK at least hundreds, probably more), which are as I understand completely infeasible for most microbiome studies.

Barring extraordinary increase in available sample sizes, I think purely data-driven approaches to determine associations between large communities and outcomes are just not really possible and you need to bring more theory/background knowledge to the table. The only thing that comes close to this that I am aware of (although I am not an expert and I believe @jsocolar to be better informed) is to do some sort of dimensionality reduction (e.g. NMDS or PCA/supervised PCA) and then use the positions of the samples in the reduced-dimensional space as predictors. In some cases (at least for PCA), one could then even project the fitted coefficients back onto the original space. I however expect that this brings with itself some substantial unverifiable assumptions as well, I am just not completely sure what they are.

If you can bring more theory/background knowledge to the table, than I think an attractive approach is using the log-ratio of abundance of selected species/species groups as predictors as those quantities respect the compositional nature of metagenomic data. I’ve recently helped with a project where they developed such a biomarker by using a mix of background knowledge and squinting at abundance heatmaps to determine “good” and “bad” bacteria and then use log(sum(abundance[good])) - log(sum(abundance[bad])) as the marker. And it proved quite robust even for out-of-sample data (much better than any species in isolation or just the sums). I’ve found this preprint on the topic quite helpful [2104.07266] A Critique of Differential Abundance Analysis, and Advocacy for an Alternative , although I am somewhat skeptical about the possibility of finding good markers automatically (without massively increasing the available sample sizes).


That’s very true. In our case the study is of unprecedented size (350 trees), with a balanced design for disease severity. This is hopefully amenable to this kind of approach, though I think it will be a “give it a go and see” type of situation.

This is a really interesting idea. The GLLVM approach I linked above reduces to a PCA in its simplest case, so this might be worth further investigation, though of course it would require some detailed thought!

Another idea that occurred to me over lunch would be to reduce the community data into groups of co-occurring species using some kind of network/clustering approach and use the linear modelling approach with these. Particularly with microbial communities, functions for particular metabolic pathways are shared across multiple species, and thus you get groups of species that almost always co-occur. That kind of collinearity would make the “MWAS” approach difficult to read out anyway, so reducing the number of “species” to a smaller set of species clusters might make the approach more feasible.


There’s a lot to mull over here, but some clear ways forward.
A lot hinges on whether you expect the effect size to ultimately come down to the abundance of a small number of species, the (relative) abundance of some broader phylogenetic groups, or the location of a community in ordination space in the broadest sense. Note that in any of these cases, inferring causality is difficult absent rigorous experimental control. In GWAS, the genome is generally assumed not to be under the influence of additional causal factors that influence the trait (not that this assumption is not necessarily sound, for example if the trait under study is influenced by factors related to economic disparities between ethnicities). In your case, it’s not too hard to imagine that the microbiome will be influenced by environmental factors that also directly causally influence the disease severity.

With that said:

  • Searching for a few species that have causal influences under assumptions of sparsity feels very analogous to GWAS. With careful experimental design including a strict data withhold for model validation, it seems like this could be possible.
  • Searching for associations with the location of a community in broad-sense ordination space will require some kind of dimensionality reduction (that’s what ordination is, after all). Usually, but not always, PCA is not the most informative way to reduce the dimensionality of compositional data, because the data are often sparse and PCA gives undue weight to shared zeros. However, PCA does have some nice analytic properties that might be useful. Personally, I’m a fan of NMDS, which is often criticized in these contexts for its lack of interpretability as a distance (but this is debated), but has the advantage that you can think carefully about the right dissimilarity metric to use for your problem. GLLVM (what’s implemented in boral) is a model-based ordination technique that takes as the ordination axes the columns of a reduced-rank approximation to the covariance matrix. In your context, its main advantage would be that you can fit everything in one step and allow the relationships to your outcome variable to inform the ordination itself in a self-consistent Bayesian framework. An possible additional advantage is that you can include covariates. So if you think that some shared environmental predictor is affecting both disease and the microbiome, you can do the ordination on the residuals of the community composition after the environmental factor is partialed out, again in a self-consistent Bayesian framework.
  • There is an in-between option that I strongly recommend checking out. Unfortunately it’s non-Bayesian and has nothing to do with Stan… My friend Alex Washburne has an algorithm called phylofactorization (Ecological Monographs paper here; R package here). If you take your metagenomic data and put it on a phylogeny, this algorithm will identify entire clades (not necessarily monophyletic) whose overall relative abundance is associated with your response variable. So if you think that what really matters isn’t the presence-absence of a few key OTUs, and also isn’t ordination in the broadest sense, but rather is the (relative) abundance of certain groups of phylogenetically related OTUs, then phylofactorization will identify those groups for you. I’m a little bit fuzzy on the degree to which phylofactorization can be used for testing hypotheses versus as an exploratory tool. You might find it prudent here as well to implement a strict data withhold for model-based validation of the relationships recovered by the phylofactorization (which at its heart is based on a greedy optimization algorithm).

Edit: note that phylofactorization was developed especially for doing meaningful analysis on microbiome datasets, and has been used extensively for that purpose.

Edit2: For anybody else who comes across this thread, a good introduction to the GLLVM approach is David Warton’s really nice paper from a couple years back. It’s a very cool technique that I think should be amenable to sampling with Stan.