Model specification to combine two independent datasets

Hi everyone,
I am new to Stan and brms - thank you in advance for any recommendation or insights, and hopefully this post is appropriate here (my sincere apologies if it is not).

I am in the process of combining two independent, large datasets. One dataset (data1) included abundance data for n species across time. Each row in data1 represent an abundance value from a study, at a site, in a given year. The other dataset (data2) include multiple independent measures for a body size of the n species in data1. Each row in data2 has one body size estimate for a species; some species have multiple measurements for body size. Data2 also has an additional column describing the taxon for which body size is reported (e.g., plants vs. animals).

I would like to incorporate the effect of body size and taxon from data2 in the a model predicting changes in abundance ~ year from data1. I want to test how body size and taxon affect the slope of the relationship between abundance of a species and time (year).

I have limited experience with BUGS language and I am aware that there are multiple ways to tackle this question, from complex hierarchical models to more simple random effects in a GLMM fashion. I was wondering what would the stan community recommend in this specific case. I would be keen in using brms give the large size of my data table. I used dplyr::full_join() to combine data1 and data2. In this table, each abundance time series in data1 is repeated as many times as a taxon body size is reported in data2. This “final” table counts millions of rows.

The model that I am working on right now has random intercepts and slope for the effects of time and body size, a fixed random effect on species, and a fixed effect of taxon

abundance ~ year + (year | study_ID_data1/site) +
body_size + (body_size | study_ID_data2) +
(1 | species) +

I would appreciate any feedback on whether this direction makes sense or if there is a simpler way to integrate the data from data2 in predicting changes in abundance from data1. I designed this model following Harrison et al. in Peerj (A brief introduction to mixed effects modelling and multi-model inference in ecology [PeerJ]) to analyze the data with a simple glmm, but if brms has a built-in multilevel model that would work well with the sort of data that I described I would be very thankful is you could help me in figuring this out.


I fear that given the size of the dataset, Stan could be potentially too slow. However, your model is relatively simple compared to the amount of data, so an approximation like INLA (via inlabru) could be well-behaved and much faster.

In any case, I don’t think it makes sense to use a full_join to combine the datasets - this effectively puts more weight on abundance measurements of species that have more measurements of body size, which is unlikely to be desirable. I think you would need to clarify how body size (which is a property of individuals) should be related abundance (which is a property of populations). The most obvious choice would be to use something like average body size (which is a property of populations). The simple approach would just compute the average size and use that as a predictor. However, since you know that the there is uncertainty about the average body size (especially for the species with few measurements), it might be better to treat this as a measurement error model where you would compute the standard error of the mean for the species size and pass it as the measurement error.

The measurement error is itself an approximation to a joint model where you use the individual measurements of body sizes as data that inform the average body size, and then this average body size is a predictor, but such models cannot AFAIK be expressed in brms and you would need to write a custom model in Stan (or potentially using the rethinking package which lets you write Stan code using R formula syntax).

I would also suspect you might nead a species:year term (either as fixed or varying effect) as the time trends for different species can plausibly be very different.

Does that make sense?

1 Like

Dear Martin,
Thank you very much for your reply. Yes, your response makes sense. Some comments below:

  • I should have mentioned that body size values are sampled from multiple independent populations (not individuals); and, yes, averaging is on the table. The issue is that we have measurements of body size take in different ways (e.g., wingspan vs weight) for the same species, so that modeling this source of heterogeneity would be desirable.

  • Re: full_join and “putting more weight on abundance measurements of species that have more measurements of body size”, would that not be a problem using species as a random effect? Different species have different numbers of time series anyway. I am not sure I get here why it would not be possible to account for different number of measurements with random effects

  • the measurement error model looks great, thank you for sharing. Will look into it for sure.

  • yes, the model has to be able to estimate different slopes among different species, and estimate whether body size mediates these slopes.

Thank you again for your time and consideration.