I have a dataset consisting of DNA sequence counts of approximately 1,000 species across 350 samples, along with associated sample-level environmental information. I am interested in estimating the response of each species to a specific environmental factor, but this factor is missing for some samples. However, I have information I can use to impute this missing data.
When building multivariate models of this sort in brms before, I have converted the data into ‘long’ format and estimated species-level responses to environmental factors using a hierarchical model:
brm(Count ~ env_factor1 + env_factor2 + (1 + env_factor1 + env_factor2 | Species))
I would like to run the above model with imputation for the missing data:
brm(
bf(Count ~ mi(env_factor1) + env_factor2 + (1 + mi(env_factor1) + env_factor2 | Species)) +
bf(env_factor1 | mi() ~ predictor)
)
However, with the data in long format this does not work as intended: each missing value is imputed separately for every sample/species combination, rather than once per sample.
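To make the duplication concrete, here is a minimal base-R sketch using made-up data. The `Sample` column, the species names and all values are invented for illustration, not taken from my real dataset. A single missing sample-level value becomes one missing value per species once the counts are joined to the sample table:

```r
# Hypothetical sample-level table: env_factor1 is missing for sample s2
samples <- data.frame(
  Sample      = c("s1", "s2"),
  env_factor1 = c(0.5, NA),
  predictor   = c(1.0, 2.0)
)

# Hypothetical long-format counts: one row per sample/species combination
counts <- data.frame(
  Sample  = rep(c("s1", "s2"), each = 2),
  Species = rep(c("sp_A", "sp_B"), times = 2),
  Count   = c(10, 0, 3, 8)
)

# Joining repeats each sample-level value once per species
long <- merge(counts, samples, by = "Sample")

n_missing_samples <- sum(is.na(samples$env_factor1))  # missing values per sample
n_missing_long    <- sum(is.na(long$env_factor1))     # missing values in long format
```

Here `n_missing_samples` is 1 but `n_missing_long` is 2, so the `mi()` sub-model would treat them as two separate unknowns. With roughly 1,000 species, each missing sample-level value would instead become roughly 1,000 separately imputed parameters.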
Is it possible to reformulate the above model using the brms multivariate syntax, so that each missing value is imputed once per sample using the information from all species? My understanding of the syntax suggests that it’s not possible to produce such a model, but I thought it would be worth checking before I implement this manually myself in Stan!