How to average repeated brms models into one, to please reviewer #2?

Hi all,

I have a brms model I want to fit to a dataset. However, I need to run it in a for loop so that each run picks out one random sample from each treatment, so I can deal with sample-size differences and please reviewer #2 :(

Can I wrap the subsetting code and the brms call in a single for loop, take the average of each iteration, and combine them into one final model at the end?
Any info would be much appreciated!

Here is the code for my model:

This is the subsampling code where I randomly pick one sample from each treatment

data2 = data %>%
  group_by(taxA) %>% # if 'Species' is the name of the column containing your host species names; otherwise replace with the appropriate column name
  sample_n(1) # randomly takes one row per group

#####And then this is just additional filtering

sub <- data2 %>%
  filter(!speciesA == speciesB) %>% # filter out comparisons within the same species
  rename(dietAandB = dietA:dietB) %>%
  mutate(speciesAandB = paste(speciesA, speciesB, sep = ":")) # new name to standardize

And then here is the code to run the model

fit <- brm(similarity ~ dietAandB,
           data = sub,
           family = "beta", backend = "cmdstanr", threads = 10)


I think I can answer this question, but first I want to make sure I understand:

This is the subsampling code where I randomly pick one sample from each treatment

sub <- data2 %>%
  filter(!speciesA == speciesB) %>% # filter out comparisons within the same species
  rename(dietAandB = dietA:dietB) %>%
  mutate(speciesAandB = paste(speciesA, speciesB, sep = ":")) # new name to standardize

This does not appear to randomly subset anything. Did you paste in the right snippet here?

Ack! Thank you, it's been a long day. Fixed it to show the actual subset code as well.

Cool! The safest way to combine inference across these models will be to literally concatenate the posterior draws from all models and treat the resulting (much larger) set of draws just like you would any other posterior. This new posterior will incorporate the uncertainty induced by the resampling procedure.
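If you do end up looping by hand, the concatenation step can be sketched with the posterior package. A minimal sketch, assuming the loop stored the fitted models in a list called fits (a name I'm making up here):

```r
library(posterior)

# fits is assumed to be a list of brmsfit objects, one per random subsample
draws_list <- lapply(fits, as_draws)

# Stack the draws from every model as if they were extra chains of one model
combined <- do.call(bind_draws, c(draws_list, along = "chain"))

# Summaries of `combined` now incorporate the resampling uncertainty
summarise_draws(combined)
```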

A couple of other random comments:

  • You can probably preserve a (very slightly) larger sample size by doing your filtering before doing your random subsetting. If the one row that you subset for a given taxA happens to have speciesA == speciesB then you’ll lose that row in the filtering step. If you filter first, then you’re guaranteed to end up with a row for every taxA.
  • Sparsely re-sampling pairwise comparisons can be beneficial not only to balance out sample sizes across species, but also because responses based on pairwise comparisons are non-independent (i.e. if A is similar to B and B is similar to C then A must be at least a bit similar to C), which leads you to substantially overestimate the informativeness of the data if all pairwise comparisons are modeled jointly as independent. I don’t know what reviewer 2 had in mind exactly, but this is an important consideration that is ameliorated by sparsely sampling the pairwise comparisons.
  • The brm() call above (with threads = 10 and no cores argument) will run 4 chains sequentially on 10 threads per chain. It would likely be faster to pass cores = 4, threads = 2, which would run 4 chains in parallel on 2 threads per chain.
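For concreteness, the suggested resource split might look like this (a sketch only; threading() is the brms helper for specifying within-chain parallelism):

```r
fit <- brm(similarity ~ dietAandB,
           data    = sub,
           family  = "beta",
           backend = "cmdstanr",
           chains  = 4,
           cores   = 4,             # run the 4 chains in parallel
           threads = threading(2))  # 2 within-chain threads per chain
```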

@jsocolar Wow you were able to understand my analysis surprisingly well despite my lack of detail. Appreciate it.

One question (FYI, I don't understand a lot of the modeling technical terms, but I'm guessing posterior just means the result of a model run?): when I for loop this, will I basically get fit1, fit2, fit3, fit4, and then combine these? Is there a specific command that will do this?

Cheers,
Sam


The easiest way to do this would be through brm_multiple. If you put all of the dataframes into a single list and then pass that to brm_multiple with your formula, it will automatically run the model for each dataset and concatenate the posteriors before summarising.

For example:

data_list <- list(df1, df2, df3, df4, df5)
fit <- brm_multiple(your_model_formula, data = data_list)

brms can also use the future package to estimate the models for each dataframe in parallel (check the examples in the brm_multiple documentation for more info).
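For example, the parallel version might look roughly like this (a sketch, assuming a multisession plan suits your machine):

```r
library(future)
plan(multisession)  # or another future plan suited to your setup

# future = TRUE tells brms to fit the per-dataset models via the future framework
fit <- brm_multiple(your_model_formula, data = data_list, future = TRUE)
```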


Hi @andrjohns thank you that was exactly what I was looking for.
I made 3 new datasets using the subsample code (first chunk of code in my original post)
But then when I run the brm_multiple command I get this error:

All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 1.3 seconds.

Start sampling
Error: Models 1 and 2 have different parameters.

Any ideas what this might be?


My first guess is that dietAandB might be categorical, and some of its levels might be missing in the different models, resulting in distinct sets of parameters (since each level of the factor gets its own parameter in this model). Does that seem possible?
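If that's the cause, one way to guard against it is to fix the factor's level set across all of the subsampled datasets before fitting. A sketch, assuming the datasets live in a list called data_list and the predictor column is dietAandB (the toy data below is just a stand-in):

```r
# Toy stand-in for the real subsampled datasets (replace with your own list)
data_list <- list(
  data.frame(dietAandB = c("herb:herb", "herb:carn")),
  data.frame(dietAandB = c("herb:carn", "carn:carn"))
)

# Collect every level that appears in any of the subsampled datasets
all_levels <- sort(unique(unlist(
  lapply(data_list, function(d) as.character(d$dietAandB))
)))

# Re-encode the column in each dataset with the full, shared level set
data_list <- lapply(data_list, function(d) {
  d$dietAandB <- factor(d$dietAandB, levels = all_levels)
  d
})
```

Note that a level absent from a given subsample still leaves that model with no data for the corresponding coefficient, so filtering before subsetting (as suggested earlier in the thread) is the more robust fix.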


@jsocolar Yes! Thank you. Looks like this goes back to your first bullet point as well about filtering before subsetting. I basically swapped the order and that lined up the categorical variables perfectly. Thank you @jsocolar for the insight and @andrjohns for the code. Big life saver for someone new to all this. Marking the brm_multiple command as "solution" so people can see the actual code that worked for me.
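For anyone landing here later, the reordered pipeline might look roughly like this (a sketch using the column names from this thread, assuming the full data is in `data`; the dietAandB step from the original post would slot in the same way):

```r
library(dplyr)

make_subsample <- function(data) {
  data %>%
    filter(!speciesA == speciesB) %>%                          # filter first...
    mutate(speciesAandB = paste(speciesA, speciesB, sep = ":")) %>%
    group_by(taxA) %>%
    sample_n(1) %>%                                            # ...then subsample one row per group
    ungroup()
}

# Build several random subsamples to pass to brm_multiple as a list
data_list <- lapply(1:3, function(i) make_subsample(data))
```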
