I have some massive datasets (> 1 million rows).
To run them more quickly, I thought of using the brm_multiple() function to fit the model on subsets of the data and then just average everything together.
It looks like each subset is running nicely, and the combined coefficients are almost spot on compared with the model run on the single full dataset. However, the combined fit reports that it has not converged, and the effective sample size is incredibly low.
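My understanding (please correct me if wrong) is that Rhat compares between-chain to within-chain variance, so chains pooled from different sub-datasets will disagree slightly in location and inflate Rhat even when every individual fit converged. Here is a toy base-R sketch of that effect using the basic (non-split) Gelman-Rubin formula on made-up draws; the chain means and SDs are invented for illustration, not taken from my actual fits:

# Toy illustration: basic (non-split) Rhat for pooled chains.
# Each "chain" stands in for posterior draws of the same coefficient
# from a different data subset; the small between-subset shift in
# location inflates Rhat even though each chain is well mixed.
set.seed(1)
n <- 1000
chains <- cbind(
  rnorm(n, mean = 1.00, sd = 0.10),  # subset 1
  rnorm(n, mean = 1.05, sd = 0.10),  # subset 2
  rnorm(n, mean = 0.95, sd = 0.10)   # subset 3
)

rhat_basic <- function(draws) {
  n <- nrow(draws)
  B <- n * var(colMeans(draws))        # between-chain variance
  W <- mean(apply(draws, 2, var))      # within-chain variance
  var_plus <- (n - 1) / n * W + B / n  # pooled variance estimate
  sqrt(var_plus / W)
}

rhat_basic(chains)  # well above the usual 1.01 threshold

If that intuition is right, the high Rhat is an artefact of pooling rather than a mixing problem within any one sub-model.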
library(brms)

# Simulate data
df <- data.frame(
  outcome = c(rep(0, 5000), rep(1, 5000)),
  predictor = c(rnorm(5000, 5, 5), rnorm(5000, 10, 5)),
  df_subset = sample(1:5, 10000, replace = TRUE)
)

# Run model with the single full dataset
test.all <- brm(outcome ~ predictor, family = bernoulli(),
                cores = 3, chains = 3, data = df)
# Split the dataset based on the subset indicator (five groups)
df_split <- split(df, f = df$df_subset)

# Fit each subset without combining, to check each performs correctly
test.multiple <- brm_multiple(outcome ~ predictor, family = bernoulli(),
                              cores = 3, chains = 3, data = df_split,
                              combine = FALSE)
Trust me, the per-subset estimates were in a similar ballpark to the single-dataset fit above.
# Combine into a single model fit
test.multiple.combine <- brm_multiple(outcome ~ predictor, family = bernoulli(),
                                      cores = 3, chains = 3, data = df_split)

(brms version: 2.7)
So the approach seems to work okay, but the convergence diagnostics look a little weird.
If anyone has played around with a similar approach or can lend some advice, I look forward to hearing it.