Splitting large datasets to increase efficiency

haututu · March 13, 2019, 7:13am

I have some massive datasets (> 1 million rows).
To fit them more quickly, I thought of using the brm_multiple() function to fit subsets of the data and then just average everything together.

It looks like each subset is running nicely, and the combined coefficients are almost spot on with the model fitted to the single full dataset. However, the combined model reports that it has not converged, and the effective sample size is incredibly low.

library(brms)

# Simulate data (seed added so the example is reproducible)
set.seed(123)
df <- data.frame(
  outcome = c(rep(0, 5000), rep(1, 5000)),
  predictor = c(rnorm(5000, 5, 5), rnorm(5000, 10, 5)),
  df_subset = sample(1:5, 10000, replace = TRUE)
)

# Run model with single dataset
test.all <- brm(outcome ~ predictor,
                family = bernoulli(),
                cores = 3,
                chains = 3,
                data = df)

# Split dataset based on subset indicator (five groups)
df_split <- split(df, f = df$df_subset)

# Run each subset without combining to check it is performing correctly
test.multiple <- brm_multiple(outcome ~ predictor,
                              family = bernoulli(),
                              cores = 3,
                              chains = 3,
                              data = df_split,
                              combine = FALSE)

Trust me, the estimates from each subset were in a similar ballpark to the full-data model above.

# Combine to have a single model fit
test.multiple.combine <- brm_multiple(outcome ~ predictor,
                                      family = bernoulli(),
                                      cores = 3,
                                      chains = 3,
                                      data = df_split)
* brms Version: 2.7

So it seems to work okay but just looks a little weird.

If anyone has played around with a similar approach or can lend advice then I look forward to hearing it.

paul.buerkner · March 13, 2019, 7:58am

See the details section of ?brm_multiple or vignette("brms_missings") for an explanation.
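
For anyone landing here later, the gist of that explanation, as far as I understand it, is that the combined fit pools chains that were sampled on different datasets. Those chains need not overlap even when every sub-model converged, so Rhat and effective sample size for the combined object can look alarming without indicating a sampling problem. Below is a rough base-R sketch of that effect. It does not use brms at all. The helper rhat_basic(), the subset posterior means and the posterior SD are made up for illustration, and independent normal draws stand in for well-mixed MCMC chains.

```r
set.seed(2019)

# Classic (non-split) Gelman-Rubin Rhat for a draws-by-chains matrix.
# Hypothetical helper for illustration, not the brms/rstan implementation.
rhat_basic <- function(draws) {
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))   # mean within-chain variance
  B <- n * var(colMeans(draws))     # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

n_draws  <- 1000
n_chains <- 3      # chains per subset, as in the thread
post_sd  <- 0.03   # assumed posterior SD of one coefficient within a subset

# Assumed posterior means of that coefficient in the five subsets:
# close to each other, but differing by roughly one posterior SD
subset_means <- c(0.15, 0.18, 0.20, 0.22, 0.25)

# Simulate n_chains well-mixed "chains" around a given posterior mean
sim_chains <- function(mu) {
  matrix(rnorm(n_draws * n_chains, mean = mu, sd = post_sd),
         nrow = n_draws, ncol = n_chains)
}

chains_by_subset <- lapply(subset_means, sim_chains)

# Rhat computed separately for each subset (3 chains each)
rhat_each <- sapply(chains_by_subset, rhat_basic)

# Rhat computed after pooling all 15 chains into one "combined" fit
rhat_pooled <- rhat_basic(do.call(cbind, chains_by_subset))

print(round(rhat_each, 3))
print(round(rhat_pooled, 3))
```

Each per-subset Rhat should sit very close to 1, while the pooled value comes out well above the usual 1.1 threshold. For the real fit, the per-sub-model values are, I believe, stored on the combined object and can be viewed with something like round(test.multiple.combine$rhats, 2), as the vignette does. Treat that as an assumption to check against your installed brms version.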

haututu · March 13, 2019, 8:21am

Well… I’m an idiot - Thanks heaps I completely overlooked that.
