Splitting large datasets to increase efficiency

I have some massive datasets (> 1 million rows).
To fit them more quickly, I thought of using the brm_multiple() function to fit the model to subsets of the data and then just average everything together.

Each subset seems to run nicely, and the combined coefficients are almost spot on with those from the model fitted to the single full dataset. However, the combined fit reports that it has not converged, and the effective sample size is incredibly low.

library(brms)

# Simulate data
set.seed(123)  # seed added so the simulated data are reproducible
df <- data.frame(
  outcome = c(rep(0, 5000), rep(1, 5000)),
  predictor = c(rnorm(5000, 5, 5), rnorm(5000, 10, 5)),
  df_subset = sample(1:5, 10000, replace = TRUE)
)

# Run model with single dataset
test.all <- brm(outcome ~ predictor,
                family = bernoulli(),
                cores = 3,
                chains = 3,
                data = df)

[image: model summary output]

# Split dataset based on subset indicator (five groups)
df_split <- split(df, f = df$df_subset)

# Run each subset without combining to check it is performing correctly
test.multiple <- brm_multiple(outcome ~ predictor,
                              family = bernoulli(),
                              cores = 3,
                              chains = 3,
                              data = df_split,
                              combine = FALSE)

Trust me, these were in a similar ballpark to the single-dataset fit above.

# Combine to have a single model fit
test.multiple.combine <- brm_multiple(outcome ~ predictor,
                                    family = bernoulli(),
                                    cores = 3,
                                    chains = 3,
                                    data = df_split)

* brms Version: 2.7

[image: model summary output]

So it seems to work okay, but the convergence diagnostics look a little weird.

If anyone has tried a similar approach or can offer advice, I look forward to hearing it.

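See the details section of ?brm_multiple or vignette("brms_missings") for an explanation. In short, the warnings are most likely false positives: chains fitted to different datasets need not overlap, so Rhat and effective sample size computed across the pooled chains can look poor even when every sub-model converged on its own.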
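To see why pooled chains can trip the diagnostics, here is a rough base-R sketch. It does not use brms at all: `simple_rhat()` is a hypothetical helper implementing a simplified, non-split Gelman-Rubin statistic (not the estimator brms or rstan actually reports), and the posterior centres and spread are made-up numbers chosen only for illustration.

```r
# Simplified (non-split) Gelman-Rubin Rhat for a draws-by-chains matrix.
# Illustrative helper only; not the estimator used by brms or rstan.
simple_rhat <- function(draws) {
  n <- nrow(draws)                     # number of draws per chain
  chain_means <- colMeans(draws)
  B <- n * var(chain_means)            # between-chain variance
  W <- mean(apply(draws, 2, var))      # mean within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
n_draws <- 1000

# Three well-mixed chains that all target the same posterior (one dataset)
same_data <- matrix(rnorm(n_draws * 3, mean = 0.20, sd = 0.01), ncol = 3)

# One chain per subset: each is fine on its own, but the posterior centres
# differ slightly from subset to subset (assumed values)
subset_means <- c(0.18, 0.19, 0.20, 0.21, 0.22)
pooled <- sapply(subset_means, function(m) rnorm(n_draws, mean = m, sd = 0.01))

rhat_same <- simple_rhat(same_data)
rhat_pooled <- simple_rhat(pooled)
print(c(same = rhat_same, pooled = rhat_pooled))
```

The first value should sit very close to 1 and the second well above the usual 1.1 threshold, even though every simulated chain is perfectly well behaved. For the real fit, ?brm_multiple suggests inspecting the per-dataset Rhat values stored in the combined object (test.multiple.combine$rhats here); whether that element is present may depend on the installed brms version, so treat it as something to check rather than a given.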


Well… I'm an idiot. Thanks heaps, I completely overlooked that.