Hi all,
I’m working with a large dataset of approximately 6,000 participants who rated about 60 ordinal items and performed a task yielding a continuous predictor. I want to fit a probit model with the brm() function using the following structure:
Rating ~ (1|item) + (1|subject) + z.continuous_predictor + gender + z.age
However, the model runs very slowly. I’m currently using cmdstanr and am considering the following options to speed things up:
- Using the pathfinder algorithm instead of MCMC.
- Splitting the 6,000 subjects into groups of 100 and running the model with brm_multiple().
Are these options reasonable? Do you have any additional suggestions?
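For context, a minimal sketch of such a call (with a placeholder data frame name dat and assuming a cumulative probit family for the ordinal ratings) might look like:

```r
library(brms)

# Ordinal probit model with crossed random intercepts for item and subject.
# `dat` is a placeholder name for the long-format data frame.
fit <- brm(
  Rating ~ z.continuous_predictor + gender + z.age + (1 | item) + (1 | subject),
  data    = dat,
  family  = cumulative("probit"),
  backend = "cmdstanr",
  chains  = 4,
  cores   = 4
)
```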
You can implement within-chain parallelization across subjects with reduce_sum():
https://mc-stan.org/docs/stan-users-guide/parallelization.html#reduce-sum
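In brms, reduce_sum() is exposed through the threads argument. A sketch (data frame name, family, and grainsize are placeholders, not your actual settings):

```r
library(brms)

# Within-chain parallelization via reduce_sum():
# here 2 chains, each split across 4 threads.
fit <- brm(
  Rating ~ z.continuous_predictor + gender + z.age + (1 | item) + (1 | subject),
  data    = dat,                           # placeholder data frame name
  family  = cumulative("probit"),
  backend = "cmdstanr",
  chains  = 2,
  cores   = 2,
  threads = threading(4, grainsize = 100)  # grainsize is a tuning parameter
)
```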
This will only take advantage of multiple CPU cores on the same node or machine. You will need map_rect() if you want to use a cluster with multiple nodes:
https://mc-stan.org/docs/stan-users-guide/parallelization.html#map-rect
Thank you. I have 8 CPU cores; should I prefer this within-chain parallelization over running 8 chains in parallel?
If so, how should I choose the exact form of within-chain parallelization?
This would add within-chain parallelization on top of between-chain parallelization, with the scheduler dividing the work across cores, though you might not see appreciable gains with only 8 CPU cores; you would just indicate how many chains to run in parallel as usual. However, I would not recommend breaking the data up into chunks, especially not as a substitute for running multiple chains the native way: the model shares the same 8 cores whether you run it in chunks or in one program.
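For concreteness, with 8 cores the two configurations would look roughly like this (same placeholder call as above):

```r
library(brms)

# A) Between-chain parallelization only: 4 chains, each on its own core.
fit_a <- brm(
  Rating ~ z.continuous_predictor + gender + z.age + (1 | item) + (1 | subject),
  data = dat, family = cumulative("probit"), backend = "cmdstanr",
  chains = 4, cores = 4
)

# B) Both combined: 4 chains x 2 threads per chain = 8 cores in total.
fit_b <- brm(
  Rating ~ z.continuous_predictor + gender + z.age + (1 | item) + (1 | subject),
  data = dat, family = cumulative("probit"), backend = "cmdstanr",
  chains = 4, cores = 4, threads = threading(2)
)
```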