How to deal with a large dataset in the context of Bayesian modeling

Hi all,

I’m working with a large dataset of approximately 6,000 participants who rated about 60 ordinal items and performed a task yielding a continuous predictor. I want to fit a probit model using the brm() function with the following structure:

Rating ~ (1|item) + (1|subject) + z.continuous_predictor + gender + z.age

However, the model runs very slowly. I’m currently using cmdstanr and considering the following options to speed up fitting:

  1. Using the Pathfinder algorithm instead of MCMC.
  2. Splitting the 6,000 subjects into groups of 100 and fitting the model with brm_multiple().

Are these options reasonable? Do you have any additional suggestions?

You can implement within-chain parallelization across subjects with reduce_sum():

https://mc-stan.org/docs/stan-users-guide/parallelization.html#reduce-sum
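With brms you don’t need to write the reduce_sum() code yourself: passing a threading() specification makes brms generate it for you. A minimal sketch, assuming a data frame `d` with the columns named in the formula above and a cumulative-probit family for the ordinal ratings (both assumptions, not stated in the original post):

```r
library(brms)

# Sketch: within-chain parallelization via reduce_sum(), as generated by brms
# when `threads = threading()` is supplied. `d`, the column names, and the
# cumulative("probit") family are assumptions based on the description above.
fit <- brm(
  Rating ~ (1 | item) + (1 | subject) + z.continuous_predictor + gender + z.age,
  data    = d,
  family  = cumulative("probit"),  # ordinal ratings with a probit link
  backend = "cmdstanr",            # threading requires the cmdstanr backend
  threads = threading(2)           # 2 threads per chain
)
```

The optional `grainsize` argument of threading() controls how the data are sliced across threads; the default lets Stan choose.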

This will only take advantage of multiple CPU cores on the same node or machine. You will need map_rect() if you want to use a cluster with multiple nodes:

https://mc-stan.org/docs/stan-users-guide/parallelization.html#map-rect


Thank you. I have 8 CPU cores; should I prefer this within-chain parallelization over running 8 chains in parallel?
If so, how should I choose the exact form of within-chain parallelization?

This would add within-chain parallelization on top of across-chain parallelization, with the work distributed by a scheduler, though with only 8 CPU cores you might not see appreciable gains; you would just specify how many chains to run in parallel as usual. However, I don’t recommend breaking the data up into chunks, especially not as a substitute for running multiple chains the native way: the same 8 cores are shared whether you fit the data in chunks or in one program.
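One way to combine the two on 8 cores is sketched below. The 4-chains-by-2-threads split is an illustrative assumption, not a recommendation from the thread, and `d` and the family are likewise assumed from the original question:

```r
library(brms)

# Sketch: across-chain plus within-chain parallelization on 8 CPU cores.
# 4 chains run in parallel, each using 2 reduce_sum threads (4 x 2 = 8).
fit <- brm(
  Rating ~ (1 | item) + (1 | subject) + z.continuous_predictor + gender + z.age,
  data    = d,                     # hypothetical data frame
  family  = cumulative("probit"),  # assumed ordinal probit family
  backend = "cmdstanr",
  chains  = 4,
  cores   = 4,                     # number of chains run in parallel
  threads = threading(2)           # threads per chain
)
```

Whether 8 chains with 1 thread each or 4 chains with 2 threads each is faster depends on the model; with a large dataset and few chains to wait on, the threaded configuration can reduce wall-clock time per chain.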