Trouble with k-fold parallelization

I am having trouble with parallelizing brms::kfold.

I ran a model (using brm) with chains = 4 and cores = 4 that took ~ 17 minutes.

I then ran:

k <- brms::kfold(fit, K = 10, folds = "stratified", group = "group",
                 chains = 4, cores = 4)

As specified here (Running folds in parallel on windows using kfold? · Issue #593 · paul-buerkner/brms · GitHub).

I assume this should fit the models the same way as the original model (between-chain parallelization, since future is not used, as opposed to either within-chain or between-fold parallelization; see kfold.brmsfit function - RDocumentation). I get a message saying “fitting model 1 of 10”, great. I expected each model to take about the same amount of time as the original, but the second model does not start fitting until more than an hour later.
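For reference, my understanding of what between-fold parallelization would look like instead is something like the sketch below (based on the linked issue; I am assuming brms dispatches the folds to future workers once a parallel plan is set, and workers = 5 is just an example):

# Sketch of between-fold parallelization (per the linked GitHub issue):
# set a parallel plan, then run the chains within each fold sequentially.
library(future)
plan(multisession, workers = 5)  # e.g. 5 folds at a time; match to available cores
k <- brms::kfold(fit, K = 10, folds = "stratified", group = "group",
                 chains = 4, cores = 1)
plan(sequential)                 # reset the plan afterwards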

I have also attempted

fit <- brms::add_criterion(fit, "kfold", K = 10, folds = "stratified", 
                           group = "group", chains = 4, cores = 4)

with the same result.

As well as

options(mc.cores = 4)
k_grp <- loo::kfold_split_stratified(K = 10, x = df$group)
k <- rstanarm::kfold(fit, folds = k_grp, cores = 1)

as suggested here: K-fold cross-validation — kfold.stanreg • rstanarm, to fit the models sequentially using 4 cores for each model (one per chain), which I believe is how my original model was run (although there is no chains argument for rstanarm::kfold, so I am curious how this is handled). This took ~ 19 minutes on average per model.

I see that brms::kfold uses the loo kfold helpers (e.g., loo::kfold_split_stratified) when folds and group are specified, but is it using rstanarm for the actual fitting? It would be nice to stay within brms if possible.

As a side note, there appears to be some disagreement between brms and rstanarm about how cores should be used to speed up computation.

brms says (https://cran.r-project.org/web/packages/brms/brms.pdf)
“Number of cores to use when executing the chains in parallel, which defaults to
1 but we recommend setting the mc.cores option to be as many processors as
the hardware and RAM allow (up to the number of chains).”

whereas rstanarm says (K-fold cross-validation — kfold.stanreg • rstanarm)
" The Markov chains for each model will be run sequentially. This will often be the most efficient option, especially if many cores are available, but in some cases it may be preferable to fit the K models sequentially and instead use the cores for the Markov chains."

I understand that one is describing fitting a single model and the other is describing k-fold, but I would think the ideas should line up, since k-fold is just fitting the model multiple times. According to Running brms models with within-chain parallelization, it is a bit more nuanced than sequential chains often being the most efficient option (specifically, “Within-chain parallelization is less efficient than between-chain parallelization”). I may be misinterpreting this, though.
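For context, my understanding is that within-chain parallelization in brms would look something like the sketch below (it needs the cmdstanr backend, and formula, df, and family are placeholders for my actual model):

# Sketch of within-chain parallelization (backend = "cmdstanr" required);
# each of the 4 chains would use 2 threads, i.e. 8 cores in total.
fit_threads <- brms::brm(
  formula, data = df, family = family,  # placeholders for the real model
  chains = 4, cores = 4,
  backend = "cmdstanr",
  threads = brms::threading(2)
)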

Any help would be greatly appreciated.

  • Operating System: Windows 10
  • brms: 2.16.1
  • loo: 2.4.1
  • rstanarm: 2.21.1

It’s possible (albeit perhaps unlikely?) for the folds to take much longer than the full model if the reduction in data results in a less nice posterior geometry that is less constrained by data and therefore includes the long tails of weak priors. In particular, the folds could require higher treedepths in the HMC trajectories, leading to big slowdowns. If this is indeed what is happening, you can verify by checking two things. First, you should see your CPU usage spike to a level indicating that 4 cores are indeed getting used to fit the folds. If not, then your problem is that for some reason the parallelization isn’t working as expected. Second, if you save the fits from the folds, you should see higher treedepths there than in the full model.
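If it helps, here is a rough sketch of how you could check this once the call finishes, assuming the per-fold fits end up in the $fits slot when save_fits = TRUE (check str(k) to confirm the exact structure):

# Sketch: keep the fold fits and compare their treedepths to the full model.
k <- brms::kfold(fit, K = 10, folds = "stratified", group = "group",
                 chains = 4, cores = 4, save_fits = TRUE)

max_treedepth <- function(x) {
  np <- brms::nuts_params(x)
  max(np$Value[np$Parameter == "treedepth__"])
}

max_treedepth(fit)                      # full model
sapply(k$fits[, "fit"], max_treedepth)  # each fold; assumes fits are in k$fits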

The point is that if you’re fitting the model multiple times, you have the option of deciding between running the chains from each fold sequentially and parallelizing over the different folds, or running the different folds sequentially and parallelizing over the different chains within each fold. By running the chains for each fold sequentially, you ensure that each core is working all the time. No core ever sits idle while it waits for chains from the same fold to finish up on other cores.

This is about functionality to parallelize the execution of the computation necessary to solve the HMC trajectory, and is unrelated to decisions about whether to parallelize over folds (with sequential chains within folds) or to parallelize over chains (with sequential folds). That is, within-chain parallelization is about parallelizing the computation that occurs within a single chain that is part of a single fold. In the context of k-fold cross-validation, you should never consider within-chain parallelization unless the number of cores available is greater than the product of the number of folds times the number of chains that you run per fold.

@jsocolar thank you for your reply.

Unfortunately I am using a remotely accessed computer, so I cannot look at the task manager while running these models. I will try on my local machine to see if the CPU usage is spiking. I have not actually allowed the brms call to finish, so I have not been able to look at the treedepths of the folds, because of how long it is taking. When I have ~ 10 hours without any models running, I will let the function finish so I can look at this (I am finishing up my thesis, so it might be some time).

It’s possible (albeit perhaps unlikely?) for the folds to take much longer than the full model if the reduction in data results in a less nice posterior geometry that is less constrained by data and therefore includes the long tails of weak priors. In particular, the folds could require higher treedepths in the HMC trajectories, leading to big slowdowns.

If this is the case, do you have any thoughts as to why the fold speeds are comparable to that of the original model when using rstanarm?

In the context of k-fold cross-validation, you should never consider within-chain parallelization unless the number of cores available is greater than the product of the number of folds times the number of chains that you run per fold.

This is great information that I was not aware of.

As I am sure you can tell, I am not well versed in the workings of parallelization. Am I correct in understanding that these are the types of parallelization that can occur:

  1. Use 1 core per chain within a fold, and run the folds sequentially (between-chain parallelization within the same fold)
  2. Use multiple cores per chain, and run the chains sequentially (within-chain parallelization)
  3. Use 1 core for each of several folds, with that core running the chains within its fold sequentially (parallelization across folds, between chains not in the same fold)
  4. Use more than 1 core for each of several folds, with 1 core per chain so the chains within a fold run at the same time (between-chain parallelization within a fold and across folds)
  5. Use more than 1 core for each of several folds and more than 1 core per chain, with the chains within a fold run sequentially (within-chain parallelization and across folds)

If this is correct, is #1 likely the best option, given that I do not have a substantial number of cores? I have access to an HPC with 16 cores. Do you think trying either #3 (which seems computationally similar to #1) or #4 may be beneficial?

Thank you for your help.

So I don’t know that this is really what’s happening, but if it is, then this behavior from rstanarm might be expected. By default, rstanarm uses stronger priors on regression coefficients than brms does. If this is what is going on, then you should see similar behavior from rstanarm and brms if you take care to specify equivalent priors.
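One quick way to check this (a sketch; fit_rstanarm stands in for a hypothetical rstanarm fit of the same model) is to compare what each package reports:

# Sketch: compare the priors actually used by each package.
brms::prior_summary(fit)               # the brms fit from the original post
rstanarm::prior_summary(fit_rstanarm)  # hypothetical rstanarm fit of the same model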

I think we can simplify this typology a little bit. There are two major types of parallelization that can occur.

  1. Use one core per chain
  2. Use multiple cores per chain.

(1) makes very efficient use of your resources, and should speed up the execution time by a factor of roughly N, where N is the number of cores you have available. (2) can give additional speedup, but almost always by a factor much less than N. So you should only think about (2) if you have more cores available than the total number of chains that you want to run (across all folds). So if you have 10 folds and 4 chains per fold, you could begin to think about (2) if you have more than 40 cores, but otherwise don’t worry about it.

Now within (1) there are really three options.

  1. Parallelize over the individual chains, without regard for what fold they belong to. This would be efficient, but you would have to manually reassemble the different posterior chains into the appropriate draws objects so that chains derived from the same fold are grouped together for assessing convergence and computing ELPD.
  2. Do different folds in parallel. On each core, do the 4 (or however many) chains that you are running per fold sequentially (see the sketch after this list). This should also be efficient, especially if the number of folds that you need to run is an integer multiple of the number of cores you have available. It will take a bit longer than (1) if (but only if) some folds finish much faster than others: if core i finishes its job before core j starts its final chain, then you could have made things go faster by putting the final chain from j on i instead.
  3. Do the multiple chains within each fold in parallel. That is, if you have 4 cores and want to run 4 chains per fold, use all 4 cores to do the first fold, then use all four cores for the second fold, and so forth. This is likely to be marginally less efficient than the above options, because none of the cores will advance to the next fold until all of the cores finish the current fold. (If you implemented the same scheme, but in such a way that the cores didn’t wait for their brethren to finish before moving on to the next fold, you would have implemented something more like option (1) above).
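If it’s useful, here is a rough sketch of option (2) done by hand rather than through kfold itself (this is not how brms implements it internally; df and fit come from your earlier posts, and the ELPD computation is omitted):

# Sketch of option (2): one fold per worker, chains sequential within each worker.
library(future.apply)
future::plan(future::multisession, workers = 5)

fold_id <- loo::kfold_split_stratified(K = 10, x = df$group)

fold_fits <- future_lapply(seq_len(10), function(k) {
  update(fit,
         newdata = df[fold_id != k, ],  # refit with fold k held out
         chains = 4, cores = 1)         # chains run sequentially on this worker
}, future.seed = TRUE)

future::plan(future::sequential)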

Thank you for the clarification, this all makes sense.

I was unaware that rstanarm uses stronger priors than brms. Since I ran the original model in brms using the default priors, do I need to worry that I am running kfold in rstanarm? I was planning on still using the estimates derived from brm for interpretation and using the k-fold ELPD to compare different models.

Additionally, since there is no chains argument for rstanarm::kfold (that I am aware of) as there is with brms::kfold, do you know how this is handled? My thinking was that, since the run times for each fold are similar to the original model, the same number of chains are being run (though I see now that this could be due to the difference in priors between the packages).

In terms of parallelizing, I’m assuming that 1.1 from your outline is not implemented in any current functions/packages?

If not, it sounds like implementing 1.2 is my best bet (currently I believe I am using 1.3; please correct me if I am wrong). To implement this, would I need to use the future package along with brms::kfold, or, if staying within rstanarm, the cores argument within kfold?

If I have more cores than folds, in 1.2, how are the extra cores used (ex: 10 folds, 16 cores)?

Yeah, I’m reasonably sure that this will yield different priors for most models. I’d do everything in either brms or rstanarm; I don’t see a reason to mix and match here. But more importantly, I’d take some time to think about the prior model and choose a set of reasonable non-default priors.
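A minimal sketch of what that could look like (the normal(0, 1) prior is purely illustrative, and you should check get_prior() for the parameter classes that actually exist in your model):

# Sketch: inspect the defaults, then refit with explicit priors.
brms::get_prior(fit$formula, data = fit$data, family = fit$family)
fit_p <- update(fit, prior = brms::prior(normal(0, 1), class = "b"),
                chains = 4, cores = 4)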

I don’t know how parallelization is handled with rstanarm::kfold.

Right. Moreover, I think this is all a bit of a distraction. The efficiency differences are going to be pretty minor. I strongly recommend just going with whatever parallelization scheme is intuitive for you to implement.

In this scenario, you would probably be better off with some other parallelization scheme. Or alternatively you could do 16-fold cross-validation instead of 10-fold, with no penalty in terms of wall time. This would also result in each fold seeing more data, so if your problem is actually due to treedepth issues caused by insufficient data and weak priors, increasing the number of folds could potentially help. Again, I’m really not sure whether that was ever related to your problem.


Thank you for all your help.

Ideally I would like to stay within brms, as I am not proficient with rstanarm. From the literature I have read, I am content with the default priors for my distribution (Dirichlet).

Sounds like I need to troubleshoot brms::kfold further to see whether the increased run time is due to the potential issues you raised. I will try to do that and report back.

For future reference should either:

k_fit <- brms::kfold(fit, K = 10, folds = "stratified", group = "group", chains = 4, cores = 4)

or

fit2 <- brms::add_criterion(fit, "kfold", K = 10, folds = "stratified", group = "group", chains = 4, cores = 4)

parallelize the k-fold process in brms? I want to make sure I am specifying everything correctly when troubleshooting.