How to speed up `brms::loo_subsample()` for large models

I am trying to perform model comparisons of 3 hierarchical models (Poisson family, log link) fit on a dataset with approximately 180k observations.

Not sure if this is helpful, but the simplest model is of the form:

formula0 <- case01 ~ -1 + var0 + prop_total123 +
    # Random intercepts for each strata
    (1|strata) +
    # Random slopes for each series
    (0 + prop_total123 | series)

And the most complex model is of the form:

formula2 <- case01 ~ -1 + var0 + prop_1 + prop_2 + prop_3
    # Interactions
    2levelfactor*prop_1 +
    2levelfactor*prop_2 +
    2levelfactor*prop_3 +
    # Random intercepts for each strata (step in series)
    (1|strata) +
    # Random slopes for each series
    (0 + prop_1 | series) +
    (0 + prop_2 | series) +
    (0 + prop_3 | series)

With the size of my dataset, fitting these models on a 32-core server with 128 gb of RAM takes anywhere from 5 to 12 hours depending on iterations/cores/threads… No error messages appear, diagnostic plots check out… but when I try to compare models I’ve hit a wall. When I use brms::add_criterion(model0, "loo"), the function maxes out 128 gb of RAM and crashes R. I then found brms::loo_subsample() which is using little RAM but has yet to spit out an answer after running for about 30 hours.

I’m new to brms and stan, so perhaps my hopes for this to work are ill-conceived? Is it unrealistic to get this to work on a model this size with this much complexity? Are there any options to speed things up? I’ve had a quick look at loo::loo_subsample() but the inputs for that function seem beyond my current knowledge of object innards. Is there any obvious info I’ve overlooked that might be helpful for you to understand what’s happening? I ommitted a fully reproducible example because fo the sheer size of the dataset… but I can always take a crack at it if required.

  • Operating System: Linux Mint 20 Cinnamon
  • brms Version: 2.18.0
  • R Version: 4.2.1

After some more reading on this forum and elsewhere (such as here and here), it seems like it might not be feasible to compute LOO when there are so many observations, as dropping a single observation out of hundreds of thousands is barely perceptible. The first link above has a good discussion on this subject.

Consequently, I have turned my efforts towards brms::kfold(), which also seems more intuitive to me as there are natural groupings by individuals within my dataset that I can seperate into more or less equal folds. However, I have some new questions regarding its use:

  1. Am I correct in my understanding that brms::kfold() is just another way to approximate the PSIS LOO information criterion? Can I call the result the same thing, just estimated a different way?
  2. I’ve come to realize brms::loo_subsample() can pass arguments to loo::loo_subsample(), of interest are the arguments that can help speed things up, like draws, observations, and cores. However, it’s not clear from the brms::kfold() docs that I can do those same things. From what I can gather, it seems I can pass chains/cores/threads arguments to brms::brm()… So what exactly does future_args accomplish, is it to run the folds in parallel while chains/cores/threads applies to within each fold? Is there a rule fo thumb of the most efficient way to set this up for large models?

Thank you for your time, I know I’ve diverged a bit from the title of the question, so I can post another thread instead… But it seems like a natural progression that other users might end up here looking to solve slow LOO computations and be redirected to kfolds.

The memory issue is probably because it tries to parallelize the computation, but even if that would not be an issue it would be slow

I don’t think it should be that slow. I’m pinging @paul.buerkner if he has time to comment or check what could explain thus.

It is also possible that you don’t need cross-validation at all. As you do have a hierarchical model, whether corss-validation is needed depends on how many observations you have per group specific parameters or whether you are interested in predictions for new groups.

It is likely that K-fold-CV with a natural grouping is easier than to get LOO to be fast enough.

See CV-FAQ What are the parts of cross-validation? and CV-FAQ How are LOOIC and elpd_loo related?

@paul.buerkner ?

I don’t know what is causing these issues in loo_subsample without a reprex unfortunately, which may be difficulat given that the problems you have are related to your model being really large. But I agree, loo_subsample should be faster. Can you get waic() to work? That could at least be a computational more feasible approximation.

With regard to your other question, indeed future allows you to run CV-folds in parallel, so you can choose you to use your multiple CPUs. For parallelization across folds or for parallelization within folds (or both).

Following up on this, and kind of building on Aki’s suggestions, I spent a good deal of time wrangling with the issue of memory limits and computational intractability of k-fold CV for my purposes in this specific project.

In my application, I have a four level model with 360,000 observation, 63 countries, 254 country-survey-years, and a crossed random effect for five year birth cohorts. After some extensive testing it turns out, you don’t need the full posterior draws for loo just a small subset of them. The following code runs on my system (Ryzen 5900X and 128GB of memory) without any issues.

# Add LOO-CV to each Model
 soc_sgc_cond_linvar_fits <- add_criterion(
    criterion = "loo",
    cores = 1,
    seed = 12345,
    ndraws = 3000

It sounds like you have fewer observations albeit a more complex model, but you should be able to adapt this to your purpose.

1 Like

Thanks for all your thoughts and suggestions. As a follow up, here is what I’ve done/tried so far:

  1. K-fold cross validation with folds stratified across groups I opted for “stratified” instead of “grouped” folds because after further reading it seems the latter drops terms for the random effects making it more difficult to interpret the standard errors for the IC. Eventually though, I suppose that “grouped” folds would be more desirable because conceptually, for my study, it does make sense to evaluate model performance on new groups (individuals). However, I’m uncertain how to do this properly and suspect it may further push me into even more uncertain terrain given my current knowledge… so for now, I am already very pleased being able to compare the relative performance of the 3 models using folds “stratified” across groups. The brms::kfold() function was also very RAM hungry, but I reduced available workers to 4 and 2 depending on model complexity and managed to estimate kfoldic within about 1 day for each model. My only follow up question about brms::kfold() is with regards to @avehtari following statements:

Would it be possible to clarify this statement? My primary objective is essentially to comapre the relative performance of my 3 models using some IC metric, given that kfold is easier to speed up than loo, it seems I should gravitate towards kfold, no? Can you elaborate on what you mean when you state that I might not need CV at all? What should I use instead if I want to compare the performance of the 3 models, and what are the rules of thumb regarding number of observations per group parameter, and what are the downstream model comparison tools in cases where there are sufficient or insufficient observations per parameter? If I do wish to estimate kfoldic using “grouped” folds, are there resources you could point me to that would explain how to do so correctly.

  1. WAIC Following @paul.buerkner suggestions, I also tried
    brms::waic() which the computer succesfully estimates much faster than brms::kfold() (approx. 10-15 minutes). However, brms::waic() does complain that between 0 and 4 % of p_waic estimates are greater than 0.4, so I am reluctant to rely entirely on these estimates. However, when i compare kfoldic and waic, they are very very close, and they lead to the same conclusions regarding how the 3 models perform relative to each other.

  2. LOO Following @ajnafa suggestions, I tried to run brms::loo_subsample() with 1 core and 5k draws. We’ll see how long it takes, so far 3 hours and counting… If I can get brms::loo_subsample() to work in a more timely maner, would this be a preferable model comparison metric compared to kfoldic and waic?

Can you show the results of K(10?-)fold-CV and WAIC when comparing the 3 models? If the differences are substantial enough, this should provide enough evidence.

With regard to loo_subsample, brms itself does very little that should take time. Certainly not more than WAIC needs. Perhaps the efficiency problems are in the loo_subsample implementation of the loo package itself? Tagging @mans_magnusson

Hi! loo_subsample() was designed for these situations and should work way faster in these settings. The memory aspect indicates that it seems to work, at least to some degree. Could you profile the loo_subsample() code? Then we could see what aspect takes time. I’m happy to try to fix this since this should not be the case. I expect it to produce an estimate within a few seconds to minutes. Especially for the simple model.


Screenshot from 2022-10-21 11-16-14

@mans_magnusson I did launch brms::loo_subsample() yesterday and it has been running a little over 24h and still nothing so I have cancelled that job and reduced ndraws to 500 for profiling, hopefully it’ll eventually give us some answers, I’ll report back here as soon as that job is done. Thanks!

That’s great! Could you put your call here or create reproducible code I could test to get the same result?

The results of KFOLD and WAIC indeed paint a very clear picture that model 1 is to be preferred over model 2 and 3.

You mean model 3 to be preferred over 1 and 2, correct?

yes I am sorry. I meant the best model over the others.

Unfortunately it seems that profvis({}) and rstudio do not like profiling large models/functions, after 3 profiling attempts, all I get is a white page and a hanging rstudio. I am not quite sure how to proceed. I am not at liberty to share the original dataset (a spatio-ecological dataset about an endangered species), and recreating dummy data with similar relationships would require an in-depth correlated random walk simulation which I just do not currently have time for. The best I can do for now is create a simple set of randomly distributed variables that recreate the hierchichal relationships in my original data, but I wonder if this might lead to different outcomes in terms of the speed in model fitting and estimation of metrics like looic. Hope this helps!


# Load packages

# Set cmdstanr file directory command
options(cmdstanr_write_stan_file_dir = "./")

# Create dummy dataset
dummy_data <- tidyr::expand_grid(
  series = 1:250,
  series_events = 1:20,
  time_step = 1:30
) %>% 
    # Response variable
    case01 = dplyr::if_else(series_events == 1, 1, 0),
    # Strata
    strata = as.numeric(forcats::as_factor(paste0(series, "_", time_step))),
    # Prop
    prop_total123 = runif(dplyr::n()),
    # var0
    var0 = rexp(dplyr::n(), 0.001)

# Define dummy priors
dummy_priors <- c(
  prior(normal(0, 1), class = b, coef = var0),
  prior(normal(0, 10), class = b, coef = prop_total123),
  prior(normal(0, 10000), class = sd, coef = Intercept, group = strata),
  prior(normal(0, 10), class = sd)

# Fit model
nullmod <- brms::brm(
  formula = case01 ~
    -1 + var0 + prop_total123 +
    # Random intercepts for each strata
    (1|strata) +
    # Random slopes for each series
    (0 + prop_total123 | series),
  data = dummy_data,
  family = poisson(link = "log"),
  prior = dummy_priors,
  init = 0,
  chains = 10, 
  iter = 2000, 
  warmup = 1000,
  thin = 1,
  cores = 20,
  threads = brms::threading(2),
  backend = "cmdstanr",
  seed = 12345,
  control = list(adapt_delta = 0.999),
  file = "./nullmod"

# As per brms::kfold() docs, avoid model recompilations here
# by restarting R, attaching brms, setting cmdstanr file directory,
# and rerunning brms::brm() to import nullmod

nullmod_loo <- brms::loo_subsample(
  x = nullmod,
  cores = 1,
  seed = 12345,
  ndraws = 500


Does this replicate the behaviour you described above? Ie is slow in the same way?

If you compute log score without cross-validation it is optimistic as the same data is used to fit the model and compute the predictive log score. If the model is reasonably well specified, non-singular, and the number of parameters doesn’t increase with the number of observations, then asymptotically the amount of optimism is close to the total number of parameters (this coincides with the optimism estimate in AIC). Let’s say you have models with 10 and 100 parameters. If we assume the number of observations is big enough we can approximate the effective number of parameters with 10 and 100, and if after using these for adjustment the difference in the log scores is big, then the result with cross-validation would probably be the same. There is no strict threshold of how many observations per parameter you would need, but in many cases 100 observations per parameter is a lot. Even in case of hierarchical models with groups having their own parameters and some of the groups having only a few observations, it is likely that this simple correction is sufficient, and the true effective number of parameters would be smaller than the total number of parameters. Things get more complicated if you are interested in the leave-one-group-out cross-validation instead of LOO-CV. Sorry, I don’t have a good reference explaining this in more detail and with good clarity.

LOO is always preferred over waic. kfold can still be better if you are interested in the leave-one-group-out