Questions About Approximate Leave-One-Out Validation for Large Models

While fitting some of the growth curve models (see Fairbrother 2014) for one of the chapters in my dissertation, I seem to have run into an issue: the models are too large (I'm not sure whether this is because I have 350,000 observations across 65 countries and 249 surveys, or because I have 7,000 post-warmup draws per chain) to perform approximate leave-one-out validation in memory. The models are fit via brms as follows:

# Model 3: Linear Trend and Varying Time Slope with Respondent Level Predictors
hlogit_3 <- bf(
  support ~ female_wi + educ + age_cat + soc_pctfemleg_wi + soc_polyarchy_wi +
    time_gmc + (time_gmc | country) + (1 | cntry_prj_year),
  family = bernoulli(link = "logit"),
  decomp = "QR"
)

# Specify Priors for model 3
hlogit_3_priors <-
  prior(student_t(3, 0, 2.5), class = "Intercept") +
  prior(normal(0, 3), class = "b") +
  prior(exponential(0.5), class = "sd") +
  prior(lkj(3), class = "cor")

# Fit the Model using Within-Chain Threading (6 chains, 10k iterations)
fit_hlogit_3 <- brm(
  formula = hlogit_3,
  prior = hlogit_3_priors,
  data = model_df,
  cores = 12,
  chains = 6,
  iter = 10000,
  warmup = 3000,
  threads = threading(threads = 2, grainsize = 100),
  save_pars = save_pars(all = TRUE),
  seed = 123,
  backend = "cmdstanr",
  save_model = "analyses/models/Electoral Democracies/Stan/HLogit_3.stan",
  file = "analyses/models/Electoral Democracies/HLogit_3"
)

# Add LOO and Bayes R2 to the Model
fit_hlogit_3 <- add_criterion(
  fit_hlogit_3,
  model_name = "Societal Growth Curve Model 3",
  criterion = c("loo", "bayes_R2"),
  cores = 1,
  file = "analyses/models/Electoral Democracies/HLogit_3"
)

The model converges without any issues, but when I try to add LOO and Bayes R^{2} I immediately get the error

Error: cannot allocate vector of size 109.3 Gb

I tried leave-one-out subsampling parallelized across 12 cores (the CPU is a Ryzen 9 5900X):

# Add LOO to the Model
fit_hlogit_3 <- add_criterion(
  fit_hlogit_3,
  model_name = "Societal Growth Curve Model 3",
  criterion = "loo_subsample",
  observations = 100000,
  cores = 12,
  file = "analyses/models/Electoral Democracies/HLogit_3"
)

but for some reason it takes an excruciatingly long time regardless of what the observations argument is set to, and I have no idea why (as of posting this, it has been running for longer than it took to fit the model). Is there some reason it takes so long, and are there any workarounds that would make this more computationally tractable? Any suggestions are greatly appreciated, because I’m hoping to avoid upgrading my memory until the new DDR5 standard is released.

Tagging @avehtari, @andrewgelman, and @jonah on this one since it concerns loo.


I think I managed to find a solution to this issue by setting ndraws to a value less than 5,000 and cores = 1, which seems to be the maximum I can compute in memory with 64GB. So for anyone else who runs into this issue, the following worked for me:

# Add LOO, WAIC, and Bayes R2 to the Model
fit_hlogit_null <- add_criterion(
  fit_hlogit_null,
  model_name = "Multilevel Logistic Regression Null Model",
  criterion = c("loo", "waic", "bayes_R2"),
  ndraws = 4000, # I get errors if I set this higher than ~4600 or so
  cores = 1,
  file = "analyses/models/Electoral Democracies/HLogit_Null"
)

Also, if memory is the issue, then running loo with the pointwise = TRUE option is the way to go. It’s less memory intensive, at the cost of being much slower.
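For reference, a minimal sketch of what that call looks like for the model in this thread (object name taken from the code above):

```r
# Sketch: pointwise = TRUE computes the log-likelihood one observation at a
# time instead of materialising the full draws-by-observations matrix, so
# peak memory stays low at the cost of a much longer runtime.
loo_pw <- loo(fit_hlogit_3, pointwise = TRUE, cores = 1)
```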

Well, the code I posted above just failed after an hour with Error: cannot allocate vector of size 10.4 Gb, so I’ll give pointwise = TRUE a try. It can’t take any longer than the two full days loo_subsample had been running for. Thanks

Why 7000 draws per chain? Usually, 1000 warmup and 1000 saved iterations is more than enough.

I had originally been using 2,000 warmup and 2,000 saved draws, but I’m using two different metrics for model evaluation/comparison, LOO-CV and inclusion Bayes factors via bayestestR, and apparently the bridge sampling algorithm tends to be unstable without a fairly large number of post-warmup draws (bayestestR throws a warning if you have fewer than 40,000).

Very cool! How much time does the sampling take? This is the kind of model that we can eventually make Stan faster for, so it would be interesting to know the current speed.

Yep — that is 42,000 × 350,000 × 8 bytes, which is the size of the log_lik matrix you tried to create.
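As a quick back-of-the-envelope check (using the 349,228 observations loo reports later in the thread):

```r
# log_lik is stored as doubles (8 bytes each)
draws <- 6 * 7000          # 6 chains x 7000 post-warmup draws = 42,000
n_obs <- 349228            # observations actually used by the model
draws * n_obs * 8 / 2^30   # ~109.3 GiB, matching the allocation error

# a smaller ndraws shrinks the matrix proportionally, e.g.
1000 * n_obs * 8 / 2^30    # ~2.6 GiB
```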

Did you try with something like observations = 100 ? 100000 is certainly overkill for subsampling, as the idea is that just a small proportion of the total number of observations would be needed. The one challenge is that subsampling is also computing something for all observations which can be slower than what you would expect.
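Something along these lines (a sketch; the argument values are just illustrative):

```r
# Sketch: subsampling LOO with a small subsample. A cheap point-estimate-based
# approximation is still computed for every observation, but the full PSIS
# computation is only done for the sampled subset.
loo_ss <- loo_subsample(fit_hlogit_3, observations = 100, cores = 1)
```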

And this is probably still more than enough for loo and bayes_R2.

Computing waic if you have computed loo is redundant.

Ouch. Probably at that line R has the original big matrix, at least one temporary copy, and space reserved for the result. So you could try ndraws = 1000, which is plenty, and then it’s likely to fit in your memory.

Also, you could consider using K-fold-CV as then you would know that the computation time is very close to K times your original computation time. You might then consider using structured data division, so that you can estimate how well you can generalize between countries and surveys.
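In brms this could look roughly like the following (the grouping variable and K are assumptions to adapt):

```r
# Sketch: structured K-fold CV that leaves out whole groups at a time, so the
# score estimates generalization to new countries rather than new respondents.
folds_country <- loo::kfold_split_grouped(K = 10, x = model_df$country)
kf_country <- kfold(fit_hlogit_3, folds = folds_country)
```

With grouped folds, each refit's computation time is predictable (roughly K times the original fit divided across folds), unlike the subsampling approach above.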

Furthermore, as your number of observations is very large, it is possible that you don’t benefit much from LOO-CV (leaving out one observation probably has a very small effect). Did you get the bayes_R2 result already? If the distribution of bayes_R2 is narrow, then LOO-CV is probably not needed (of course this would be even clearer if we knew the LOO estimate p_loo).


I’ve been fitting these using only the CPU because brms doesn’t perform well for multilevel models with OpenCL (this is due to brms’ use of a for loop in the likelihood section of the model block), but I’ve managed to shave off a few hours by using within-chain threading and QR decomposition. The model above (which isn’t technically the full specification, because I still need to add the group means and time interactions, but I’ve been building this in stages to make debugging easier) takes about seven hours in total, ~2 for warmup and ~5.5 for sampling.

  chain_id warmup sampling total
        1   6544    20412  26956
        2   6512    20243  26755
        3   6852    20472  27324
        4   6892    20439  27330
        5   6937    20447  27384
        6   6398    20428  26826

I’ve done some testing with a vectorized version of the model, and GPU-based computation shaves off about two or three hours, but those gains will likely be more sizable once support for OpenCL indexing in PR 3053 is merged. The downside is that you can’t really use any of the convenient post-processing functions that interfaces like brms offer if you fit the model with OpenCL in cmdstanr.

I did finally manage to get loo and Bayes R^2 to compute in memory with the following settings:

# Add LOO and Bayes R2 to the Model
fit_hlogit_3 <- add_criterion(
  fit_hlogit_3,
  criterion = c("loo", "bayes_R2"),
  ndraws = 3000, # ~3400 or so seems to be the ceiling with 64GB of Memory
  cores = 1,
  file = "analyses/models/Electoral Democracies/HLogit_3"
)


Computed from 3000 by 349228 log-likelihood matrix

          Estimate    SE
elpd_loo -205840.6 245.5
p_loo        250.5   0.5
looic     411681.2 490.9
Monte Carlo SE of elpd_loo is 0.3.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Bayes R^{2} is 0.1908 with a 95% interval of [0.1889, 0.1927].

As far as K-fold validation goes, I’ve given some thought to it because it could provide a good way to evaluate some of the assumptions about measurement equivalence across regions, and the possibility that there may be some systematic differences between the two categories of democracy (electoral and liberal) that I base my case selection on. In the latter case, I would fit the model with just the liberal democracies (N ≈ 207,000) and use the electoral democracies as a held-out sample (N ≈ 132,000), but k-fold validation would be a great approach to comparing predictive performance across regions. I’m certainly open to any suggestions as to what approach may work best on that front.


Thanks for the additional details

If you were to drop bridge sampling, 3000 iterations would be plenty: you could run just 500 post-warmup iterations per chain (and maybe use a shorter warmup, too) and get much faster sampling.

3000 draws is more than enough. p_loo is very small compared to the number of observations and to elpd_loo, so you could just compute the posterior lpd.

There is not much use in reporting more than two digits, so you could report Bayes-R^2 = 0.19. If you computed LOO-R^2, I’m very confident that you would also get 0.19 (the other digits might differ). So here the full posterior and the LOO posterior are so similar that you can just use the full posterior.

But the differences can be bigger between the full posterior and, e.g., leave-region-out, as you would be leaving more data out.


Alright, thank you. This is probably the approach I’ll take, since leave-one-region-out k-fold validation will be a more meaningful way to evaluate the differences. I need to try tweaking the number of warmup iterations and see what the minimum is that I can get away with without running into divergence/max tree-depth warnings. I had set it to 3,000 because I knew that would be more than enough based on a pared-down version of the model (1,250 warmup iterations per chain was the lowest number that yielded no warnings in that case), but I might be able to get away with 1,500-2,000 or so without any issues, and 500 post-warmup draws would certainly be faster (I could also technically use 8 chains instead of 6, since I have 12 physical cores and 24 threads). Thank you very much for the suggestions.
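For my own reference, the refit I have in mind looks something like this (the iteration counts are guesses I still need to validate):

```r
# Sketch: refit with a shorter warmup and 500 post-warmup draws per chain.
# These counts are untested assumptions; I still need to check for
# divergences and max tree-depth warnings at these settings.
fit_hlogit_3_fast <- update(
  fit_hlogit_3,
  chains = 6,
  warmup = 1500,
  iter = 2000,   # 1,500 warmup + 500 sampling per chain
  cores = 12
)
```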


I have a brief follow-up question on this regarding k-fold cross-validation.

I had planned on using six-fold cross-validation by splitting the data into geographic regions, but I’m wondering if it’s an issue that this results in severely unbalanced subsets (they range from 4% of the data to 37%).

# Split the data into six folds by geographic region
k_regions <- loo::kfold_split_grouped(K = 6, x = model_df$region)

# Check balance of the data
 na_if(table(model_df$region, k_regions)/nrow(model_df), 0)*100
                                         1      2      3      4      5      6
  Eastern Europe and Central Asia   33.122                                   
  Latin America and the Caribbean          15.167                            
  The Middle East and North Africa                 4.003                     
  Sub-Saharan Africa                                      5.984              
  Western Europe and North America                              37.303       
  Asia and Pacific                                                      4.422

Given the number of levels in the random effects, full leave-one-group-out validation is computationally intractable (it would require fitting each model 63 times), so that’s not really an option. The alternative would be to split the data randomly into five or so groups of countries, which would give roughly balanced held-out sets.

An alternative might be simple held-out validation, using the most recent wave of the data as the test set, which has the benefit of being easily justifiable on theoretical grounds and is by far the most computationally efficient option.

I’m wondering what the best option might be here?


I think your analysis of the situation is good. I would probably go with the balanced random design, but the simple held-out most-recent-wave approach is a sensible choice, too.


For the time being I went with a final-wave-per-country approach that entails holding out the last wave for every country with at least three waves, since it has some nice properties (i.e., it is roughly 25% of the data and is about evenly divided between the two survey projects). I caved and upgraded to 128GB of memory, so I’ll see if that makes the random five-fold cross-validation a bit more computationally tractable (it should allow for more efficient parallelization via future).

For the purposes of simple held-out validation, am I correct in assuming that in the context of brms I just need to fit a separate model on the training data and then run loo, passing the test set to the newdata argument, or is there more to it than that?

In the case of held-out data, you don’t need loo. The loo vignette “Holdout validation and K-fold cross-validation of Stan programs with the loo package” discusses hold-out and K-fold-CV, and should help you see what you need from brms.
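The basic pattern, sketched with assumed names (train_df / test_df are hypothetical splits of model_df), is:

```r
# Sketch: simple hold-out validation. Fit on the training data only, then
# evaluate the expected log predictive density on the held-out set.
fit_train <- update(fit_hlogit_3, newdata = train_df, recompile = FALSE)

# Matrix of pointwise log-likelihoods for the held-out observations
log_lik_test <- log_lik(fit_train, newdata = test_df)

# elpd for the held-out data (no PSIS/LOO correction needed)
elpd_holdout <- loo::elpd(log_lik_test)
```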


I found this page just now while reading another Stan Discourse thread. loo_subsample() should be way quicker and, as Aki says, 100-1000 observations should be enough, at least in a first step. I’m happy to help out here to get this working. Would it be possible for you to do a short profiling of what takes the time in loo_subsample()?

