Memory requirements to run loo.brmsfit

I have a model with an ordinal outcome (sratio family) with two crossed group effects plus a monotonic predictor and a factor covariate. The number of levels of the group effects are 112 and 491. The model has good Rhat, ESS and no warnings. The model object size is about 72 MB. I run the model with a memory allocation of 32 GB without difficulty

loo(object) yields a R crash. This is repeatable. I assume that this is a memory problem.

bprior <- prior(student_t(3, 0, 2.5), class = 'b')

fit0.brmsfit <- brm(
  formula = as.ordered(Interventions) ~ 
    mo(Risk) + Race + (1 | CensusTract) + (1 | AnonymousProvider),
    data = CuratedPONV.df %>% 
  filter(ProviderCasesCount >= 50 & CensusTractCasesCount >= 25 & Anesthesia != 'MAC'),
  family = sratio(),
  prior = bprior,
  iter = 2000,
  cores = 8,
  chains = 8
)

The data is PHI and can’t be shared.

I tried running the model with fewer iterations and received ESS warnings.

I can obtain a larger memory allocation with a batch job. Can memory requirements be calculated? How much will it take?

Is the function loo_subsample.brmsfit relevant for producing a loo object?

Nathan

Please also provide the following information in addition to your question:

  • Operating System: Centos07
  • brms Version: 2.17.0
  • R: 4.1.1

Model output:

Family: sratio
Links: mu = logit; disc = identity
Formula: as.ordered(Interventions) ~ mo(Risk) + Race + (1 | CensusTract) + (1 | AnonymousProvider)
Data: CuratedPONV.df %>% filter(ProviderCasesCount >= 50 (Number of observations: 33073)
Draws: 8 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 8000

Group-Level Effects:
~AnonymousProvider (Number of levels: 112)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.63 0.04 0.56 0.73 1.00 1079 2295

~CensusTract (Number of levels: 491)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.22 0.01 0.19 0.24 1.00 3540 5614

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -4.24 0.16 -4.54 -3.93 1.00 1515 3296
Intercept[2] -2.97 0.15 -3.27 -2.66 1.00 1496 3170
Intercept[3] -1.26 0.15 -1.56 -0.95 1.00 1468 3398
Intercept[4] 0.57 0.15 0.26 0.87 1.00 1501 3070
Intercept[5] 0.77 0.16 0.46 1.09 1.00 1658 3466
Intercept[6] 3.63 0.40 2.91 4.48 1.00 6304 4891
RaceUnknown -0.13 0.04 -0.20 -0.06 1.00 11642 6694
RaceWhite 0.22 0.03 0.16 0.27 1.00 10865 6723
moRisk -0.52 0.05 -0.63 -0.43 1.00 3378 3914

Simplex Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
moRisk1[1] 0.24 0.04 0.17 0.31 1.00 3061 4114
moRisk1[2] 0.15 0.02 0.11 0.19 1.00 4250 4384
moRisk1[3] 0.02 0.01 0.00 0.03 1.00 7957 3760
moRisk1[4] 0.03 0.01 0.02 0.04 1.00 7316 6098
moRisk1[5] 0.06 0.01 0.04 0.09 1.00 5498 5180
moRisk1[6] 0.32 0.04 0.25 0.39 1.00 3687 4151
moRisk1[7] 0.18 0.08 0.03 0.32 1.00 3281 2923

Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc 1.00 0.00 1.00 1.00 NA NA NA

Looking forward to your topic!

I was able to obtain large memory allocation. On some models it required 40 GB per core to run loo.

Answered my own question.

That’s a lot for loo. How did you call loo? Are you certain you didn’t use many parallel threads? Can you show the loo output (elpd_loo, p_loo, etc?). With that many observation you could use subsampling-LOO, see brms::loo_subsample and more details in loo::loo_subsample.

You can save memory also by using less post-warmup draws. You seem to have good ESS’s, so less would be sufficient for loo.

I have accessed to a system with lots of cores. So recently I have been using 8 or 16 cores/chains to run brms for faster speed. I have increased the number of iterations each time I get a bulk or tail ESS warning. Is there an interaction between the number of chains and the iterations that defeats this approach for getting speed and adequate ESS? The help page states that the bulk ESS should be at least 100 times the number of chains.

Initially when working with limited memory resources, I was calling loo using a single core with 4 GB per core. R crashed quickly.

With good resources, I used loo with 16 cores. I could monitor realtime memory use. It was 30 to 40 GB per core. Total memory use was 500+ GB.

The loo output for the 2.4 GB brmsfit object is:

Computed from 40000 by 33073 log-likelihood matrix

         Estimate    SE
elpd_loo -44476.4 135.8
p_loo       377.9   3.8
looic     88952.7 271.6
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

I had noticed the brms::loo_subsample function, but didn’t understand how to make it work. The examples in brms documentation show loo_subsample(model object) with no further arguments. I tried that, but didn’t get any loo object.

Nathan

EDIT by Aki: added three ticks ``` before and after the loo output for better formatting

Physical cores or virtual cores? CPUs usually have the double number of virtual cores than physical cores, but in very computational intensive tasks the virtual cores are not helpful and using all of them can decrease the total throughput.

These guidelines are not exact and it’s not possible to have exact guidelines that would apply in every case. Adding more chains is good, and I think we could change that to say total ESS>400 so that with 8 chains ESS>50 per chain would be fine. These thresholds are also quite conservative and depending on the application lower ESS values may be sufficient, see more in How many iterations to run and how many digits to report.

That 40000 is a lot, and 1000 or 4000 would likely to be sufficient, so then that mean 1-4GB per core. As Monte Carlo SE of elpd_loo is 0.1. the MCSE would be negligible even with 1000 draws used for loo.

As p_loo is about 1% of the number of observations, it’s possible that you don’t even need loo, and you could just use log score on the full data (or some other criterion).

tagging @paul.buerkner

With regard to loo_subsample I would need to see a reprex to understand what you tried and why it failed.

I am new to reprex. Apparently a reprex must include the data. This is PHI data and can’t be shared.

Here is the model object:

 fit201.brmsfit
 Family: sratio 
  Links: mu = logit; disc = identity 
Formula: as.ordered(Interventions) ~ mo(Risk) + Race + (1 | AnonymousProvider) 
   Data: tmp.df %>% filter(ProviderCasesCount >= 50 & Censu (Number of observations: 33073) 
  Draws: 8 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 8000

Group-Level Effects: 
~AnonymousProvider (Number of levels: 112) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.64      0.05     0.55     0.74 1.01      477      683

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -4.16      0.16    -4.48    -3.87 1.01      951     1901
Intercept[2]    -2.90      0.15    -3.21    -2.61 1.01      936     1742
Intercept[3]    -1.22      0.15    -1.53    -0.92 1.01      934     1957
Intercept[4]     0.57      0.15     0.26     0.87 1.01      936     2029
Intercept[5]     0.76      0.16     0.44     1.07 1.01      985     2058
Intercept[6]     3.57      0.41     2.83     4.42 1.00     4544     4857
RaceUnknown     -0.14      0.04    -0.21    -0.07 1.00     4690     5034
RaceWhite        0.25      0.03     0.19     0.31 1.00     4835     4827
moRisk          -0.53      0.05    -0.63    -0.43 1.01     1362     1302

Simplex Parameters: 
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
moRisk1[1]     0.24      0.04     0.16     0.31 1.01     1308     2249
moRisk1[2]     0.15      0.02     0.11     0.19 1.01     1824     2269
moRisk1[3]     0.02      0.01     0.00     0.03 1.00     4149     2157
moRisk1[4]     0.03      0.01     0.02     0.04 1.00     4110     4798
moRisk1[5]     0.06      0.01     0.04     0.08 1.00     2511     3533
moRisk1[6]     0.32      0.04     0.26     0.40 1.01     1563     1734
moRisk1[7]     0.19      0.08     0.03     0.32 1.01     1111     1101

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Here is the console window output when attempting to run loo_subsample:

loo_subsample(fit201.brmsfit)
Error in subsample_idxs(estimator = estimator, elpd_loo_approximation = elpd_loo_approx,  : 
  Assertion on 'elpd_loo_approximation' failed: Must be of type 'numeric', not 'character'.
In addition: Warning message:
In mclapply(X = seq_len(N), mc.cores = cores, FUN = lpd_i, llfun,  :
  all scheduled cores encountered errors in user code

Nathan

EDIT (by Aki): added ticks for code blocks for better readability

1 Like

Last time I saw this kind of error, it was due to the R object conversion messing attribute information in the posterior object. Maybe there has been a recent change in brms using posterior package and loo::loo_subsample not handling that correctly?

1 Like

I get the same error as @nlpacestan. I tried with different loo_subsample(..., loo_approximation = ...) but to no avail.

loo_subsample(mix3, cores = 4) gives:

Error in subsample_idxs(estimator = estimator, elpd_loo_approximation = elpd_loo_approx,  : 
  Assertion on 'elpd_loo_approximation' failed: Must be of type 'numeric', not 'character'.
In addition: Warning messages:
1: In parallel::mclapply(X = seq_len(N), FUN = function(i) { :
  all scheduled cores encountered errors in user code
2: In mclapply(X = seq_len(N), mc.cores = cores, FUN = lpd_i, llfun,  :
  all scheduled cores encountered errors in user code

where mix3 is computed using:

mix <- mixture(gaussian, gaussian)
twoNexpsigmaform = bf(y | cens(cen1) + trunc(lb = 1) ~ mu1 + mu2) +
  lf(mu1 ~ 1 | SUBJ) +
  lf(mu2 ~ 1 | SUBJ) +
  nlf(sigma1 ~ A*exp(B*y + C)) +
  nlf(sigma2 ~ A*exp(B*y + C)) +
  lf(A + B + C ~ 1) + 
  set_nl(nl = TRUE)

twoNexpsigmaprior = prior(prior = "dirichlet(1)", class = "theta") +
  prior(prior = "student_t(3, 32, 4.2)", class = "Intercept", 
        dpar = "mu1", lb = 0) +
  prior(prior = "student_t(3, 37, 4.2)", class = "Intercept", 
        dpar = "mu2", lb = 0) +
  prior(prior = "student_t(3, 0, 4.2)", class = "b", nlpar = "A", lb = 0) +  
  prior(prior = "student_t(3, 0, 4.2)", class = "b", nlpar = "B") +
  prior(prior = "student_t(3, 0, 4.2)", class = "b", nlpar = "C")

make_stancode(formula = twoNexpsigmaform, data = dataset, family = mix, 
              prior = twoNexpsigmaprior)

inits1 = list(ordered_Intercept = array(c(32,37)), b_A = array(3.86,1), 
              b_B = array(0.59,1), b_C = array(-22.67,1), 
              theta = array(c(0.6,0.4)), 
              sd_1 = array(1.35,1), z_1 = matrix(0.5,1,231),
              sd_2 = array(1.35,1), z_2 = matrix(0.5,1,231))
inits = list(inits1, inits1, inits1, inits1)

mix3 = brm(data = dataset,
           family = mix,
           prior = twoNexpsigmaprior,
           formula = twoNexpsigmaform,
           seed = 1,
           iter = 5000,
           warmup = 1000,
           init = inits,
           save_pars = save_pars(all = TRUE),
           chains = 4,
           thin = 1, 
           cores = getOption("mc.cores", 4),
           control = list(adapt_delta = 0.99),
           file = "mix3",
           file_refit = getOption("brms.file_refit", "always")
)

whereas for mix2 which is a non-mixture (single component) gaussian, it runs fine:

singleNexpsigmaform = bf(y | cens(cen1) + trunc(lb = 1) ~ intercept) +
  lf(intercept ~ 1 | SUBJ) +
  nlf(sigma ~ A*exp(B*y + C)) +
  lf(A + B + C ~ 1) + 
  set_nl(nl = TRUE)

singleNexpsigmaprior = prior(prior = "student_t(3, 34.5, 4.2)", class = "b", 
                                coef = "Intercept", nlpar = "intercept") +
  prior(prior = "student_t(3, 0, 4.2)", class = "sd", nlpar = "intercept", 
        lb = 0) +
  prior(prior = "student_t(3, 0, 4.2)", class = "b", nlpar = "A", lb = 0) +  
  prior(prior = "student_t(3, 0, 4.2)", class = "b", nlpar = "B") +
  prior(prior = "student_t(3, 0, 4.2)", class = "b", nlpar = "C")

make_stancode(formula = singleNexpsigmaform, data = dataset, family = gaussian(), 
              prior = singleNexpsigmaprior)

inits1 = list(b_intercept = array(39,1), b_A = array(1/35,1), 
              b_B = array(1/35,1), b_C = array(1/35,1), sd_1 = array(1,1), z_1 = matrix(0.5,1,231))
inits = list(inits1, inits1, inits1, inits1)

mix2 = brm(data = dataset,
           family = gaussian(),
           prior = singleNexpsigmaprior,
           formula = singleNexpsigmaform,
           seed = 1,
           iter = 5000,
           warmup = 1000,
           init = inits,
           save_pars = save_pars(all = TRUE),
           chains = 4,
           thin = 1, 
           cores = getOption("mc.cores", 4),
           control = list(adapt_delta = 0.99),
           file = "mix2",
           file_refit = getOption("brms.file_refit", "always")
)
loo_subsample(mix2, cores = 4)

Try running a smaller version of your subsample with just 1 core (might take a while). But this will help debug the problem. In my case, for example, turns out that I needed the extraDistr package and that message didn’t appear until I ran it only with one core:

#Check smaller subsample is able to run
loo_subsample(mix2, observations = 5, cores =1, ndraws = 3)
1 Like