brms memory limit issue when running on 15M data points

Hello,

I am trying to fit a brms model with 15M data points in the training set and another 6M in the test set. When I run the model, it throws the following error:
“Error in collapse_object(objnames, tmp, indent) : R character strings are limited to 2^31-1 bytes”.

I have also tried running the brms model on 9M data points in the training set, and that works fine.

Some stats regarding the model:

  1. Features: 26. Priors for 17 features come from a beta distribution; the rest come from a normal distribution.
  2. Total data: 21M

Code snippet:
library(arrow)  # for read_parquet()
library(rstan)
library(brms)

data <- read_parquet('/path/to/file')
# some transformations
train_size <- floor(0.7 * 21 * 10^6)
train <- data[1:train_size, ]
test <- data[(train_size + 1):nrow(data), ]  # start after the last training row

my_prior <- c(
  prior(normal(0, 1), class = 'b', nlpar = 'intercept'),
  prior(beta(16.9, 152.21), class = 'b', nlpar = 'x1', lb = 0, ub = 1),
  prior(beta(16.9, 152.21), class = 'b', nlpar = 'x2', lb = 0, ub = 1),
  prior(beta(16.9, 152.21), class = 'b', nlpar = 'x3', lb = 0, ub = 1)
  # .......... similarly for 23 more features
)

model <- brm_multiple(
    bf(y ~ intercept + x1 + x2 + x3 + ...., nl = TRUE) + lf(intercept ~ 1) +
    lf(x1 ~ 0 + x_1) + lf(x2 ~ 0 + x_2) + lf(x3 ~ 0 + x_3) + ...... for 23 more features,
    data = df_split, family = bernoulli("logit"), backend = "cmdstanr",
    threads = threading(15, grainsize = 625), prior = my_prior,
    warmup = 1000, chains = 4, cores = 12, seed = 12345,
    iter = 2000, silent = FALSE, thin = 1)

plot(model)

Can someone please help me out with this issue?

Kind Regards,

Hi,
this appears to be related to a known issue: stan_rdump fails for large arrays · Issue #595 · stan-dev/rstan · GitHub

It might work better if you use the alternative Stan interface cmdstanr (see Getting started with CmdStanR • cmdstanr for installation instructions); you can then set options(brms.backend = "cmdstanr") to have brms use it.
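In code, that setup amounts to something like the following minimal sketch (the installation command is commented out; check the linked guide for the current repository to install from):

# install.packages("cmdstanr",
#                  repos = c("https://stan-dev.r-universe.dev", getOption("repos")))
cmdstanr::install_cmdstan()          # download and build CmdStan itself
options(brms.backend = "cmdstanr")   # have brms compile and fit via cmdstanr by default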

This looks like a huuuuge model, so I hope you'll be able to make it work (I presume you've already tested that the model works on smaller datasets?).

Best of luck with your model

Thank you for the reply. Yes, I have checked my model on a small dataset, and I am already using cmdstanr, as shown in the code snippet.


Same issue here. That’s all :(

Silly Q, but is this a Stan issue or an R issue? That is, can you write your data to a .json file from R? If so, you can call cmdstanr and point it at the JSON file.

My understanding of the discussion so far, and of my own googling, is that this is an R issue.

Based on the issue linked above, it looks like writing the data to a .json file is the only option for now. (But doing so is beyond my capability; I'll have to do some more poking around to learn how…)

If you can find a way to write the JSON data to a file, I think you should be able to use cmdstanr pointing to the JSON file. A sketch of what that might look like is below.
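For instance, a minimal sketch under some assumptions: bform is a hypothetical name standing in for the non-linear brms formula from the original post, and my_prior is the prior object defined there. It uses brms's make_stancode()/make_standata() to get the Stan program and data, and cmdstanr's write_stan_json() to write the data to disk:

library(brms)
library(cmdstanr)

# Generate the Stan program and data that brms would build internally.
scode <- make_stancode(bform, data = train, family = bernoulli("logit"),
                       prior = my_prior)
sdata <- make_standata(bform, data = train, family = bernoulli("logit"),
                       prior = my_prior)

# Write the data to a JSON file on disk.
write_stan_json(as.list(sdata), "train_data.json")

# Compile and sample with cmdstanr, pointing at the JSON file.
mod <- cmdstan_model(write_stan_file(scode))
fit <- mod$sample(data = "train_data.json", chains = 4,
                  iter_warmup = 1000, iter_sampling = 1000, seed = 12345)

The point is that $sample() accepts a path to a JSON file, so sampling itself no longer requires serializing the data inside R. If write_stan_json() hits the same string-size limit on data this large, the JSON may need to be written incrementally with another tool.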


I’ll try and figure this out. I’ve only used cmdstanr as a backend with brms for the within-chain parallelization. We’ll see how it goes…

I see that you're using a Bernoulli likelihood. Assuming that some of your 15 million rows share the same feature values, could you reduce the data to only the distinct combinations of features and then model it using a binomial likelihood instead? Your inferences and all the parameter estimates would be the same, but you'd have far fewer rows of data to deal with. A sketch of the idea is below.
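Something like this minimal sketch with dplyr, assuming feature columns named x_1, x_2, ... and a 0/1 outcome y as in the original snippet (the plain linear formula stands in for the full non-linear one):

library(dplyr)
library(brms)

# Collapse rows with identical feature values into binomial counts.
train_agg <- train %>%
  group_by(across(starts_with("x_"))) %>%
  summarise(successes = sum(y),  # number of 1s for this feature combination
            trials    = n(),     # number of rows with this combination
            .groups = "drop")

# Same model, expressed as binomial counts instead of row-wise Bernoulli.
fit <- brm(successes | trials(trials) ~ x_1 + x_2 + x_3,
           data = train_agg, family = binomial("logit"),
           backend = "cmdstanr")

This only helps if the features take a limited number of distinct values, of course; with continuous features, nearly every row may be unique.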


Hello @jackbailey,

If I understand you correctly, you are talking about the columns in the data. All 26 features are distinct in my case. The data is definitely sparse, and I have already tried using the sparse argument to check whether that would let me use the entire training dataset, but to no avail.