brms memory limit issue when running on 15M data points

Hello,

I am trying to fit a brms model with 15M data points in the training set and another 6M in the test set. When I run the model, it throws the following error:
“Error in collapse_object(objnames, tmp, indent) : R character strings are limited to 2^31-1 bytes”.

I have also tried running the brms model on 9M data points in the training set, and that works fine.

Some stats regarding the model:

  1. Features: 26. Priors for 17 features come from a beta distribution; the rest come from a normal distribution.
  2. Total data: 21M

Code snippet:
library(arrow)  # for read_parquet()
library(rstan)
library(brms)

data <- read_parquet('/path/to/file')
# some transformations
train_size <- floor(0.7 * 21 * 10^6)
train <- data[1:train_size, ]
test <- data[(train_size + 1):nrow(data), ]  # start after the last training row

my_prior <- c(
  prior(normal(0, 1), class = 'b', nlpar = 'intercept'),
  prior(beta(16.9, 152.21), class = 'b', nlpar = 'x1', lb = 0, ub = 1),
  prior(beta(16.9, 152.21), class = 'b', nlpar = 'x2', lb = 0, ub = 1),
  prior(beta(16.9, 152.21), class = 'b', nlpar = 'x3', lb = 0, ub = 1)
  # .......... similarly for 23 more features
)

model <- brm_multiple(
    bf(y ~ intercept + x1 + x2 + x3 + ...., nl = TRUE) + lf(intercept ~ 1) +
    lf(x1 ~ 0 + x_1) + lf(x2 ~ 0 + x_2) + lf(x3 ~ 0 + x_3) + ...... for 23 more features,
    data = df_split, family = bernoulli("logit"), backend = "cmdstanr",
    threads = threading(15, grainsize = 625), prior = my_prior,
    warmup = 1000, chains = 4, cores = 12, seed = 12345,
    iter = 2000, silent = FALSE, thin = 1)

plot(model)

Can someone please help me out with this issue?

Kind Regards,

Hi,
this appears to be related to a known issue: stan_rdump fails for large arrays · Issue #595 · stan-dev/rstan · GitHub

It might work better if you use the alternative Stan interface cmdstanr (see Getting started with CmdStanR • cmdstanr for installation instructions); you can then set options(brms.backend = "cmdstanr") to have brms use it.
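In code, that setup amounts to something like the following minimal sketch (the installation command is commented out; check the linked guide for the current repository to install from):

# install.packages("cmdstanr",
#                  repos = c("https://stan-dev.r-universe.dev", getOption("repos")))
cmdstanr::install_cmdstan()          # download and build CmdStan itself
options(brms.backend = "cmdstanr")   # have brms compile and fit via cmdstanr by default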

This looks like a huuuuge model, so I hope you'll be able to make it work (I presume you've already tested that the model works on smaller datasets?).

Best of luck with your model

Thank you for the reply. Yes, I have checked my model on a small dataset, and I am already using cmdstanr, as shown in the code snippet.


Same issue here. That’s all :(

Silly Q, but is this a Stan issue or an R issue? That is, can you write your data to a .json file from R? If so, you can call cmdstanr and point it at the JSON file.

My understanding of the discussion so far, and of my own googling, is that this is an R issue.

Based on the issue linked above, it looks like writing the data to a .json file is the only option for now. (But doing so is beyond my capability; I'll have to do some more poking around to learn how…)

If you can find a way to write the JSON data to a file, I think you should be able to use cmdstanr pointing to the JSON file. A sketch of what that might look like is below.
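For instance, a minimal sketch under some assumptions: bform is a hypothetical name standing in for the non-linear brms formula from the original post, and my_prior is the prior object defined there. It uses brms's make_stancode()/make_standata() to get the Stan program and data, and cmdstanr's write_stan_json() to write the data to disk:

library(brms)
library(cmdstanr)

# Generate the Stan program and data that brms would build internally.
scode <- make_stancode(bform, data = train, family = bernoulli("logit"),
                       prior = my_prior)
sdata <- make_standata(bform, data = train, family = bernoulli("logit"),
                       prior = my_prior)

# Write the data to a JSON file on disk.
write_stan_json(as.list(sdata), "train_data.json")

# Compile and sample with cmdstanr, pointing at the JSON file.
mod <- cmdstan_model(write_stan_file(scode))
fit <- mod$sample(data = "train_data.json", chains = 4,
                  iter_warmup = 1000, iter_sampling = 1000, seed = 12345)

The point is that $sample() accepts a path to a JSON file, so sampling itself no longer requires serializing the data inside R. If write_stan_json() hits the same string-size limit on data this large, the JSON may need to be written incrementally with another tool.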


I’ll try and figure this out. I’ve only used cmdstanr as a backend with brms for the within-chain parallelization. We’ll see how it goes…

I see that you're using a Bernoulli likelihood. Assuming that some of your 15 million rows share the same feature values, could you reduce the data to only the distinct combinations of features and then model it using a binomial likelihood instead? Your inferences and all the parameter estimates would be the same, but you'd have far fewer rows of data to deal with. A sketch of the idea is below.
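Something like this minimal sketch with dplyr, assuming feature columns named x_1, x_2, ... and a 0/1 outcome y as in the original snippet (the plain linear formula stands in for the full non-linear one):

library(dplyr)
library(brms)

# Collapse rows with identical feature values into binomial counts.
train_agg <- train %>%
  group_by(across(starts_with("x_"))) %>%
  summarise(successes = sum(y),  # number of 1s for this feature combination
            trials    = n(),     # number of rows with this combination
            .groups = "drop")

# Same model, expressed as binomial counts instead of row-wise Bernoulli.
fit <- brm(successes | trials(trials) ~ x_1 + x_2 + x_3,
           data = train_agg, family = binomial("logit"),
           backend = "cmdstanr")

This only helps if the features take a limited number of distinct values, of course; with continuous features, nearly every row may be unique.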


Hello @jackbailey,

If I understand you correctly, you are talking about the columns in the data. All 26 features are distinct in my case. The data is definitely sparse, and I have already tried using the sparse argument to check whether that would let me use the entire training dataset, but to no avail.