Why does serialization of rvars take so much disk space?

Using tidybayes::add_epred_rvars() is super convenient, as @wds15 and @mjskay were discussing. Love it!

Unfortunately, the efficiency seems to take a huge hit when writing the result to disk. Compared to using add_epred_draws(), the file is about 1000 times larger in the following example. I’m wondering why this is the case. I’d love to keep using rvars in my workflow, but I can’t afford to write out several GB of data for each set of draws in my actual work!

Here’s a little example (don’t worry, it won’t use more than 100 MB, and it cleans up after itself). Any insights? Thanks in advance for your time and attention …

library(posterior)
library(tidybayes)
library(brms)
library(pryr)

# Adjust to suit your preferred backend
fit_logistic <- function (...) {
  fit <- brm(..., backend = "cmdstanr", family = bernoulli())
  return(fit)
}

# Simulate data and fit
rbernoulli <- function (n, p = 0.5) runif(n) > (1 - p)
obs <- data.frame(Y = rbernoulli(1000))
fit <- fit_logistic(Y ~ 1, data = obs)

#'
#' Helper function
#'
#' - Report "in-memory" size, using `pryr::object_size()`
#' - Write to tempfile using `saveRDS(..., compress = FALSE)`
#' - Report size on disk, using `file.size()`
#' - Clean up
#'
report_kB <- function (obj) {
  format_kB <- function (x) paste0(round(x / 2^10, 1), " kB")
  tmpfn <- tempfile(fileext = ".rds")
  saveRDS(obj, tmpfn, compress = FALSE)
  message("object_size(): ", format_kB(pryr::object_size(obj)))
  message("file.size():   ", format_kB(file.size(tmpfn)))
  invisible(file.remove(tmpfn))
}

#'
#' Using `add_epred_draws()`
#'
#' - Take the first 1, 10, 100, or 1000 rows
#' - Ranges from about 1 to 30 kB on disk
#'
draws_dat <- add_epred_draws(obs, fit, ndraws = 100)
report_kB(head(draws_dat, 1))  # only 0.7 kB on disk
report_kB(head(draws_dat, 10))
report_kB(head(draws_dat, 100))
report_kB(head(draws_dat, 1000))

#'
#' Using `add_epred_rvars()`
#'
#' - Same as above
#' - Except the sizes are much, much larger --- why?
#'
rvars_dat <- add_epred_rvars(obs, fit, ndraws = 100)
report_kB(head(rvars_dat, 1))   # already 780 kB on disk (compare to 0.7 kB for `draws_dat`)
report_kB(head(rvars_dat, 10))  # not much bigger in mem, but 10x larger on disk (7.8 MB)
report_kB(head(rvars_dat, 100)) # not much bigger in mem, but 100x larger on disk (78 MB)
#report_kB(head(rvars_dat, 1000)) # only run if you have 1 GB free on disk!

P.S. I was a bit surprised to see that the in-memory sizes for rvars_dat slices don’t scale linearly with the number of rows. They do scale linearly on disk. It makes me wonder if something that is “by reference” in memory is being written multiple times to disk.
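To make that hunch concrete, here is a toy sketch in base R. It does not use rvar internals; `make_cache()` is a hypothetical helper that mimics the suspected pattern of several small environments all pointing at the same large vector. As far as I understand it, R's serializer only de-duplicates reference objects such as environments, not ordinary vectors, so a vector that is shared in memory is written once for every place it appears.

```r
# Toy illustration only: this mimics the suspected pattern, not rvar internals.
big <- rnorm(1e5)  # about 800 kB of doubles

# Hypothetical helper: wrap a value in a fresh environment, like a small cache.
make_cache <- function(value) {
  env <- new.env(parent = emptyenv())
  env$value <- value  # binds the same vector; no copy is made here
  env
}

one <- list(make_cache(big))
ten <- lapply(1:10, function(i) make_cache(big))

# serialize(x, NULL) returns the raw bytes, so its length is the serialized size
size_one <- length(serialize(one, NULL))
size_ten <- length(serialize(ten, NULL))

# In memory, all ten environments share `big`, but each serialized
# environment carries its own copy, so this ratio is close to 10.
size_ten / size_one
```

If something similar happens inside the sliced rvars, it would explain sizes that barely grow in memory but grow linearly on disk.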

@mjskay

This is probably due to the same problem I had to solve for the rmsb package, where a function is stored in the fit object. When serializing to, say, .rds files, R also stores the environment of the function. I had to save a character string form of the function in the fit object to get rid of the environment, and I parse the string every time I need to operate on the fit object’s function.
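A minimal sketch of that idea in base R, for anyone who wants to see the effect. This is not the actual rmsb code; `make_fit()` and the object names are made up for illustration. The closure drags along the environment it was created in (including a large vector), while the deparsed string does not.

```r
# Hypothetical "fit" whose stored function captures a large local object
make_fit <- function() {
  big <- rnorm(1e5)        # lives in the environment enclosing f
  f <- function(x) x + 1   # f itself never uses `big`
  list(f = f)
}
fit_fn <- make_fit()

# Serializing the closure also serializes its enclosing environment
size_closure <- length(serialize(fit_fn, NULL))

# Store the function as text instead; no environment comes along
fit_chr <- list(f = deparse(fit_fn$f))
size_string <- length(serialize(fit_chr, NULL))

# Rebuild the function from the string whenever it is needed
f2 <- eval(parse(text = fit_chr$f))
f2(1)

c(closure = size_closure, string = size_string)
```

The trade-off is that the rebuilt function gets a fresh environment, so this only works when the function does not rely on variables captured from where it was defined.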

Thanks for the tip @harrelfe. If that’s the case, refhooks might suffice for my case. I’m saving the fit objects separately.

Happy to file an issue at the GitHub repo for the relevant package, if appropriate. Would that be posterior, tidybayes, or something else? Cc @mjskay

@harrelfe is correct: this is an issue related to an environment stored in the rvar object that contains a cache used to speed up some operations. It’s been on my list for a little while, and I hope to get around to documenting it and providing an easy solution sometime this summer. You can see more at this GitHub issue, which also shows how to use a ref hook to get around it in the meantime.

I never knew about refhooks. But it looks to be a lot more complicated than just storing a character string version of the function.

Yeah, I should be clear: the problem here isn’t that we’re storing a function, it’s that we’re storing an environment used to implement a caching scheme to get around some performance issues with {vctrs}, as explained in the linked issue. For anyone who cares, see more discussion on the corresponding vctrs issue (though that references an older implementation of the caching scheme, it’s a similar idea).

The short-term solution for serialization is to use a ref hook or to invalidate the cache manually. In the medium term, I plan to implement a nice way to do the cache invalidation recursively on objects containing rvars (avoiding the need for refhooks). In the long term, either I find a different internal structure for rvars that still works with vctrs but avoids caching, or I convince the vctrs folks to fix this issue.
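For readers who have not used ref hooks before, here is a rough base R sketch of the mechanism on a toy object. It is not the hook from the linked issue and it does not touch real rvars: `make_cached()`, the `is_cache` marker, and both hook functions are assumptions made up for this example. The `refhook` argument itself is part of `serialize()`/`unserialize()` (and `saveRDS()`/`readRDS()`): the write hook returns a character vector for references it wants to replace and NULL otherwise, and the read hook turns that character vector back into an object.

```r
# Toy object: a small payload plus a cache environment holding something large
make_cached <- function() {
  cache <- new.env(parent = emptyenv())
  cache$is_cache <- TRUE   # hypothetical marker so the hook can recognize it
  cache$big <- rnorm(1e5)
  structure(list(value = 1:3), cache = cache)
}
obj <- make_cached()

# Write hook: replace marked cache environments with a short string
drop_cache <- function(ref) {
  if (is.environment(ref) && isTRUE(ref$is_cache)) "cache" else NULL
}

# Read hook: hand back a fresh, empty environment in place of the string
restore_cache <- function(key) new.env(parent = emptyenv())

raw_full <- serialize(obj, NULL)
raw_slim <- serialize(obj, NULL, refhook = drop_cache)
obj2 <- unserialize(raw_slim, refhook = restore_cache)

c(full = length(raw_full), slim = length(raw_slim))
```

The same two hooks could be passed as `saveRDS(obj, path, refhook = drop_cache)` and `readRDS(path, refhook = restore_cache)`. Whether an empty cache is safe after reading depends on the object; here the toy cache is assumed to be rebuildable on demand.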
