Using tidybayes::add_epred_rvars() is super convenient, as @wds15 and @mjskay were discussing. Love it!

Unfortunately, efficiency seems to take a huge hit when writing the result to disk: compared to add_epred_draws(), we're talking about a factor of 1000 in the example below. I'm wondering why this is the case. I'd love to keep using rvars in my workflow, but I can't afford to write out several GB of data for each set of draws in my actual work!
Here’s a little example (don’t worry, it won’t use more than 100 MB, and it cleans up after itself). Any insights? Thanks in advance for your time and attention …
library(posterior)
library(tidybayes)
library(brms)
library(pryr)
# Adjust to suit your preferred backend
fit_logistic <- function (...) {
  fit <- brm(..., backend = "cmdstanr", family = bernoulli())
  return(fit)
}
# Simulate data and fit
rbernoulli <- function (n, p = 0.5) runif(n) > (1 - p)
obs <- data.frame(Y = rbernoulli(1000))
fit <- fit_logistic(Y ~ 1, data = obs)
#'
#' Helper function
#'
#' - Report "in-memory" size, using `pryr::object_size()`
#' - Write to tempfile using `saveRDS(..., compress = FALSE)`
#' - Report size on disk, using `file.size()`
#' - Clean up
#'
report_kB <- function (obj) {
  format_kB <- function (x) paste0(round(x / 2^10, 1), " kB")
  tmpfn <- tempfile(fileext = ".rds")
  saveRDS(obj, tmpfn, compress = FALSE)
  message("object_size(): ", format_kB(pryr::object_size(obj)))
  message("file.size(): ", format_kB(file.size(tmpfn)))
  invisible(file.remove(tmpfn))
}
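#'
#' (A rough baseline I'm adding for scale, not essential: a plain 100 x 1000
#' matrix of doubles, i.e. the same shape as 100 draws for the 1000 observations
#' above, comes to roughly 780 kB both in memory and on disk.)
#'
report_kB(matrix(rnorm(100 * 1000), nrow = 100))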
#'
#' Using `add_epred_draws()`
#'
#' - Take the first 1, 10, 100, or 1000 rows
#' - Ranges from about 1 to 30 kB on disk
#'
draws_dat <- add_epred_draws(obs, fit, ndraws = 100)
report_kB(head(draws_dat, 1)) # only 0.7 kB on disk
report_kB(head(draws_dat, 10))
report_kB(head(draws_dat, 100))
report_kB(head(draws_dat, 1000))
#'
#' Using `add_epred_rvars()`
#'
#' - Same as above
#' - Except the sizes are much, much larger --- why?
#'
rvars_dat <- add_epred_rvars(obs, fit, ndraws = 100)
report_kB(head(rvars_dat, 1)) # already 780 kB on disk (compare to 0.7 kB for `draws_dat`)
report_kB(head(rvars_dat, 10)) # not much bigger in mem, but 10x larger on disk (7.8 MB)
report_kB(head(rvars_dat, 100)) # not much bigger in mem, but 100x larger on disk (78 MB)
#report_kB(head(rvars_dat, 1000)) # only run if you have 1 GB free on disk!
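#'
#' Trying to narrow it down (just a guess at a useful diagnostic): serialize
#' the `.epred` rvar column on its own versus the rest of the tibble, to see
#' which part carries the extra bytes.
#'
one_row <- head(rvars_dat, 1)
report_kB(one_row$.epred)                              # the single rvar element
report_kB(one_row[setdiff(names(one_row), ".epred")])  # everything else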
P.S. I was a bit surprised to see that the in-memory sizes for the rvars_dat slices don't scale linearly with the number of rows; they do scale linearly on disk. It makes me wonder if something that is "by reference" in memory is being written multiple times to disk.
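In case it's useful, here's a sketch of how I'd start poking at that hunch (draws_of() is from posterior; the serialize() call just measures the RDS payload in memory instead of on disk):

x <- head(rvars_dat, 1)$.epred
dim(posterior::draws_of(x))        # how many draws does a one-row slice actually keep?
length(serialize(x, NULL)) / 2^10  # serialized size in kB, no disk involved
pryr::object_size(x)               # in-memory size, for comparison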