Parallelized loading of CSV files

I have 20 CSV files that each take around 15 minutes to load into R, which adds up to 5 hours when done sequentially with either as_cmdstan_fit() or read_cmdstan_csv(). Since 5 hours is really a lot, I’ve parallelized the process like this:

read_single_file <- function(file) read_cmdstan_csv(file, variables = c("beta", "sigma", "gamma"))$post_warmup_draws
results <- parLapply(cl, file_names, read_single_file)
fit <- do.call(function(...) bind_draws(..., along = "chain"), results)

which also allows for reading in only a subset of parameters via read_cmdstan_csv’s variables argument. Because the majority of parameters in my model are latent variables that are not of interest, this also reduces the computational load greatly.
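For completeness, here is a minimal sketch of the cluster setup that the snippet above leaves out. The helper name read_draws_parallel and the default of two workers are hypothetical choices for illustration, and the calls are namespace-qualified on the assumption that cmdstanr and posterior are installed on the machine but not attached on the workers.

```r
library(parallel)

# Worker function, defined at top level so that only the function itself
# (and not a large enclosing environment) is shipped to the workers.
read_single_file <- function(file, variables) {
  cmdstanr::read_cmdstan_csv(file, variables = variables)$post_warmup_draws
}

# Hypothetical helper: read the selected variables from several Stan CSV
# files in parallel and bind the results into one draws object.
read_draws_parallel <- function(file_names, variables, n_workers = 2L) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)  # release the workers even on error
  results <- parLapply(cl, file_names, read_single_file, variables = variables)
  # Each file is treated as one chain
  do.call(function(...) posterior::bind_draws(..., along = "chain"), results)
}
```

It would be called as read_draws_parallel(file_names, c("beta", "sigma", "gamma")). The result is a draws object, not a CmdStanMCMC object.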

However, I have not been able to instantiate a CmdStanMCMC object (i.e. the sort of object that cmdstanr returns immediately after the estimation is done), which is required by some methods in bayesplot. Is there a way to either
(a) coerce the return value of read_cmdstan_csv() into a CmdStanMCMC object, or
(b) using as_cmdstan_fit(), select a subset of parameters to be loaded, and ideally also parallelize the loading process?

What particular bayesplot methods are you trying to use? It may be easier to address the problem from that end.

Incidentally, I’ve found solid success in improving Stan CSV reading speed by using data.table::fread(). Here is some code I use to read in just the true parameters and the inverse metric for the purpose of initializing future runs.

# requires dplyr, purrr, stringr, readr (attached), plus data.table and posterior (installed)

#' @param working_csv a single stan csv file
#' @param compiled_model a CmdStanModel
#' @param threads number of threads to use with fread.
init_csv_read = \(working_csv,  compiled_model, threads = parallel::detectCores()) {
  
  # Get the names of the real parameters (not transformed or GQ)
  parms = compiled_model$variables()$parameters |> names()
  # Get column names
  all_cols = read_lines(working_csv, n_max = 51) |>
    # The uncommented line should be the column names
    str_subset("^#", negate = TRUE) |> str_split(',') |> unlist() 
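  # Number of draws: last four characters of line 8 (assumes that line is num_samples with a four-digit value)
  n_samples = read_lines(working_csv, skip = 7, n_max = 1) |> str_sub(-4L) |> as.integer()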
  # Create a vector of indices indicating which all_cols match the parms:
  col_idx = map(parms, \(p) all_cols |> str_detect(paste0('^', p))) |> reduce(`|`)   |> which()
  col_names = all_cols[col_idx]
  # Draws are assumed to start on line 53; this offset is hard-coded for this CSV layout
  n_skip = 52L
  samples = data.table::fread(working_csv, sep = ',', header = FALSE, nrows = n_samples, 
                              na.strings = '', skip = n_skip, col.names = col_names,
                              select = col_idx, colClasses = 'double', # col_classes,
                              data.table = FALSE, nThread = threads) |>
    mutate(.iteration = 1:n(), .chain = 1) |> mutate(.draw = 1:n()) |> posterior::as_draws_df()
  # Inverse metric: the comment line just before the draws (line 52)
  inv_metric = working_csv |> read_lines(skip = 51, n_max = 1) |> str_remove('# ') |>
    str_split(', ') |> unlist() |> as.numeric()
  list(samples = samples, inv_metric = inv_metric)
}

You could probably tweak it to make parms an argument and it will likely read things a bit faster than the default approach.
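As a rough sketch of that tweak, the column-selection step could take the parameter names as an argument. The helper name match_stan_columns and the example column names are hypothetical. The sketch assumes Stan CSV headers of the form name or name.i.j, and it matches whole parameter names, so a request for beta does not also pick up beta_raw.

```r
# Hypothetical helper: indices of the CSV columns that belong to the
# requested parameters, matching either "name" or "name.<index>".
match_stan_columns <- function(all_cols, parms) {
  pattern <- paste0("^(", paste(parms, collapse = "|"), ")(\\.|$)")
  grep(pattern, all_cols)
}

# Made-up header for illustration
all_cols <- c("lp__", "beta.1", "beta.2", "beta_raw.1", "sigma", "gamma.1.1")
col_idx <- match_stan_columns(all_cols, c("beta", "sigma"))
all_cols[col_idx]
```

The resulting col_idx could then be passed to the select argument of fread(), as in the function above.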


Basically anything that needs access to metadata or diagnostics information. There’s always a way to add that to the frankensteined fit object that you and I have constructed manually, so that it works like the original one, but I had hoped there would be an easy way to instantiate the original object, to avoid doing that every time you pick a different visualization in bayesplot.

Your code looks well written, I’ll try to give it a go :)

Looking at the source code for read_cmdstan_csv(), they already use fread() for data input, though they haven’t enabled threading with it (which may be worth requesting). You may be able to enable it by using data.table::setDTthreads(). I would try this first; it may even let you use as_cmdstan_fit(file_names) directly.
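A minimal sketch of the setDTthreads() idea follows. The thread count of 4 is an arbitrary assumption, and whether it actually speeds up the cmdstanr readers depends on how they call fread(), which I have not verified.

```r
library(data.table)

# setDTthreads() returns the previous setting invisibly, so it can be restored later
old_threads <- setDTthreads(4L)
getDTthreads()  # the number of threads data.table will now use

# Assumed usage, left commented out because it needs the actual CSV files:
# fit <- cmdstanr::as_cmdstan_fit(file_names)

setDTthreads(old_threads)  # restore the previous setting
```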

The other thing to note is that it looks like you could do something like imported_stan_draws |> cmdstanr:::CmdStanMCMC_CSV$new(files = file_names, check_diagnostics = TRUE) to convert your imported CSV draws into a cmdstan fit.
