Importing large CmdStan CSV files to R

Hello everyone,

I’m estimating large Stan models where I only care about a small subset of the parameters. I’m using CmdStan for the estimation, saving the chains as CSV files, and doing the post-processing in R.

I’m starting to run into memory problems, as each MCMC chain is several GB, and hence I don’t have enough memory to consolidate them directly with read_stan_csv.

Is there a simple way to import one chain at a time to R with read_stan_csv, subset the chain to the relevant parameters, and finally combine the subsetted chains using sflist2stanfit?

thanks,
Ole-Petter

My best guess so far is to import the CSV files one at a time and convert them to mcmc.lists. The As.mcmc.list function takes a pars argument, so I can strip off the uninteresting parameters.

Thereafter I use as.mcmc.list from the coda package to combine the chains. I can then convert this to a shinystan object. E.g.:

library(rstan); library(coda); library(shinystan)
mcmc1 <- As.mcmc.list(read_stan_csv("cmdstan_output_1.csv"), pars = pars)
mcmc2 <- As.mcmc.list(read_stan_csv("cmdstan_output_2.csv"), pars = pars)
mcmc <- as.shinystan(as.mcmc.list(c(mcmc1, mcmc2)))

This works OK, but I lose the diagnostics etc. that I would have if I could combine them directly into a shinystan object. I would be happy to hear if anyone has better ideas.

best,
Ole-Petter

Unix is your friend.

> awk -F, -v col_start=1 -v col_final=10 '!/^#/ { for (n = col_start; n < col_final; ++n) printf("%s,", $n); printf("\n") }' output.csv

It’s the one command-line tool I never had to learn to use much…

I do agree with @betanalpha, Unix is your friend.

I have written some tips concerning bash & CSV files:


Maybe it can help.

  • Vincent

Thanks to both of you for helpful comments. I’m using Windows, but I see this might complicate my life more than necessary…

… Rtools on Windows also brings you gawk. Since you can compile Stan models on Windows, you must have a working Rtools… In R you could try

system("gawk …")

I am assuming that the paths are set up correctly in R (I haven’t tried, but it should be easy to fix).
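
A minimal sketch of that idea, assuming gawk is on the PATH; the file names and the column range 1–10 are hypothetical. Writing the awk program to a file sidesteps cmd.exe quoting issues, and shell() is the Windows way to get the redirection (on Linux, system() does the same job):

# Hypothetical sketch: keep columns 1-10 of a CmdStan CSV, dropping '#' comment lines.
# Writing the awk program to a file avoids cmd.exe quoting headaches on Windows.
writeLines('!/^#/ { for (n = 1; n <= 10; ++n) printf("%s%s", $n, n < 10 ? "," : "\\n") }',
           "subset.awk")
# shell() routes through cmd.exe, so the redirection works (use system() on Linux)
shell("gawk -F, -f subset.awk cmdstan_output_1.csv > cmdstan_output_1_subset.csv")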

Cygwin is your friend on Windows. But it might not help if you don’t know Unix commands.

You didn’t share your Stan program, but if the space is taken up by transformed parameters, make them local variables in the model block.

Thanks. Unfortunately the parameters I want to drop are declared in the parameters block. I’m considering switching to Linux though; it can’t be that much worse than R :-)

It would, however, be great if you would consider adding a pars argument to CmdStan, as in RStan.

We’ll do that eventually. I just added an issue:

The tidyverse contains a CSV reader that allows you to select which columns to import. Maybe this can be a fairly simple way for R users to extract variables from large CSV files.

I worked around this issue and haven’t tried it myself, so I’ll just leave the suggestion here for posterity.
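
For instance, a sketch with readr (the column names here are hypothetical placeholders):

library(readr)

# Read only the named columns; cols_only() skips everything else.
# The comment argument drops CmdStan's '#' metadata lines.
post <- read_csv("cmdstan_output_1.csv",
                 comment = "#",
                 col_types = cols_only(lp__ = col_double(),
                                       beta.1 = col_double(),
                                       beta.2 = col_double()))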


It’s good to know R has tools for trimming the draws after the fact. Alas, the typical problem case is that saving all the draws takes too much memory or file space in the first place.

What scale is that at? I’m saving half a million parameters for a model I’m developing with CmdStan + R on a laptop. I wouldn’t do full runs there due to disk space, but if I split the header out with sed and read the file in with data.table::fread, it’s really fast and convenient. It’s a good laptop, but nothing amazing.

It’s 8 bytes per parameter per draw in binary. So each iteration at 5e5 parameters is 4e6 bytes (4 MB). If you take 1000 draws, that’s 4e9 bytes (4 GB). Then if you do four chains, that’s 1.6e10 bytes (16 GB), and most laptops break if you try to hold all of that in R. The problem is usually 50K parameters for 100K iterations, because people have it in mind that you need a lot of iterations, as a holdover from the slow mixing of Gibbs and Metropolis.
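
For reference, the same arithmetic as a quick R snippet:

# 8 bytes per double, 5e5 parameters, 1000 draws, 4 chains
bytes_per_value <- 8
n_params <- 5e5
n_draws  <- 1000
n_chains <- 4
(bytes_per_value * n_params * n_draws * n_chains) / 1e9  # 16 GB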

Yeah, that makes sense, but for 90% of model development you can stick to a laptop even if the model is huge; by the time you’re saving 16 GB you should be well past checking convergence, etc. I keep wanting to make a flowchart for people (but haven’t gotten around to it, obviously).

Indeed, a flowchart might help. Anything that can convey that it’s a bad idea to take a gazillion draws. It would be great if we could help users avoid trying to generate 100K draws off the bat and then waiting a week.


… followed by adding “Stan was too slow so I wrote my own Gibbs sampler” to their talk…


It is simple, and there are probably other ways to do it, but after fighting for an hour with fread, read_csv, and command-line tools, the best option that worked for me was to create the following R function:

# file: name of the CSV file for the chain (without the .csv ending)
# vars: the variables (columns) you want to select from the draws
select_vars_cmdstan <- function(file, vars) {
  uncommented <- paste0(file, "_uncommented.csv")

  # Strip CmdStan's '#' comment lines so fread sees a plain CSV
  system(paste0("sed -e '/^#/d' ", file, ".csv > ", uncommented))

  # Read only the selected columns
  post <- data.table::fread(uncommented, select = vars)

  # Overwrite the original file with the subsetted draws
  readr::write_csv(post, paste0(file, ".csv"))
  file.remove(uncommented)
}
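
For example, a hypothetical call (the file and variable names are just placeholders):

select_vars_cmdstan("cmdstan_output_1", c("lp__", "beta.1", "beta.2"))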

The only problem with this is that it effectively overwrites the file, removing all the important metadata such as the inverse mass matrix.
For me it is what I need, because each chain can produce a ~10 GB file, and with my current quota I cannot afford to store all the chains with all the individual parameters.

What effective sample size are you targeting, and how many parameters are there? 10 GB is roughly a billion numbers, which would be 1K draws of 1M parameters.

In RStan, you can prefilter which named variables to save (parameters, transformed parameters, generated quantities).
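
A sketch of that, assuming an RStan workflow (the model file, data list, and parameter name beta are hypothetical):

library(rstan)

# Keep only `beta` in the returned stanfit;
# include = FALSE would drop the listed pars instead
fit <- sampling(stan_model("model.stan"),
                data = stan_data,
                pars = c("beta"),
                include = TRUE)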

The vroom package looks like it may be helpful; it has a similar interface to readr::read_csv(), but is supposedly quite a bit faster than data.table::fread(). It also supports only reading certain columns in, with the same interface as dplyr::select().
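
For example, a sketch (the column names are hypothetical):

library(vroom)

# col_select takes tidyselect helpers, so columns can be picked by name pattern;
# comment = "#" skips CmdStan's metadata lines
post <- vroom("cmdstan_output_1.csv",
              comment = "#",
              col_select = c(lp__, starts_with("beta")))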
