Importing large CmdStan CSV files to R

Hello everyone,

I’m estimating large Stan models where I only care about a small subset of the parameters. I’m using CmdStan for the estimation, saving the chains as CSV files, and doing the post-processing in R.

I’m starting to run into memory problems, as each MCMC chain is several GB, and hence I don’t have enough memory to consolidate them directly with read_stan_csv.

Is there a simple way to import one chain at a time to R with read_stan_csv, subset the chain to the relevant parameters, and finally combine the subsetted chains using sflist2stanfit?

thanks,
Ole-Petter

My best guess so far is to import the CSV files one at a time and convert them to mcmc.lists. The As.mcmc.list function takes a pars argument, so I can strip off the uninteresting parameters.

Thereafter I use as.mcmc.list from the coda package to combine the chains. I can then convert this to a shinystan object. E.g.:

library(rstan); library(coda); library(shinystan)
mcmc1 <- As.mcmc.list(read_stan_csv("cmdstan_output_1.csv"), pars = pars)
mcmc2 <- As.mcmc.list(read_stan_csv("cmdstan_output_2.csv"), pars = pars)
mcmc <- as.shinystan(as.mcmc.list(c(mcmc1, mcmc2)))

This works OK, but I lose the diagnostics etc. that I would have if I could combine them directly into a shinystan object. I would be happy to hear if anyone has better ideas.

best,
Ole-Petter

Unix is your friend.

> awk -F, -v col_start=1 -v col_final=10 '!/^#/ { for (n = col_start; n < col_final; ++n) printf("%s,", $n); printf("\n") }' output.csv

It’s the one command-line tool I never had to learn to use much…

I do agree with @betanalpha, Unix is your friend.

I have written some tips concerning bash & CSV files:


Maybe it can help.

  • Vincent

Thanks to both of you for helpful comments. I’m using Windows, but I see this might complicate my life more than necessary…

… Rtools on Windows also brings you gawk. Since you can compile Stan models on Windows, you must have a working Rtools… In R you could try

system("gawk …")

I am assuming that the paths are set up correctly in R (I haven’t tried, but it should be easy to fix).
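
A minimal sketch of that idea, assuming gawk is on the PATH; the file names and the column range 1–10 are hypothetical. Writing the awk program to a file sidesteps cmd.exe quoting issues, and shell() is the Windows way to get the redirection (on Linux, system() does the same job):

# Hypothetical sketch: keep columns 1-10 of a CmdStan CSV, dropping '#' comment lines.
# Writing the awk program to a file avoids cmd.exe quoting headaches on Windows.
writeLines('!/^#/ { for (n = 1; n <= 10; ++n) printf("%s%s", $n, n < 10 ? "," : "\\n") }',
           "subset.awk")
# shell() routes through cmd.exe, so the redirection works (use system() on Linux)
shell("gawk -F, -f subset.awk cmdstan_output_1.csv > cmdstan_output_1_subset.csv")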

Cygwin is your friend on Windows. But it might not help if you don’t know Unix commands.

You didn’t share your Stan program, but if the space is taken up by transformed parameters, make them local variables in the model block.

Thanks. Unfortunately the parameters I want to drop are declared in the parameters block. I’m considering switching to Linux though; it can’t be that much worse than R :-)

It would, however, be great if you would consider adding a pars argument to CmdStan, as in RStan.

We’ll do that eventually. I just added an issue:

The tidyverse contains a CSV reader that allows you to select which columns to import. Maybe this can be a fairly simple way for R users to extract variables from large CSV files.

I worked around this issue and haven’t tried it myself, so I’ll just leave the suggestion here for posterity.
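
For instance, a sketch with readr (the column names here are hypothetical placeholders):

library(readr)

# Read only the named columns; cols_only() skips everything else.
# The comment argument drops CmdStan's '#' metadata lines.
post <- read_csv("cmdstan_output_1.csv",
                 comment = "#",
                 col_types = cols_only(lp__ = col_double(),
                                       beta.1 = col_double(),
                                       beta.2 = col_double()))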


It’s good to know R has tools for trimming the draws after the fact. Alas, the typical problem case is that saving all the draws takes too much memory or file space in the first place.

What scale is that at? I’m saving half a million parameters for a model I’m developing with CmdStan + R on a laptop. I wouldn’t do full runs there due to disk space, but if I split the header out with sed and read the file in with data.table::fread, it’s really fast and convenient. It’s a good laptop, but nothing amazing.

It’s 8 bytes per parameter per draw in binary. So each iteration at 5e5 parameters is 4e6 bytes (4 MB). If you take 1000 draws, that’s 4e9 bytes (4 GB). Then if you do four chains, that’s 1.6e10 bytes (16 GB), and most laptops break if you try to hold all of that in R. The problem is usually 50K parameters for 100K iterations, because people have it in mind that you need a lot of iterations, as a holdover from the slow mixing of Gibbs and Metropolis.
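
For reference, the same arithmetic as a quick R snippet:

# 8 bytes per double, 5e5 parameters, 1000 draws, 4 chains
bytes_per_value <- 8
n_params <- 5e5
n_draws  <- 1000
n_chains <- 4
(bytes_per_value * n_params * n_draws * n_chains) / 1e9  # 16 GB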

Yeah, that makes sense, but for 90% of model development you can stick to a laptop even if the model is huge; by the time you’re saving 16 GB you should be well past checking convergence, etc. I keep wanting to make a flowchart for people (but haven’t gotten around to it, obviously).

Indeed, a flowchart might help. Anything that can convey that it’s a bad idea to take a gazillion draws. It would be great if we could help users avoid trying to generate 100K draws off the bat and then waiting a week.


… followed by adding “Stan was too slow so I wrote my own Gibbs sampler” to their talk…


It is simple, and there are probably other ways to do it, but after fighting for an hour with fread, read_csv, and command-line tools, the best option that worked for me was to create the following R function:

# file: name of the CSV file for the chain (without the .csv ending)
# vars: the variables (columns) you want to select from the draws
select_vars_cmdstan <- function(file, vars) {
  uncommented <- paste0(file, "_uncommented.csv")

  # Strip CmdStan's '#' comment lines so fread sees a plain CSV
  system(paste0("sed -e '/^#/d' ", file, ".csv > ", uncommented))

  # Read only the selected columns
  post <- data.table::fread(uncommented, select = vars)

  # Overwrite the original file with the subsetted draws
  readr::write_csv(post, paste0(file, ".csv"))
  file.remove(uncommented)
}
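
For example, a hypothetical call (the file and variable names are just placeholders):

select_vars_cmdstan("cmdstan_output_1", c("lp__", "beta.1", "beta.2"))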

The only problem with this is that it effectively overwrites the file, removing all the important metadata such as the inverse mass matrix.
For me it is what I need, because each chain can produce a ~10 GB file, and with my current quota I cannot afford to store all the chains with all the individual parameters.

What effective sample size are you targeting, and how many parameters are there? 10 GB is roughly a billion numbers, which would be 1K draws of 1M parameters.

In RStan, you can prefilter which named variables to save (parameters, transformed parameters, generated quantities).
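
A sketch of that, assuming an RStan workflow (the model file, data list, and parameter name beta are hypothetical):

library(rstan)

# Keep only `beta` in the returned stanfit;
# include = FALSE would drop the listed pars instead
fit <- sampling(stan_model("model.stan"),
                data = stan_data,
                pars = c("beta"),
                include = TRUE)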

The vroom package looks like it may be helpful; it has a similar interface to readr::read_csv(), but is supposedly quite a bit faster than data.table::fread(). It also supports only reading certain columns in, with the same interface as dplyr::select().
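
For example, a sketch (the column names are hypothetical):

library(vroom)

# col_select takes tidyselect helpers, so columns can be picked by name pattern;
# comment = "#" skips CmdStan's metadata lines
post <- vroom("cmdstan_output_1.csv",
              comment = "#",
              col_select = c(lp__, starts_with("beta")))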
