Importing large CmdStan CSV files into R


#1

Hello everyone,

I’m estimating large Stan models where I only care about a small subset of the parameters. I’m using CmdStan for the estimation, saving the chains as CSV files, and doing the postprocessing in R.

I’m starting to run into memory problems: each MCMC chain is several GB, so I don’t have enough memory to consolidate them directly with read_stan_csv.

Is there a simple way to import one chain at a time into R with read_stan_csv, subset the chain to the relevant parameters, and finally combine the subsetted chains using sflist2stanfit?

thanks,
Ole-Petter


#2

My best guess so far is to import the CSV files one at a time and convert them to mcmc.lists. The “As.mcmc.list” function takes a pars argument, so I can strip off the uninteresting parameters at that point.

After that I use “as.mcmc.list” from the coda library to combine the chains, and then convert the result to a shinystan object. E.g.:

library(rstan); library(coda); library(shinystan)
# pars: character vector with the names of the parameters to keep
mcmc1 <- As.mcmc.list(read_stan_csv("cmdstan_output_1.csv"), pars = pars)
mcmc2 <- As.mcmc.list(read_stan_csv("cmdstan_output_2.csv"), pars = pars)
mcmc <- as.shinystan(as.mcmc.list(c(mcmc1, mcmc2)))

This works OK, but I lose the diagnostics etc. that I would have if I could combine the chains directly into a shinystan object. I would be happy to hear if anyone has better ideas.

best,
Ole-Petter


#3

Unix is your friend.

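> awk -F, -v col_start=1 -v col_final=10 '!/^#/ {for (n = col_start; n < col_final; ++n) printf("%s,", $n); printf("\n")}' output.csv

(The values of col_start and col_final are placeholders; set them to the column range you want. Printing with %s keeps the header names and the full precision of the draws.)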
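If counting column positions is awkward, the same idea can be done by column name. This is only a minimal sketch, assuming the usual CmdStan layout (comment lines starting with #, then one header row of parameter names). The file names and the parameter names lp__ and alpha are made up for the demo.

```shell
# Hypothetical demo file standing in for a CmdStan output file.
cat > output_demo.csv <<'EOF'
# model = demo
lp__,accept_stat__,alpha,beta.1,beta.2
# Adaptation terminated
-7.1,0.98,1.5,0.2,0.3
-6.9,0.91,1.7,0.1,0.4
# Elapsed Time: 0.1 seconds
EOF

# Keep only the columns named in "keep"; comment and blank lines are dropped.
awk -F, -v keep="lp__,alpha" '
  BEGIN { n = split(keep, want, ",") }
  /^#/ || NF == 0 { next }
  !seen_header {
    # The first non-comment line is the header: map each name to its position.
    for (i = 1; i <= NF; i++) pos[$i] = i
    for (j = 1; j <= n; j++)
      if (!(want[j] in pos)) { print "missing column: " want[j] > "/dev/stderr"; exit 1 }
    seen_header = 1
  }
  {
    # Runs for the header row too, so the output keeps its column names.
    line = ""
    for (j = 1; j <= n; j++) line = line (j > 1 ? "," : "") $(pos[want[j]])
    print line
  }
' output_demo.csv > subset_demo.csv
```

The resulting subset_demo.csv has a header row and no comment lines, so it should be readable by any ordinary CSV reader.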

#4

It’s the one cmd line tool I never had to learn to use much…


#5

I do agree with @betanalpha, Unix is your friend.

I have written some tips concerning bash & csv files:


maybe it can help

Vincent

#6

Thanks to both of you for helpful comments. I’m using Windows, but I see this might complicate my life more than necessary…


#7

… Rtools on Windows also brings you gawk. Since you can compile Stan models on Windows, you do have a working Rtools… in R you could try

system("gawk …")

I am assuming that the paths are set up correctly in R (I haven’t tried, but it should be easy to fix).


#8

Cygwin is your friend on Windows. But it might not help if you don’t know Unix commands.

You didn’t share your Stan program, but if the space is taken up by transformed parameters, make them local variables in the model block.


#9

Thanks. Unfortunately the parameters I want to drop are declared in the parameters block. I’m considering switching to Linux though - can’t be that much worse than R :-)

It would however be great if you would consider adding a pars argument to CmdStan, as in rstan.


#10

We’ll do that eventually. I just added an issue:


#11

The Tidyverse contains a CSV reader (read_csv in the readr package) that lets you select which columns to import. This may be a fairly simple way for R users to extract variables from large CSV files.

I worked around this issue and haven’t tried it myself, so I’ll just leave the suggestion here for posterity.


#12

It’s good to know R has tools for dealing with post-trimming. Alas, the typical problem case is that it either takes up too much memory or file space to save all the draws.


#13

What scale is that at? I’m saving half a million parameters for a model I’m developing with CmdStan + R on a laptop. I wouldn’t do full runs there because of disk space, but if I split the header out with sed and read the file in with data.table::fread, it’s really fast and convenient. It’s a good laptop but nothing amazing.
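A sketch of one way to do the split described above. The exact commands are not given in the post, so this is an assumption, and the file names are hypothetical.

```shell
# Hypothetical demo file standing in for a CmdStan output file.
cat > demo.csv <<'EOF'
# model = demo
lp__,theta.1,theta.2
# Adaptation terminated
-3.2,0.1,0.2
-3.0,0.3,0.4
EOF

# Drop every comment line, wherever it appears in the file.
sed '/^#/d' demo.csv > demo_clean.csv

# Split what remains into the header row and the purely numeric draws.
sed -n '1p' demo_clean.csv > demo_header.csv
sed '1d' demo_clean.csv > demo_draws.csv
```

demo_draws.csv is then a plain numeric table with no header or comments, which is the kind of input a fast reader such as data.table::fread handles well. The column names can be reattached from demo_header.csv.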


#14

It’s 8 bytes per parameter per draw in binary. So each iteration at 5e5 parameters is 4e6 bytes (4 MB). If you take 1000 draws, that’s 4e9 bytes (4 GB). Then if you do four chains, that’s 1.6e10 bytes (16 GB), and most laptops break if you try to do all that in R. The usual problem case is 50K parameters for 100K iterations, because people have it in mind that you need a lot of iterations as a holdover from the slow mixing of Gibbs and Metropolis.
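The arithmetic is easy to redo for other model sizes. A small sketch, assuming 8-byte doubles as above and a shell with 64-bit integer arithmetic; the function name draws_bytes is made up here.

```shell
# Bytes needed to hold the draws as 8-byte doubles.
# Arguments: number of parameters, draws per chain, number of chains.
draws_bytes() {
  echo $(( 8 * $1 * $2 * $3 ))
}

draws_bytes 500000 1000 1     # one chain of the example above
draws_bytes 500000 1000 4     # four chains
draws_bytes 50000 100000 1    # 50K parameters, 100K iterations, one chain
```

These print 4000000000, 16000000000 and 40000000000, i.e. 4 GB, 16 GB and 40 GB. So the 50K-by-100K case is already 40 GB for a single chain.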


#15

Yeah that makes sense, but for 90% of model development you can stick to a laptop even if the model is huge—by the time you’re saving 16GB you should be well past checking convergence etc… I keep wanting to make a flowchart for people (but haven’t gotten around to it obv.)


#16

Indeed. A flowchart might help. Anything that can convey that it’s a bad idea to do a gazillion draws. What would be great is if we could help users avoid trying to generate 100K draws off the bat and waiting a week.


#17

… followed by adding “Stan was too slow so I wrote my own Gibbs sampler” to their talk…