Alternative .csv reader

maintenance
rstan

#1

@bgoodri @jonah I was processing CmdStan files without rstan and it’s irritatingly slow to use read.csv even with the bells and whistles. I wrote some C++ code to process the main part of the file (it ignores comments, reads the header, ignores the mass matrix at the moment, and reads parameter values). It produces output compatible with rstan::extract and reads/processes a 2.3 GB .csv file about 7x faster than rstan currently does (roughly 60 s vs. 500 s on my slow-ish HDD), including splitting all the parameters into their own arrays in a named list.
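For reference, the “parameters split into their own arrays in a named list” step can be sketched like this in Python. The only assumptions are the CmdStan `name.i.j` column-naming convention and the draws-first layout that `rstan::extract` uses; `split_draws` is a made-up name, not the stannis code itself:

```python
from collections import OrderedDict
import numpy as np

def split_draws(header, draws):
    """Group flat CmdStan columns like 'theta.1.2' into one array per
    parameter, draws-first -- the layout rstan::extract() returns.
    (Illustrative sketch only, not the stannis implementation.)"""
    groups = OrderedDict()
    for j, name in enumerate(header):
        base, *idx = name.split(".")
        groups.setdefault(base, []).append((tuple(int(i) for i in idx), j))
    out = {}
    for base, cols in groups.items():
        if cols[0][0] == ():                       # scalar parameter
            out[base] = draws[:, cols[0][1]]
            continue
        dims = tuple(max(ix[d] for ix, _ in cols)  # shape from max index
                     for d in range(len(cols[0][0])))
        arr = np.empty((draws.shape[0],) + dims)
        for ix, j in cols:                         # 1-based -> 0-based
            arr[(slice(None),) + tuple(i - 1 for i in ix)] = draws[:, j]
        out[base] = arr
    return out
```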

If there’s interest in updating the rstan::read_stan_csv code I could clean it up and put in a PR, but I’m not sure what the requirements are in rstan (I recall when we talked about changing the in-memory output layout to avoid segfaults it sounded like we’d have to wait for rstan3), so I’m asking here first.

The code is in the stannis repo under stannis/src and stannis/inst/include; for historical reasons it’s in the files named zoom*. The R-level function is not exported from the package; it’s stannis:::read_cmdstan_csv(filename).


#2

@aaronjg has a pull request that is probably fine for the next release


although we should probably switch to readr or data.table at some point, now that their dependency lists are not as long as they used to be.


#3

I haven’t run the benchmarks myself, but compared to what @davharris reported, my PR was faster than readr (I haven’t done the comparison to the data.table package yet). It gets its speed improvements because it:

  1. Preallocates the entire data frame before reading in the files
  2. Reads the data and comments in a single pass rather than two passes
  3. Reads directly into the output data frame, rather than reading into a matrix and then converting it into a data frame (an approach that needs enough RAM for two copies).
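A sketch of points 1 and 2 in Python/numpy terms (a hypothetical helper for illustration; the actual PR is R code): preallocate the full array from the header width and a known draw count, then fill it while collecting comments in the same pass.

```python
import io
import numpy as np

def read_stan_csv_single_pass(f, n_draws):
    """Sketch: preallocate from the header width and a known draw count,
    then fill row by row, collecting '#' comments in the same pass.
    (Hypothetical helper, not the actual PR code.)"""
    comments, header = [], None
    for line in f:
        if line.startswith("#"):
            comments.append(line.rstrip("\n"))
        else:
            header = line.rstrip("\n").split(",")
            break
    out = np.empty((n_draws, len(header)))      # preallocated up front
    row = 0
    for line in f:                              # data + comments, one pass
        if line.startswith("#"):
            comments.append(line.rstrip("\n"))
            continue
        out[row] = [float(x) for x in line.split(",")]
        row += 1
    return header, out[:row], comments

csv = "# notes\nlp__,theta.1,theta.2\n-1,0.5,2\n# mass matrix\n-2,0.6,3\n"
header, draws, comments = read_stan_csv_single_pass(io.StringIO(csv), 2)
```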

@sakrejda, I’d be curious if your code is faster than that. I wasn’t seeing 7x speedups, but a lot depends on whether the bottleneck is HDD, memory allocation, or CPU in converting strings to floats.

Also, if you aren’t using R 3.5 yet, you should upgrade, since it includes some big performance improvements for reading files.

I also have a branch that supports reading the CSVs into matrices rather than data frames. However, sampling() et al. still use data frames.

https://github.com/aaronjg/rstan/tree/use_matrices


#4

I can try your code out for a direct comparison. Is it pure R that I can run without installing an rstan branch? I’ll check whether I have R 3.5 installed…

My code is pushing 40 MB/s on a 5400 rpm drive, so it’s getting maybe 30–50% of max throughput. While adding functionality I noticed that parsing text to doubles plus separators (commas) was the biggest hit to speed, and reshaping arrays cost a few percent more. I’m doing this in a single pass with bare string iterators, so I’ll be impressed if pure R gets close. I am buffering each line before processing it, so it could get a little faster if it were truly a single pass… but probably not worth the dev time.


#5

It only uses base R functions, but the heavy lifting of parsing the text is done by the scan function, which is implemented in C. You can test it by cloning the branch (https://github.com/aaronjg/rstan/tree/for-2.18) and loading it with devtools, or you can just load the original rstan, source stan_csv.R and misc.R from that branch, and run read_stan_csv.


#6

Your code breaks on a .csv file with any embedded comments (in my case I get an error when it runs into the mass matrix):

> system.time({o  = read_one_stan_csv('~/builds/contraceptive-4/fits/fit-1f3838ab2e6a50e37dcf3501950ae0fdfb1922a53bd3d2ca1be342515e1984d1/chain-1/output.csv')})
Error in scan(csvfile, what = double(), sep = ",", comment.char = "",  : 
  scan() expected 'a real', got '#Adaptationterminated'

I did try just scan by itself, which doesn’t include parsing the header, and here’s what I get:

> system.time({o  = scan('~/builds/contraceptive-4/fits/fit-1f3838ab2e6a50e37dcf3501950ae0fdfb1922a53bd3d2ca1be342515e1984d1/chain-1/output.csv', what = double(), sep =',', comment.char='#', skip=39)})
Read 260107900 items
   user  system elapsed 
213.999   2.865 216.869 

So no matter what, using scan as the workhorse is 3.3x slower than the code I suggested. I do want to say I don’t think this is the Stan csv-reader olympics, and I think rstan’s read speed is OK (way better than dealing with read.csv); I just shared the code in case there was interest.


#7

Hmm. I actually didn’t touch the read_one_stan_csv code at all; that is just used for reading the output from vb, I believe.

Can you try the read_stan_csv function?

I’m actually quite surprised at how slow the scan function is on its own, since that’s a pure C implementation.


#8

Ran your version; this is R 3.4.2, so I’ll update and check these again once that’s done.

> system.time({o  = read_stan_csv('~/builds/contraceptive-4/fits/fit-1f3838ab2e6a50e37dcf3501950ae0fdfb1922a53bd3d2ca1be342515e1984d1/chain-1/output.csv')})
   user  system elapsed 
581.865   3.508 585.355 

So roughly as slow as the rstan:: version.


#9

Very interesting. That looks different from my benchmarks. What’s the dimensionality of the data set that you are loading? I’m also curious to see the results with R-3.5.0, that had a big speed improvement in my tests.

Any idea what makes your code so much faster than R’s scan function? It looks like a pure C implementation: scan.c and strtod.


#10

I looked at scan.c quickly, and it handles a huge amount of stuff I don’t, like quoting and escape sequences. All of that involves branching at the per-character level. Stan’s .csv files have none of that, so when I scan a line I can get a double, skip a comma, and get another double. I calculate how many doubles I’ll get once, from the header, so I’m not even looking for newlines. Since big Stan files are wide (hundreds of thousands of columns in the file I’m testing on), not long, there’s not much of a hit from buffering each line prior to parsing. Sort of makes sense, but real answers require profiling.
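The “get a double, skip a comma” loop with the column count known from the header might look like this Python stand-in (the real code walks raw chars with C++ iterators; `parse_row` is a made-up name for illustration):

```python
def parse_row(line, ncol):
    """One forward pass over the line: read a double, skip a comma, read
    the next double. No quoting, no escapes, and with ncol known from the
    header there is no search for the newline either.
    (Python stand-in for the C++ iterator loop; name is hypothetical.)"""
    out, start = [0.0] * ncol, 0
    for k in range(ncol):
        end = line.find(",", start)
        if end == -1:
            end = len(line)
        out[k] = float(line[start:end])   # float() tolerates a trailing '\n'
        start = end + 1
    return out
```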


#11

Nice. That’s a win in and of itself.

Indeed.


#12

Hi, can you compare fread from the data.table package with some grep/cat/sed magic? It is not a general solution, but it would be interesting to see the comparison for large files.

system.time({o <- fread('grep -v "^#" ./data/output.csv', sep=',')})

system.time({o <- fread('cat ./data/output.csv | sed "/^[^#]/!d"', sep=',')})

edit.start

Is cat even needed?

system.time({o <- fread('sed "/^[^#]/!d" ./data/output.csv', sep=',')})

edit.end

For large files in Python, one could do either of the following. With numpy only:

import numpy as np

PATH = "./data/output.csv"
with open(PATH, "r") as f:
    for line in f:
        if line.startswith("#"):
            continue
        header = line.rstrip("\n").split(',')
        break
    arr = np.loadtxt(f, delimiter=',')  # or np.genfromtxt(f, delimiter=',')

and if one wants a dataframe with inferred dtypes:

import pandas as pd

df = pd.DataFrame(data=arr, columns=header)
df = df.apply(lambda x: pd.to_numeric(x, downcast='integer'), axis=0)

With pandas (the fastest option among common libraries):

df = pd.read_csv(PATH, comment='#')

(or, if the filename has some utf-8 chars:)

with open(PATH, "r") as f:
    df = pd.read_csv(f, comment='#')

These all skip the comments, so one would need a second pass to extract those, or some other way of reading them.
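That second pass for the comments alone is cheap; a sketch:

```python
def read_comments(path):
    """Second pass, sketched: collect only the '#' lines (adaptation info,
    mass matrix, timing) that the readers above throw away."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f if line.startswith("#")]
```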


#13

Here are some benchmarks on a smaller data set, using R-3.5.0 (which, from my benchmarks on the PR, seems to be roughly twice as fast as R-3.4.x).

rstan 2.17:

> summary(replicate(10,system.time(x <- read_stan_csv("output.csv"))[1]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.221   1.237   1.242   1.255   1.269   1.306

My PR:

> summary(replicate(10,system.time(x <- read_stan_csv("output.csv"))[1]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.002   1.004   1.017   1.008   1.075

My branch using matrices rather than data frames:

> summary(replicate(10,system.time(x <- read_stan_csv("output.csv"))[1]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.8700  0.8740  0.8830  0.8819  0.8858  0.8980
 

The two above do some other stuff to create the RStan object, including parsing the comments, splitting off the diagnostic parameters, and computing means for the parameters.

Other functions that just read the CSV:

>  summary(replicate(10,system.time(x <- scan("output.csv",comment.char="#",skip=39,sep=",",quiet=TRUE))[1]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2770  0.2802  0.2835  0.2827  0.2848  0.2880


> summary(replicate(10,system.time(x <- fread('cat output.csv | sed "/^[^#]/!d"', sep=','))[1]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1780  0.1820  0.1835  0.1836  0.1847  0.1900

> summary(replicate(10,system.time(x <- fread('grep -v "^#" output.csv', sep=','))[1]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1830  0.1847  0.1885  0.1878  0.1908  0.1920

@sakrejda’s stannis::read_stan_csv:

> summary(replicate(10,system.time(x <- stannis::read_stan_csv("output.csv"))[1]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1850  0.1872  0.1880  0.1885  0.1905  0.1910

So with R-3.5, the difference between stannis and scan drops from 3.5x to 1.5x. I think the big difference @sakrejda was noticing came mostly from R-3.4.2’s scan implementation rather than from the buffering in his implementation.

Looking at the profiling in my code, about 50% of the time is spent in doing other things to create the stanfit object, such as formatting the names properly, calculating the parameter means, etc. This is on 100 samples for 10,000 parameters, so that may change for different dimensions of the dataset.

Using the more efficient CSV reader/data-frame builder could potentially improve performance by an additional 30% or so (maybe more, since it looks like the stannis code does some additional reshaping that rstan does not) while still creating the same RStan stanfit objects.


#14

Another criterion — perhaps the main criterion — is peak RAM spike.


#15

Thanks for the comparison! Can’t believe I squeaked in as faster than fread!


#16

Darn, not quite as fast as fread; I still have some work to do.


#17

@bgoodri I’m not exactly sure how to measure that; the max-used gc stats from R seem to be sensitive to when garbage collection is run.

I tried getting at it by setting “ulimit -v” and seeing when R would fail.

It looks like loading R and the rstan libraries gives an overhead of 280 MB.

Then, increasing ulimit in 10 MB steps until the import succeeds gives:

mine: 330 MB - 280 MB = 50 MB
2.17.3: 350 MB - 280 MB = 70 MB
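As a cross-check on the ulimit bisection, peak resident set size can also be read directly from the OS. A minimal sketch in Python (the same `getrusage` idea applies to any process; `peak_rss_mb` is a made-up helper name, and the KB units are the Linux convention):

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process so far, in MB.
    (On Linux ru_maxrss is reported in KB; on macOS it is in bytes.)"""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

before = peak_rss_mb()
buf = [0.0] * 5_000_000          # allocate roughly 40 MB of pointers
after = peak_rss_mb()
```

Unlike gc()-based stats, ru_maxrss is maintained by the kernel, so it is not sensitive to when garbage collection happens to run.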


#18

fread is just reading the file; my benchmarks were timing your entire function, which does some extra reshaping.