Lightweight interfaces - keeping it light

@maedoc is referring to the vroom R package, which we use in cmdstanr to read the CSV files with samples, as opposed to utils::read.csv, which rstan uses in read_stan_csv.

We use it because it's faster in general and because it allows reading only selected columns of the CSV, which avoids wasting memory (and reading only some columns is also faster).
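To make the column-selection idea concrete, here is a minimal Python sketch. It is not cmdstanr's or vroom's code; the function name `read_selected_columns` and the sample data are made up for illustration. In vroom itself the selection is done with the `col_select` argument.

```python
import csv
import io

def read_selected_columns(lines, wanted):
    """Parse only the columns named in `wanted` from an iterable of CSV lines.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(lines)
    header = next(reader)
    # Map each requested column name to its position in the header.
    positions = [header.index(name) for name in wanted]
    columns = {name: [] for name in wanted}
    for row in reader:
        # Only the requested fields are converted to float and stored.
        for name, pos in zip(wanted, positions):
            columns[name].append(float(row[pos]))
    return columns

# Tiny made-up example with three columns, of which we keep two.
sample = io.StringIO("lp__,mu,sigma\n-1.5,0.2,1.1\n-1.7,0.3,0.9\n")
draws = read_selected_columns(sample, ["lp__", "sigma"])
```

Note that this sketch still tokenizes every line; the saving is in what gets converted and kept in memory, which is the part that matters most when a model has thousands of parameters.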

At the time of the PR we ran the following tests (develop = utils::read.csv, PR = vroom):

| branch \ num of param | 19 | 643 | 1283 | 1923 | 2563 | 3203 |
|---|---|---|---|---|---|---|
| develop | 0.3489878 | 5.9398425 | 10.4866965 | 16.3929579 | 21.5311923 | 27.2760901 |
| PR | 0.2396772 | 1.7050161 | 3.2770655 | 4.4713771 | 6.0561974 | 6.7868321 |
| PR - read 50% of parameters | 0.1603785 | 1.1780379 | 2.0339272 | 2.9903495 | 3.5180790 | 4.1634433 |
| PR - read 2 columns (validation) | 0.1399992 | 0.3635323 | 0.6058638 | 0.8312213 | 1.1903284 | 1.3976476 |
| PR - read 1 parameter | 0.1341414 | 0.3551099 | 0.6020284 | 0.8367260 | 1.2255943 | 1.3311992 |

2000 samples per parameter in all cases. The unit is seconds.

Alternatives in R that are as fast as vroom, or in some cases faster, are readr and data.table::fread. However, both struggle with the format of the Stan CSV, specifically with the comments inside the CSV table (where we print the step size and inverse metric after adaptation ends). The metadata before the column names and the timing printed after the samples are not problematic. By default vroom has the same problem, but it has some options that let us work around it.

column names
# adaptation ended
# step size
# inv metric
1
2
3
4
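The layout above can be handled by dropping comment lines wherever they occur before the rows reach the CSV parser. The following is a minimal Python sketch of that idea, under the assumption that every comment line starts with `#`; it is not what cmdstanr or vroom do internally, and `read_stan_like_csv` and the sample text are hypothetical.

```python
import csv
import io

def read_stan_like_csv(lines):
    """Return (header, rows), ignoring '#' comment lines wherever they occur.

    Hypothetical helper for illustration only.
    """
    # Drop blank lines and comment lines before, inside, and after the table.
    data_lines = (
        ln for ln in lines
        if ln.strip() and not ln.lstrip().startswith("#")
    )
    reader = csv.reader(data_lines)
    header = next(reader)
    rows = [[float(x) for x in row] for row in reader]
    return header, rows

# Made-up file mimicking the schematic: comments sit between the
# column names and the draws, as well as before and after the table.
text = """# model = example
lp__,mu
# adaptation ended
# step size
# inv metric
-1.5,0.2
-1.7,0.3
# elapsed time
"""
header, rows = read_stan_like_csv(io.StringIO(text))
```

Filtering line by line keeps the parser simple, at the cost of throwing away the adaptation information in the comments; a real reader would likely want to collect those lines separately.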

vroom also enables reading the CSV lazily, but we are not using that at the moment, because you can't delete or move the CSV files once they have been read in lazily. Doing so requires an R session restart.
