@maedoc is referring to the vroom R package, which we use in cmdstanr to read the CSV files with samples, as opposed to utils::read.csv, which rstan uses in read_stan_csv.
We use it because it's faster in general and because it allows reading only selected columns of the CSV, which avoids wasting memory (and reading only some columns is also faster).
At the time of the PR we ran timing tests (develop = utils::read.csv, PR = vroom):

branch \ number of parameters | 19 | 643 | 1283 | 1923 | 2563 | 3203 |
---|---|---|---|---|---|---|
develop | 0.3489878 | 5.9398425 | 10.4866965 | 16.3929579 | 21.5311923 | 27.2760901 |
PR | 0.2396772 | 1.7050161 | 3.2770655 | 4.4713771 | 6.0561974 | 6.7868321 |
PR - read 50% of parameters | 0.1603785 | 1.1780379 | 2.0339272 | 2.9903495 | 3.5180790 | 4.1634433 |
PR - read 2 columns (validation) | 0.1399992 | 0.3635323 | 0.6058638 | 0.8312213 | 1.1903284 | 1.3976476 |
PR - read 1 parameter | 0.1341414 | 0.3551099 | 0.6020284 | 0.8367260 | 1.2255943 | 1.3311992 |
All cases use 2000 samples per parameter. Times are in seconds.
Alternative R packages that are as fast as vroom, or in some cases faster, are readr and data.table (fread). However, both struggle with the format of the Stan CSV, specifically with the comments inside the CSV table, where we print the step size and inverse metric after adaptation ends. The metadata before the column names and the timing printed after the samples are not problematic. By default vroom has the same problem, but it has options that let us work around it. Schematically, the comments sit between the column names and the rows of samples:

    column names
    # adaptation ended
    # stepsize
    # inv metric
    1
    2
    3
    4

vroom also enables reading the CSV lazily, but we are not using that at the moment, because you can't delete or move the CSV files once they have been read in lazily. Doing so requires restarting the R session.
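
To make the points above concrete, here is a minimal sketch of reading a Stan-like CSV with vroom. It is not the actual cmdstanr code: the file contents, the column names, and the choice of options are assumptions made for illustration. It uses `comment = "#"` to skip the comment lines, `col_select` to read only some columns, and `altrep = FALSE` to read eagerly rather than lazily.

```r
library(vroom)

# Write a tiny Stan-like CSV to a temporary file.
# All names and values here are made up for illustration.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# model = example_model",
  "lp__,accept_stat__,mu,sigma",
  "# Adaptation terminated",
  "# Step size = 0.9",
  "# Diagonal elements of inverse mass matrix:",
  "# 1, 1",
  "-7.1,0.98,0.12,1.05",
  "-6.8,0.91,0.31,0.97",
  "# Elapsed Time: 0.01 seconds (Sampling)"
), path)

draws <- vroom(
  path,
  delim = ",",
  comment = "#",                              # ignore lines starting with '#'
  col_select = c("lp__", "mu"),               # read only the columns we need
  col_types = cols(.default = col_double()),  # all sample columns are numeric
  altrep = FALSE,                             # materialize the data eagerly
  progress = FALSE
)

print(draws)

# Clean up the temporary file once the data is in memory.
unlink(path)
```

With `altrep = FALSE` the data is read into memory in full at the time of the call, which is the eager behaviour described above, at the cost of giving up vroom's lazy loading.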