Sharing one more update here…
I made a simple binary format and some code to read it neatly. The interface could use more work and error checking, but I’m excited to see that it works so well, particularly in terms of memory requirements. The next step for me is adding a ‘binary’ output argument to CmdStan so that it writes this format directly, without going through .csv.
The description of the format is here: https://www.overleaf.com/read/ctpjdwsbrvbw
The package is the same one: https://github.com/sakrejda/stannis
Memory requirements during .csv -> binary conversion are mostly near zero. The exception is the reshaping step, where each named parameter is loaded in full, so you need to be able to hold one parameter at a time as a std::vector<double>. That could be fixed with mmap, but that’s not ready yet. For the moment, the syntax to rewrite a .csv file to binary is (dir must exist):
library(stannis)
run = stannis::read_run(root = path_to_output_csv, uuid = NULL)
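To illustrate why the conversion can stay near zero memory, here is a minimal sketch in Python (not the actual stannis C++ code). It assumes a hypothetical layout, one file of raw little-endian doubles per CSV column, which is not the format described in the link above. The function name csv_to_columns is made up for illustration, and the reshaping step is not covered.

```python
import csv
import os
import struct


def csv_to_columns(csv_path, out_dir):
    """Stream a Stan-style CSV into one raw float64 file per column.

    Hypothetical layout (NOT the stannis format): out_dir/<column>.bin holds
    that column's draws as consecutive little-endian doubles.
    Only one CSV row is held in memory at a time.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = []
    names = []
    n_draws = 0
    try:
        with open(csv_path, newline="") as f:
            # Stan CSV output interleaves '#' comment lines with the data.
            data_lines = (
                line for line in f
                if line.strip() and not line.startswith("#")
            )
            rows = csv.reader(data_lines)
            names = next(rows)  # first non-comment line is the header
            handles = [
                open(os.path.join(out_dir, name + ".bin"), "wb")
                for name in names
            ]
            for row in rows:
                # Append each value to its own column file.
                for handle, value in zip(handles, row):
                    handle.write(struct.pack("<d", float(value)))
                n_draws += 1
    finally:
        for handle in handles:
            handle.close()
    return names, n_draws
```

Keeping one open file handle per column is fine for a sketch, but a model with many thousands of columns would hit the OS limit on open files and need batching.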
To extract a parameter you pass the directory and parameter name to a function:
theta = stannis::get_parameter('./sample', 'theta')
theta_ = stannis::get_parameter('./sample', 'theta', mmap=TRUE)
Both return an array with the correct dimensions (matching rstan), but the second form uses mmap, so the data isn’t loaded into memory until you subscript it. The values differ from rstan’s by less than 1.1e-16.
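For anyone unfamiliar with the mmap idea, here is a rough Python sketch of lazy access. It is not how stannis implements it. It assumes a file containing nothing but little-endian doubles (a hypothetical layout, not the stannis format), and the class name LazyColumn is made up for illustration.

```python
import mmap
import struct


class LazyColumn:
    """Index into a file of little-endian doubles without reading it all.

    The file is memory-mapped, so the OS only pages in the parts that are
    actually subscripted.
    """

    def __init__(self, path):
        self._file = open(path, "rb")
        # Length 0 maps the whole file; this fails on an empty file.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self._map) // 8  # 8 bytes per double

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # Decode a single double at byte offset 8 * i.
        return struct.unpack_from("<d", self._map, 8 * i)[0]

    def close(self):
        self._map.close()
        self._file.close()
```

Creating the object costs almost nothing regardless of file size; the cost is paid per element when you index into it.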
Reading was roughly 2.6 times faster than rstan::read_stan_csv on R 3.5 (100 replicates, 30 MB output.csv). All timings are in seconds:
> mean(timing_stannis)
[1] 2.01799
> mean(timing_rstan)
[1] 5.20083
> sd(timing_stannis)
[1] 0.03693852
> sd(timing_rstan)
[1] 0.09013593
> mean(timing_rstan/timing_stannis)
[1] 2.577687
> sd(timing_rstan/timing_stannis)
[1] 0.04634689
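As a quick sanity check on the numbers above, the ratio of the two mean timings can be compared with the mean of the per-replicate ratios. This is just arithmetic on the reported figures, written in Python; the variable names are my own.

```python
# Summary figures copied from the R output above (seconds).
mean_stannis = 2.01799
mean_rstan = 5.20083
sd_stannis = 0.03693852
sd_rstan = 0.09013593

# Ratio of means; close to, but not identical to, the mean of ratios (2.577687).
ratio_of_means = mean_rstan / mean_stannis

# Relative spread (sd / mean) of each set of timings.
cv_stannis = sd_stannis / mean_stannis
cv_rstan = sd_rstan / mean_rstan

print(round(ratio_of_means, 3), round(cv_stannis, 3), round(cv_rstan, 3))
```

Both approaches give a speedup of about 2.58, and both sets of timings vary by under 2% of their mean across replicates.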
Version info:
R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
...
[1] "magrittr"
> library(rstan)
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
The parsing takes a fair bit of C++, but getting CmdStan to produce binary output directly would take quite a bit less code.