general questions: for all CmdStanX wrappers, should part of the contract be “minimal library dependencies”? If so, how do we decide when using a library is OK?
context: discussion from https://github.com/stan-dev/cmdstanr/issues/243 -
CmdStanR started using R package vroom to speed up csv reading -
as the name suggests, vroom is faster than read.csv().
(Ideally, Stan I/O will improve so that we won’t need to deal with CSV files at all - that is still a long way off, but it’s definitely something we should work on.)
Unfortunately, adding a single dependency on vroom (which seemed like a good idea, since it’s much faster than read.csv()) is really like adding many dependencies, given how intertwined the tidyverse/r-lib/etc. packages are with each other. I really wanted to avoid depending on any of those packages, but we don’t have the person-hours to commit to writing our own fast CSV reader!
Thanks for bringing this up @mitzimorris, these are good questions. I definitely think we should try to minimize the dependencies, but I don’t have a great answer for when something is important enough to add a dependency. How much of a speed up in CSV reading is worth risking adding a dependency? I really don’t know.
And some package for dealing with JSON too, right? And os, right? Plus all the dependencies of NumPy and Pandas (just as vroom also has its own dependencies).
Currently for CmdStanR the dependencies are (not including dependencies of dependencies):
checkmate (the best argument-checking package for R I know of, though it could be replaced by a bunch of hand-written argument checks of our own)
processx (for running and controlling system processes. This works much better than using what ships with R for this purpose)
R6 (reference semantics; enables us to offer methods via the $ operator and keep a similar interface between R and Python)
jsonlite (best R package for json that I know of)
vroom (fastest CSV reading package for R that I know of)
We’re also using the posterior package, but we have full control over that.
The easiest dependency to replace would be checkmate (it would just require us to write a bunch of code to check argument types/bounds/etc). The others all really provide essential functionality I think, unless we determine that fast CSV reading isn’t essential. Of course we could write our own versions of all of these packages, but we don’t have the resources.
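To make that trade-off concrete, here is a minimal sketch, in base R only, of the kind of hand-written type/bounds check that would replace a checkmate call (the helper name is hypothetical, not CmdStanR’s actual code):

```r
# Hypothetical replacement for a checkmate-style integer assertion,
# written with no dependencies beyond base R.
assert_int_scalar <- function(x, lower = -Inf, upper = Inf) {
  ok <- is.numeric(x) && length(x) == 1 && !is.na(x) &&
    x == as.integer(x) && x >= lower && x <= upper
  if (!ok) {
    stop("Value must be a single integer in [", lower, ", ", upper, "].",
         call. = FALSE)
  }
  invisible(x)
}

assert_int_scalar(4, lower = 1)      # passes, returns 4 invisibly
# assert_int_scalar(0.5, lower = 1)  # errors
```

Multiply this by every argument of every exported function (plus unit tests) and the maintenance cost of dropping checkmate becomes clearer.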
So far the only one that has been problematic is vroom: in a small number of cases there is an error that neither @rok_cesnovar nor I can reproduce.
This is the first problematic thing that has popped up; it seems to be a problem only for R 3.6, and I think it mostly comes down to improperly set minimum versions of the packages used.
I am not sure the comparison is completely fair, though. cmdstanpy only relies on NumPy and Pandas, but those are both huge libraries with tons of functionality. Yes, anyone doing Bayesian stats seriously probably has them installed by default, but that is probably also true for most of the packages cmdstanr relies on, apart from maybe vroom.
I personally think fast IO is important and the addition of vroom also enables reading only selected columns, which is also very useful imo and will also take some time to do in cmdstan (the issue for this in cmdstan is ~3 years old). We can always revert, would not oppose if that is what we decide.
I kind of doubt binary output will be coming to cmdstan anytime soon. The binary-format wars are looking to me like the new text-editor wars, and I don’t see a resolution everyone will be pleased with in sight.
p.s.: Jonah already wrote most of this by the time I finished typing, but since I had it written I might as well post :)
A minor non-R point: I find the major appeal of working with CmdStan is simply that the C++ toolchain aspects are all debuggable in pure shell, Make, g++ etc. What libraries are involved in the wrapper itself is mostly irrelevant in my opinion, if they have any positive value. If it were me, I would support vroom with fallback to whatever standard parser is available.
@maedoc is referring to the vroom R package we are using in cmdstanr to read the CSV files with samples as opposed to utils::read.csv rstan is using in read_stan_csv.
We use it because it’s faster in general and because it allows reading only selected columns of the CSV, which avoids wasting memory (and reading only some columns is also faster).
At the time of the PR we ran tests (develop = utils::read.csv, PR = vroom)
[Benchmark table omitted here: read times in seconds for the develop and PR branches across varying numbers of parameters, plus PR rows for reading 50% of the parameters, reading 2 columns (validation), and reading 1 parameter. 2000 samples per parameter in all cases.]
Alternative R packages that are as fast as, or in some cases faster than, vroom are readr and data.table’s fread(). However, those two struggle with the format of the Stan CSV: both choke on the comments inside the CSV table (where we print the step size and inverse metric after adaptation ends). The metadata before the column names and the timing printed after the samples are not problematic. vroom by default also has problems with those mid-file comments, but it has options with which we can get around this.
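The workaround described above can be sketched in base R alone: strip the comment lines before handing the text to a parser. This is only an illustration (the function name and example column names are hypothetical, not cmdstanr’s actual implementation):

```r
# Sketch: read a Stan-style CSV that interleaves '#' comment lines
# (metadata, adaptation info, timing) with the actual samples.
read_stan_csv_basic <- function(path) {
  lines <- readLines(path)
  # Drop every line that starts with '#', keeping header and data rows
  data_lines <- lines[!grepl("^#", lines)]
  read.csv(text = paste(data_lines, collapse = "\n"))
}

# Tiny example mimicking the Stan CSV layout:
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "# model = example",
  "lp__,theta",
  "# Adaptation terminated",
  "# Step size = 0.8",
  "-7.3,0.25",
  "-7.1,0.31"
), tmp)
draws <- read_stan_csv_basic(tmp)
nrow(draws)  # 2 rows of samples
```

Pre-filtering like this costs an extra pass over the file, which is part of why a reader with built-in comment handling is attractive.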
thanks for the detailed explanation.
I didn’t mean to criticize the use of vroom in this thread - yes, I was concerned, but now I’m not.
I think that @maedoc’s suggestion of how to handle this is a fine solution and a great way to move forward.
CmdStanPy also struggles with this one - agreed that we need workarounds now and can’t wait for a long-term fix to stan::io.
I hope you don’t mind a suggestion from a user. Consider making vroom optional – put it in the suggests section of the DESCRIPTION file and do a check for its presence in the code, with a fallback if it’s not installed.
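That pattern could look something like the sketch below. The function name is hypothetical (not CmdStanR’s actual API); it assumes vroom’s comment argument and read.csv’s comment.char can each be used to skip the comment lines in the Stan CSV:

```r
# Sketch of the Suggests-plus-fallback pattern: use vroom when it is
# installed (listed under Suggests in DESCRIPTION), otherwise fall back
# to the base R reader.
read_samples_csv <- function(path) {
  if (requireNamespace("vroom", quietly = TRUE)) {
    # Fast path: vroom; comment = "#" skips the adaptation comments
    vroom::vroom(path, comment = "#")
  } else {
    # Fallback: slower, but adds no dependency;
    # comment.char = "#" likewise drops the comment lines
    read.csv(path, comment.char = "#")
  }
}
```

Users who care about speed install vroom and get it picked up automatically; everyone else still gets a working reader.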
I’m still reinstalling packages after updating R to 4.0.0 three months ago – many of the out-of-date ones are from the not-so-tidyverse, and it often takes several tries to install a package I want to use because I have to reinstall its dependencies, and sometimes dependencies of dependencies (and so on). It’s not a huge problem because I’m on a PC I own and have complete control over, but it’s a bit of a time suck.
That said, I’ll install vroom if I need to. I got a nice speedup (factor of 3.1) in sampling time using reduce_sum() on a model that’s amenable to threading, but I’m giving back some of that because of the extra overhead in file I/O.
Yeah I empathize with that. I definitely want to encourage people to use vroom to get the speedup over read.csv(), but you’re right that we don’t need to force people to install it. So I’d be happy to move vroom from Imports to Suggests. Here’s the pull request:
There is a PR open adding a simple version of this, but it has been open for 2 months now, so they don’t seem particularly eager to add it. And I kind of doubt that a simple implementation will work with the cmdstan CSV.
Just changing the existing CSV format is going to be problematic as it requires a bump in the major version. And a monolithic CSV has its use-case and appeal. No denying that.
What we could do is add a “wrapper friendly” mode for cmdstan. That mode would have:
friendlier stdout that makes it easier to skip the initial metadata dump (basically, easier to show progress, warnings, etc.)
write three separate CSV files: a metadata file with the step size and inv_metric as actual CSV data, a samples CSV, and an optional warmup CSV
The latter two would have no comments, just a typical CSV: a header in the first line, then data. The samples and warmup CSVs could also be merged into a single file.
If we did this, we could just use the data.table R package and not worry about any of this. Had I known the dependencies of the vroom package were such a problem, I would have held off until a fast and lighter package was found. I mostly work on Linux, where packages are always installed from source, so I always expect a package installation to take some time.
I suggested parts of this a while ago (the parse-friendly stdout), but it was turned down. I did not want to waste any more time on it, so I didn’t push further. I do still think it would be great for cmdstanr, cmdstanpy, the Julia wrapper, and any other wrapper that might pop up.
So are you still going with checkmate? I have implemented various argument checks here. They use the deparse(substitute()) mechanism to provide somewhat informative error messages, with variable names taken from one level up the call stack, which I think is usually sufficient. They are probably not as optimized for speed as checkmate, but they are fairly lightweight.
If you want, I can grep your uses of checkmate and see whether I have already implemented, or could implement with minimal effort, all the checks you’d need along with their unit tests. Then I could submit a PR if you want to drop the dependency on checkmate.
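For readers unfamiliar with the mechanism, here is a minimal sketch of such a check (the function name is hypothetical, not taken from the linked implementation):

```r
# Argument check that recovers the caller's variable name via
# deparse(substitute()) so the error message names the offending argument.
check_positive_scalar <- function(x) {
  name <- deparse(substitute(x))  # e.g. "num_chains" below
  if (!is.numeric(x) || length(x) != 1 || x <= 0) {
    stop("'", name, "' must be a positive numeric scalar.", call. = FALSE)
  }
  invisible(TRUE)
}

num_chains <- -1
# check_positive_scalar(num_chains)
# Error: 'num_chains' must be a positive numeric scalar.
```

Because substitute() captures the unevaluated argument expression, the message can point at the user’s own variable without any dependency on checkmate.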