Lightweight interfaces - keeping it light

general questions: for all CmdStanX wrappers, should part of the contract be “minimal library dependencies”? If so, how do we decide when using a library is OK?

context: discussion from -
CmdStanR started using R package vroom to speed up csv reading -
as the name suggests, vroom is faster than read.csv().
(Ideally, Stan I/O will improve and we won’t need to deal with CSV files - that is still a long way off, but definitely something we should work on.)

Jonah notes:

Unfortunately adding a single dependency on vroom (which seemed like a good idea since it’s much faster than read.csv()) is really like adding a lot of dependencies with how intertwined the tidyverse/r-lib/etc packages are with each other. I really wanted to avoid depending on any of those packages but we don’t have the person hours to commit to writing our own fast csv reader!

CmdStanPy’s original goals include:

  • Clean interface to Stan services so that CmdStanPy can keep up with Stan releases.
  • Easy to install,
    • minimal Python library dependencies: numpy, pandas
    • Python code doesn’t interface directly with c++, only calls compiled executables
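That last design point - calling compiled executables rather than binding to C++ - can be sketched in a few lines. This is purely illustrative: `echo` stands in for a compiled Stan model binary, and the arguments are made up.

```python
import subprocess

def run_executable(exe, args):
    """Run a compiled executable and capture its stdout - the wrapper
    shells out to a binary instead of interfacing with C++ directly."""
    result = subprocess.run([exe, *args], capture_output=True, text=True, check=True)
    return result.stdout

# 'echo' is a stand-in for a compiled model binary such as ./bernoulli;
# the arguments below are illustrative, not real CmdStan arguments.
output = run_executable("echo", ["method=sample", "data", "file=data.json"])
```

The payoff of this design is that a new Stan release only requires a recompiled binary, not a rebuilt wrapper.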

CmdStanPy needs NumPy and Pandas, and now that it’s 2020, Python 3 seemed like an OK requirement as well - this hasn’t met with complaints so far.

What about R? What are the minimal, stable, useful libraries that we need? Which things have proven to be problematic?


Thanks for bringing this up @mitzimorris, these are good questions. I definitely think we should try to minimize the dependencies, but I don’t have a great answer for when something is important enough to add a dependency. How much of a speed up in CSV reading is worth risking adding a dependency? I really don’t know.

And some package for dealing with JSON too, right? And os, right? Plus all the dependencies of NumPy and Pandas (just as vroom has its own dependencies).

Currently for CmdStanR the dependencies are (not including dependencies of dependencies):

  • checkmate (best argument checking package for R I know of, but this could be replaced by a bunch of hand-written argument checks by us)
  • processx (for running and controlling system processes. This works much better than using what ships with R for this purpose)
  • R6 (reference semantics; enables us to offer methods via the $ operator and have a similar interface between R and Python)
  • jsonlite (best R package for json that I know of)
  • vroom (fastest CSV reading package for R that I know of)

We’re also using the posterior package, but we have full control over that.

The easiest dependency to replace would be checkmate (it would just require us to write a bunch of code to check argument types/bounds/etc). The others all really provide essential functionality I think, unless we determine that fast CSV reading isn’t essential. Of course we could write our own versions of all of these packages, but we don’t have the resources.

So far the only one that has been problematic is vroom: in a small number of cases there is an error that neither @rok_cesnovar nor I can reproduce.

This is the first problematic thing that popped up; it seems to be a problem only for R 3.6, and I think it mostly comes down to not properly setting the minimum versions of the packages used.

I am not sure the comparison is completely fair, though. cmdstanpy only relies on numpy and pandas, but those are both huge libraries with a ton of functionality. Yes, anyone doing Bayesian stats seriously probably has them installed by default, but that is probably also true for most of the packages cmdstanr relies on, apart from maybe vroom.

I personally think fast IO is important, and the addition of vroom also enables reading only selected columns, which is also very useful imo and will take some time to do in cmdstan (the issue for this in cmdstan is ~3 years old). We can always revert; I would not oppose it if that is what we decide.

I kind of doubt binary output will be coming to cmdstan anytime soon. Binary-format wars are looking to me like the new text-editor wars, and I don’t see a resolution everyone will be pleased with in sight.

p.s.: Jonah already wrote most of this by the time I finished typing, but since I had it written I might as well post :)


A minor non-R point: I find the major appeal of working with CmdStan is simply that the C++ toolchain aspects are all debuggable in pure shell, Make, g++ etc. What libraries are involved in the wrapper itself is mostly irrelevant in my opinion, if they have any positive value. If it were me, I would support vroom with fallback to whatever standard parser is available.


Good question! Just checked on everything that’s being used in CmdStanPy -
aside from the Python standard library, CmdStanPy only needs numpy and pandas.

numpy says it doesn’t need other Python libs, but it does need linear algebra libs - details here:

pandas needs numpy, and also setuptools and date/timezone utilities -

the JSON thing is coded up to be optional - use ujson if available, otherwise fall back to the standard library json module
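That optional-dependency pattern is small enough to show in full; a minimal sketch of it:

```python
# Prefer the faster ujson when it is installed, otherwise fall back to
# the standard library - the optional-dependency pattern described above.
try:
    import ujson as json  # optional third-party speedup
except ImportError:
    import json  # always available in the standard library

payload = json.dumps({"N": 10, "y": [0, 1, 0, 1]})
data = json.loads(payload)
```

Either module exposes the same `dumps`/`loads` calls, so the rest of the code never needs to know which one was imported.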

here’s relevant bits of the config:

INSTALL_REQUIRES = ['numpy', 'pandas']

EXTRAS_REQUIRE = {
    'all': ['tqdm', 'ujson'],
    'tests': ['pytest', 'pytest-cov', 'testfixtures'],
    'docs': [

gets fancy quickly if you want to do development, especially if you want to create docs (always the biggest PITA in my experience).

keeping it light ain’t easy - as they say in France: “il faut garder la ligne” (literally: gotta watch the (waist)line).



I agree fast IO is important. I think we should keep vroom (it’s really great) and do what @maedoc suggested (fall back to read.csv() if we catch an error).


I hadn’t heard of vroom before. It looks like a persistence framework for C++:

We’d have to be able to control the serialization and not just use defaults, because our low level C++ changes a lot.

@maedoc is referring to the vroom R package we are using in cmdstanr to read the CSV files with samples as opposed to utils::read.csv rstan is using in read_stan_csv.

We use it because it’s faster in general and because it allows reading in only selected columns of the CSV so as not to waste memory (and reading only some columns is also faster).

At the time of the PR we ran tests (develop = utils::read.csv, PR = vroom)

branch \ num of params                   19        643       1283       1923       2563       3203
develop                           0.3489878  5.9398425 10.4866965 16.3929579 21.5311923 27.2760901
PR                                0.2396772  1.7050161  3.2770655  4.4713771  6.0561974  6.7868321
PR - read 50% of parameters       0.1603785  1.1780379  2.0339272  2.9903495  3.5180790  4.1634433
PR - read 2 columns (validation)  0.1399992  0.3635323  0.6058638  0.8312213  1.1903284  1.3976476
PR - read 1 parameter             0.1341414  0.3551099  0.6020284  0.8367260  1.2255943  1.3311992

2000 samples per parameter in all cases. The unit is seconds.

Alternative packages in R that are as fast as, or in some cases faster than, vroom are readr and data.table’s fread. However, those two struggle with the format of the Stan CSV: both choke on the comments inside the CSV table (where we print the step size and inverse metric after adaptation ends). The metadata before the column names and the timing printed after the samples are not problematic. vroom by default also has problems with the mid-table comments, but has some options with which we can get around this.

column names
# adaptation ended
# step size
# inv metric
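One way a wrapper can cope with those mid-table comment lines is to filter them out before the rows ever reach the CSV parser. A minimal sketch with a made-up miniature of the sampler output:

```python
import csv
import io

# Made-up miniature Stan-style CSV: comment lines appear before,
# inside, and after the table, mimicking the layout described above.
raw = """# model = bernoulli
lp__,theta
# Adaptation terminated
# Step size = 0.9
-7.3,0.25
-6.8,0.31
# Elapsed time: 0.01 seconds
"""

# Drop comment lines, then hand the remaining stream to the CSV parser.
rows = csv.reader(line for line in io.StringIO(raw) if not line.startswith("#"))
header = next(rows)
draws = [dict(zip(header, map(float, row))) for row in rows]
```

This is the same idea as vroom’s comment-skipping option, just spelled out by hand in the standard library.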

vroom also enables reading the CSV lazily, but we are not using that at the moment, as you can’t delete or move the CSV files while they are being read lazily; it requires an R session restart to do that.


thanks for the detailed explanation.
I didn’t mean to criticize the use of vroom in this thread - yes, I was concerned, but now I’m not.
I think that @maedoc’s suggestion of how to handle this is a fine solution and a great way to move forward.

CmdStanPy also struggles with this one - agreed that we need workarounds now and can’t wait for a long-term fix to stan::io.


I hope you don’t mind a suggestion from a user. Consider making vroom optional – put it in the suggests section of the DESCRIPTION file and do a check for its presence in the code, with a fallback if it’s not installed.

I’m still reinstalling packages after updating R to 4.0.0 3 months ago – many of the out of date ones are from the not so tidyverse and it often takes several tries to install a package I want to use because I have to reinstall its dependencies and sometimes dependencies of dependencies (and dependencies…). It’s not a huge problem because I’m on a PC I own and have complete control over, but it’s a bit of a time suck.

That said, I’ll install vroom if I need to. I got a nice speedup (factor of 3.1) in sampling time using reduce_sum() on a model that’s amenable to threading, but I’m giving back some of that because of the extra overhead in file I/O.


From the benchmarks here (vroom’s own benchmarks for all-numeric data) it looks like data.table is a lot faster, right? Plus data.table has zero dependencies.


Very happy to get suggestions!

Yeah I empathize with that. I definitely want to encourage people to use vroom to get the speedup over read.csv(), but you’re right that we don’t need to force people to install it. So I’d be happy to move vroom from Imports to Suggests. Here’s the pull request:


I believe it was the same problem with data.table, unable to parse cmdstan CSV.

Edit: yes, data.table = fread

Edit 2: The problem is the lack of a comment.char option in fread (the issue dates back to 2014).

There is a PR open to add a simple version of it, but the PR has been open for two months now, so they don’t seem particularly eager to add this. And I kind of doubt that a simple implementation will work with the cmdstan CSV.

Just changing the existing CSV format is going to be problematic as it requires a bump in the major version. And a monolithic CSV has its use-case and appeal. No denying that.

What we could do is add a “wrapper friendly” mode for cmdstan. That mode would have:

  • more friendly stdout which makes it easier to skip the initial metadata dump (basically easier to show progress, warnings, etc.)
  • write three separate CSV files: metadata with the step size and inv_metric as the actual CSV data, samples CSV, warmup CSV (optional)

The latter two would have no comments, just your typical CSV: header in the first line and then data. The samples and warmup CSV can also be one merged CSV.
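As a rough illustration of what such a split amounts to - here done after the fact in the wrapper with made-up output lines, rather than inside CmdStan as proposed:

```python
def split_stan_csv(lines):
    """Separate comment lines (metadata: step size, inv metric, timing)
    from plain CSV rows (header + draws). A sketch of the 'wrapper
    friendly' split proposed above, applied post hoc in the wrapper."""
    metadata, table = [], []
    for line in lines:
        target = metadata if line.lstrip().startswith("#") else table
        target.append(line.rstrip("\n"))
    return metadata, table

# Made-up miniature sampler output.
meta, table = split_stan_csv([
    "# stan_version_major = 2\n",
    "lp__,theta\n",
    "# Adaptation terminated\n",
    "-7.3,0.25\n",
])
```

The `table` part is then a plain CSV (header in the first line, then data) that any fast reader - data.table included - can parse without special options.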

If we did this, we could just use the data.table R package and not worry about any of this. Had I known the dependencies of the vroom package were such a problem, I would have held off until a fast and lighter package was found. I mostly work on Linux, where packages are always installed from source, so I always expect a package installation to take some time.

I suggested parts of this a while ago (the parse-friendly stdout) but it was turned down. I did not want to waste any more time on it, so I didn’t go further. I do still think it would be great for cmdstanr, cmdstanpy, the Julia wrapper, and any others that might pop up.


Regarding the Stan CSV file format, there’s a complete description of the contents of the sampler output in the new CmdStan guide, with examples and caveats:

Still todo: examples of the CSV file for the other available methods.


Other ideas we could do with the wrapper-friendly-mode, inspired by @mike-lawrence’s ezStan package:

  • write diagnostics in a separate file so they are simple and fast to read
  • have a separate file for only reporting divergence/max treedepth hits that can be reported in real time (possibly enable a feature to stop sampling when divergences occur/max treedepth is hit)

what do you mean by diagnostics? the output columns accept_stat__, stepsize__, treedepth__, et al.?


Hi Jonah,

So are you still going with checkmate? I have implemented various argument checks here. They use the deparse(substitute()) mechanism to provide somewhat informative error messages with variable names from one level up in the parse tree, which I think is usually sufficient. They are probably not as optimized for speed as checkmate, but they are rather lightweight.

If you want, I can grep your use of checkmate and see if I have already implemented, or with minimal effort can implement, all the checks that you’d need and their unit tests. Then I could submit a PR if you want to drop the dependency on checkmate.