Various observations on rstan, cmdstanr, pystan and cmdstanpy after teaching with all in parallel

I recently ran an introductory course for the Royal Statistical Society, which I did the last couple of years with rstan only, but now for the first time I offered a choice of rstan, pystan, cmdstanr or cmdstanpy. This gave me an opportunity to closely compare them. I offer up these pretty paltry observations in case they inspire ideas among developers of these and other interfaces.

I see some effort to converge on the names of arguments bearing fruit. It would be nice to have more of this. Of course, it's possible to have two options available, one deprecated, like warmup= and num_warmup=.

Suppose you want 1000 warmup draws and then 1000 retained after that. rstan requires warmup=1000, iter=2000, while all the others take some form of num_warmup=1000, num_samples=1000. That’s a potential trap for the unwary.

rstan and cmdstanr accept a list or a filename for data. Cmdstanpy and pystan want only a dictionary. Writing and reading JSON in Python is a massive pain, at least in my experience, so it would be great if pystan and cmdstanpy took data files as an argument, and also offered their own data-writing function that could handle numpy arrays and the various float widths that might entail.

cmdstanr, pystan and cmdstanpy all require you to compile the model and then sample. rstan allows a single function stan() to rule them all. I’m not saying I like rstan::stan(), just that it’s good to have options for everyone to pick from, especially beginners.

cmdstanr, cmdstanpy and pystan all have syntax for sampling that looks and feels like applying a sample method to a model object. I like that, I suppose others might not. In R, especially, it is still quite unusual and many people expect functions everywhere. rstan is all about the functions.

In general, I am annoyed by Python packages that are called x for pip install and then when you go to import them, they are suddenly called y. Like pystan. Maybe that’s just me, but every time I explain to learners that they have to look out for things like that, it deflates their enthusiasm a little.
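For example, with pystan 3 the name you install is not the name you import:

# installed with: pip install pystan
# but imported as:
import stan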

It’s a shame about Jupyter notebooks, pystan and nest-asyncio. I don’t know enough about the issue to comment. Maybe there’s a way to get around it.

pystan requires you to provide the data and the seed when you compile the model. Maybe you can overrule them in the sample method’s kwargs, I don’t know. This hurts my simulation study soul, because if I want to swap the data repeatedly, I don’t want to recompile or even check for code changes each time.
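The pattern, as far as I can tell, is something like this (the model and data here are just toys for illustration):

import stan

program = """
data { int<lower=0> N; int<lower=0, upper=1> y[N]; }
parameters { real<lower=0, upper=1> theta; }
model { theta ~ beta(1, 1); y ~ bernoulli(theta); }
"""

data = {"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}

# data and random_seed are fixed here, at build time...
posterior = stan.build(program, data=data, random_seed=1)
# ...and only sampling settings are given here
fit = posterior.sample(num_chains=4, num_warmup=1000, num_samples=1000)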

I think that the only way to have different initial values for chains in cmdstanpy is to supply a file (JSON? oh no).
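Something like this, if I have understood the docs correctly (the file names and the parameter theta are just from my toy example):

import json
from cmdstanpy import CmdStanModel

# write one inits file per chain
for i, theta0 in enumerate([0.1, 0.3, 0.5, 0.7]):
    with open(f"inits_{i}.json", "w") as f:
        json.dump({"theta": theta0}, f)

model = CmdStanModel(stan_file="bernoulli.stan")
fit = model.sample(data="bernoulli.data.json", chains=4,
                   inits=[f"inits_{i}.json" for i in range(4)])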

I know readthedocs is convenient, and I’m not criticising anyone because my documentation is the worst, but really the hello world examples there are not sufficient for beginners, and the layout of the API reference takes some getting used to.

I’m thinking of just running with the two cmdstan interfaces next year, but we’ll see. I almost entirely use cmdstanr now, and as I said before, I’m advising any StataStan users to switch to cmdstanpy via the Python integration feature in Stata 16+ – and I’ll be writing a tutorial on that soon.

Ciao ciao
Robert

16 Likes

Useful writeup!

I think it’s very easy for the more experienced of us to ignore the warts because we’re used to them. Fresh eyes make the issues much more obvious.

Thanks for the observations, this is really valuable.

The following comments are not meant to suggest that we shouldn't learn from these problems.

We probably would want to use the same keywords in all interfaces (currently pystan and cmdstanpy differ a bit).

In the simple case it sounds easy:

(reading)

import json
# read a JSON data file into a Python dict
with open("path/to/file.json") as f:
    data_dict = json.load(f)

(writing)

import json
# write a Python dict out as a JSON data file
with open("path/to/file.json", "w") as f:
    json.dump(data_dict, f)

But this will of course fail with numpy arrays and such. Currently both pystan and cmdstanpy have similar functions in the background (which do handle numpy arrays), so there could be a user-facing function that exposes them.
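For example, a user-facing version could be a thin wrapper around json.dump with a converter for numpy types (this is only a sketch; the function name is made up):

import json
import numpy as np

def dump_stan_data(path, data):
    # convert numpy arrays and scalars to plain Python types
    def default(obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        raise TypeError(f"cannot serialize {type(obj)}")
    with open(path, "w") as f:
        json.dump(data, f, default=default)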

… I feel your pain, still I like the stan name :D

This is really a Jupyter problem, which means it should be fixed there.
One option would be to always call nest-asyncio when using pystan, but I see many potential bugs in that solution.

This was selected after a long conversation (model + data → posterior → sample from the posterior).

pystan (httpstan, actually) caches the model, so there is no real recompilation, just reading the model from the cache. So it is fine to call the build command multiple times in a for loop.
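So a simulation-study loop like this only pays the compilation cost on the first iteration (program is the Stan code string, and simulate_data stands in for whatever simulator the user has):

import stan

for seed in range(100):
    data = simulate_data(seed)  # user-supplied simulator
    # cache hit after the first build: the compiled model is reused
    posterior = stan.build(program, data=data, random_seed=seed)
    fit = posterior.sample(num_chains=4)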

At some point we probably want to recheck all our IO in cmdstanpy. Tbh I’m not sure what the current solution for this is.

4 Likes

Just a thought on this – It seems possible to effectively alias pystan and stan in one installation in Python. The idea being you create a new package directory in the repo:

stan/
     __init__.py
     common.py
     fit.py
     model.py
     plugins.py
pystan/
     __init__.py
pyproject.toml
setup.cfg

And within the pystan/__init__.py file, you have:

from stan import *

Then everything in the stan namespace is copied over to the pystan namespace as well. The only additional thing you need to do is tell the build instructions to install both the stan and pystan packages, which is accomplished in the pyproject.toml:

packages = [
    { include = "stan" },
    { include = "pystan" },
]

I tested this out locally and it seems to work without issue. It could alleviate some of the issues that come up when people install pystan and can’t import it with the same name. (Additionally, it looks like @ariddell has the stan package on PyPI, so in theory you could have the build instructions push the package to both the pystan and stan PyPI repositories.)

I know the idea of having two ways to install/import the same package is kind of insane, but it’s a thought.

To continue hijacking this thread.

One option would be to have pystan do some introspection on import to check whether or not it is being run within a Jupyter notebook. For example, putting the following code snippet in stan/__init__.py would run nest-asyncio whenever the package is imported in a notebook:

try:
    # get_ipython() is only defined when running under IPython/Jupyter
    ipython_shell = get_ipython().__class__.__name__
    if ipython_shell == 'ZMQInteractiveShell':
        # ZMQInteractiveShell means a Jupyter kernel, not a terminal session
        import nest_asyncio
        nest_asyncio.apply()
except NameError:
    # plain Python interpreter: nothing to do
    pass

I tried it out locally and it worked fine – in practice, if people already need to invoke nest-asyncio when running in Jupyter, this could be a fine band-aid to use. It really doesn’t look like a proper fix is forthcoming any time soon.

Thanks for the ideas.

That would be really helpful for anyone who isn’t highly experienced with Python. I give students the data in CSV, .rds for R and .json for Python, but the real world isn’t so kind! And I know most researchers don’t want to think about intermediate data storage formats, they just want to hit the button and get Bayesian!

That’s great – I didn’t know that. I still prefer compiling a sampler that contains a likelihood and a prior, and after compilation will take data, inits, seed, etc as input, but I can live with it.

3 Likes

I also find the current pystan behavior absolutely confusing. I want to build the model, and then maybe set the data that induce the posterior, but I don’t see how you can build a posterior.

I completely agree.

Could you point to this long conversation? I would like to be convinced that this makes any sense at all. (I have commented elsewhere that it is a pain point for upgrading the jupyter/stan %magic functions which operate on a code block for compilation, and therefore don’t naturally have access to the data.)

Ok, well, thanks.

I guess I disagree with the outcome — users do, in fact, understand that the Stan model program is separate from the specific dataset, and forcing those two things together as “build” breaks that mental model. The Stan program is 100% independent of the values (but not the form) of the data, which is why stanc doesn’t need it.

In practice, with caching, this isn’t that big a deal in many use cases (but not all: the aforementioned Jupyter %stan magic just makes more sense with separate compilation and instantiation with data).

Moreover, I think it is not always a good pattern for a wrapper to ignore the implementation details, since at the very least it’s a good idea for users to understand what is going on in case things go wrong…

But this ship appears to have sailed.

Just chiming in to point out that the {model + data} framework is also a bit of a misfit in the context of SBC (simulation-based calibration), which in turn is becoming more common/recommended for workflows.

CmdStanPy most certainly accepts filenames for data, and does provide its own data-writing function - it’s in both the doc and the very first “hello world” example: https://mc-stan.org/cmdstanpy/hello_world.html#run-the-hmc-nuts-sampler

The data can be specified either as a filepath or a Python dict; in this example, we use the example datafile bernoulli.data.json:

here’s the doc for the sample inference method, and the methods for optimization, variational inference, and standalone generated quantities do the same:

data (Optional [ Union [ Dict, str ]] ) – Values for all data variables in the model,
specified either as a dictionary with entries matching the data variables,
or as the path of a data file in JSON or Rdump format.
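e.g. a minimal run along the lines of that example:

from cmdstanpy import CmdStanModel

# compile the model (or reuse the cached executable),
# then pass the data as a file path
model = CmdStanModel(stan_file="bernoulli.stan")
fit = model.sample(data="bernoulli.data.json",
                   iter_warmup=1000, iter_sampling=1000)
print(fit.summary())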

if the goal is to understand Stan and use existing models, then perhaps it would be better to only teach rstanarm and/or brms. If you’re teaching Stan in order to help people develop their own models, then they will most likely struggle with compilation, and getting a model to compile will be a nice milestone.

agreed, we need much more in the way of documentation. there is a neglected subfolder of jupyter notebooks. another way to go would be to translate existing case studies to CmdStanPy and I would be more than happy to help folks do this.

thanks for the wonderful and extensive feedback!

2 Likes

+100 as the emoji goes

@robertgrant : would love to hear how the installation process went. we know it’s a huge pain point - any insights would be much appreciated

also, does the choice of Python or R correspond to any difference in overall level of programming ability, specifically the ability to manage input and output data files, organize work, and write general-purpose functions? or, among both R and Python users, are there lots of students who want to get into this data science thing without being able to work from the shell and navigate the filesystem?

2 Likes

Those examples seem to me to assume that bernoulli.data.json already exists.

I can’t comment on the internal data-file-writing functionality as it’s not in the API. It appears to call Python’s json module, because it throws an error on numpy arrays in just the same way. It would be nice to have it available to all users. (I don’t know if “expose” is strictly the right term for that.) And if it can’t deal with numpy, then it would be handy at least to give users some advice, and better still to provide a simple writing function that takes everything via strings. (I find myself pining for C++ here, and that does not happen often.)

Absolutely fine in all interfaces from my experience, which is either Mac or Linux.

I don’t have enough data to say. It’s an important question though. What I can say with some confidence is that, across various training, coaching and consulting jobs, my prejudices about R vs Python users are constantly proven wrong.

“…you guys think I don’t give you straight answers. You have to talk to these statisticians. They will not give you a direct answer on anything. We don’t know. See, when did you start with the first two options, it’s either straight up and straight down, right? Or a total V, or maybe it’s up with a plateau and we’re somewhere on the plateau. They don’t know.” — 5 April 2020, rev.com/blog/transcripts/governor-andrew-cuomo-covid-19-briefing-april-5-hospitalizations-dropping-in-ny, 20:35.

3 Likes

this is a good idea. filed an issue: “need utility function stan_json” (stan-dev/cmdstanpy #410).

if you could please add an example of data that causes json errors, that would be great, as the existing utility function jsondump has logic that’s supposed to handle this.

3 Likes

Thanks for prioritising this @mitzimorris. I see that write_stan_json is now in the 0.9.77 release. I don’t have the exact setup that caused problems anymore; I’d have to spin up another server with it, so I will test it out for sure in the fullness of time.
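From the release notes, usage looks like it should be as simple as this (untested by me so far, and the import location may differ by version):

import numpy as np
from cmdstanpy.utils import write_stan_json

write_stan_json("bernoulli.data.json",
                {"N": 10, "y": np.zeros(10, dtype=int)})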

1 Like

I thought we had a solution to this: we use cmdstan’s argument names. What’s left to do? Are there changes that need to be made in pystan?

rstan does its own thing, at least as of this summer.

the names used by the CmdStanX wrappers for the number of warmup and sampling iterations are iter_warmup and iter_sampling, respectively.

this was to make it clear that these are numbers of iterations. because it’s possible to thin the reported draws, the number of draws returned is not necessarily the same as the number of iterations; folks sometimes got confused by what was meant by num_samples, and iter_sampling is (perhaps) clearer.

the difference between rstan and cmdstan et al remains: in rstan, iter is the total number of iterations and warmup is the number of those iterations used for warmup, so the number of sampling iterations is iter - warmup; if warmup isn’t specified, the warmup and sampling iterations each default to iter / 2.