Various observations on rstan, cmdstanr, pystan and cmdstanpy after teaching with all in parallel

Thanks for the observations, this is really valuable.

The following comments are not meant to say that we should not learn from these problems.

We probably would want to use the same keywords in all interfaces (now pystan and cmdstanpy differ a bit)

In the simple case it sounds easy:

(reading)

import json
with open("path/to/file.json") as f:
    data_dict = json.load(f)

(writing)

import json
with open("path/to/file.json", "w") as f:
    json.dump(data_dict, f)

But this of course will fail with numpy arrays and such. Currently both pystan and cmdstanpy have similar functions in the background (and they handle numpy arrays too), so there could be a user-facing function that exposes them.
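For illustration, a minimal sketch of what such a user-facing function could look like, using plain json plus a small custom encoder (the helper and class names here are made up for the example, not the actual pystan/cmdstanpy internals):

import json
import numpy as np

class StanJSONEncoder(json.JSONEncoder):
    # hypothetical encoder: convert numpy values to plain Python before serializing
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, (np.integer, np.floating)):
            return obj.item()
        return super().default(obj)

def write_stan_data(path, data_dict):
    # hypothetical user-facing helper
    with open(path, "w") as f:
        json.dump(data_dict, f, cls=StanJSONEncoder)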

… I feel your pain, still I like the stan name :D

This is kind of a problem in Jupyter, which means it should be fixed there.
One option would be to always call nest_asyncio when using pystan, but I see many potential bugs in this solution.

This was selected after long conversation (model + data → posterior → sample posterior)

pystan (httpstan, actually) caches the model, so there is no real recompilation, just reading the model from the cache. So it is ok to call the build command multiple times in a for loop.
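As a minimal sketch of what that pattern looks like with the pystan 3 API (the model and datasets here are made up for the example):

import stan

program_code = """
data { int<lower=0> N; int<lower=0, upper=1> y[N]; }
parameters { real<lower=0, upper=1> theta; }
model { theta ~ beta(1, 1); y ~ bernoulli(theta); }
"""

datasets = [{"N": 3, "y": [0, 1, 0]}, {"N": 4, "y": [1, 1, 0, 1]}]
for data in datasets:
    # only the first call compiles; later calls read the model from the httpstan cache
    posterior = stan.build(program_code, data=data)
    fit = posterior.sample(num_chains=4, num_samples=1000)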

At some point we probably want to recheck all our IO in cmdstanpy. Tbh I’m not sure what the current solution for this is.


Just a thought on this – It seems possible to effectively alias pystan and stan in one installation in Python. The idea being you create a new package directory in the repo:

stan/
     __init__.py
     common.py
     fit.py
     model.py
     plugins.py
pystan/
     __init__.py
pyproject.toml
setup.cfg

And within the pystan/__init__.py file, you have:

from stan import *

Then everything in the stan namespace is copied over to the pystan namespace as well. The only additional thing you need to do is tell the build instructions to install both the stan and pystan packages, which is accomplished in the pyproject.toml:

packages = [
    { include = "stan" },
    { include = "pystan" },
]

I tested this out locally and it seems to work without issue. Could alleviate some of the issues that seem to come up when people install pystan and can’t import it with the same name. (Additionally, it looks like @ariddell has the stan package on pypi, so in theory you could have the build instructions push the package to both the pystan and stan pypi repositories.)

I know the idea of having two ways to install/import the same package is kind of insane, but it’s a thought.

To continue hijacking this thread.

One option would be to have pystan do some introspection on import to check whether or not it is being run within a Jupyter notebook. For example, putting the following code snippet in stan/__init__.py would run nest-asyncio whenever the package is imported in a notebook:

try:
    # get_ipython() only exists inside an IPython/Jupyter session
    ipython_shell = get_ipython().__class__.__name__
    if ipython_shell == 'ZMQInteractiveShell':
        # ZMQInteractiveShell is the Jupyter kernel shell; plain terminal IPython uses a different class
        import nest_asyncio
        nest_asyncio.apply()
except NameError:
    # not running under IPython at all, so nothing to do
    pass

I tried it out locally and it worked fine – in practice, if people already need to invoke nest-asyncio when running in Jupyter, this could be a fine band-aid to use. It really doesn’t look like a proper fix is forthcoming any time soon.

Thanks for the ideas.

That would be really helpful for anyone who isn’t highly experienced with Python. I give students the data in CSV, .rds for R and .json for Python, but the real world isn’t so kind! And I know most researchers don’t want to think about intermediate data storage formats, they just want to hit the button and get Bayesian!

That’s great – I didn’t know that. I still prefer compiling a sampler that contains a likelihood and a prior, and after compilation will take data, inits, seed, etc as input, but I can live with it.


I also find the current pystan behavior absolutely confusing. I want to build the model, and then maybe set the data that induce the posterior, but I don’t see how you can build a posterior.

I completely agree.

Could you point to this long conversation? I would like to be convinced that this makes any sense at all. (I have commented elsewhere that it is a pain point for upgrading the jupyter/stan %magic functions which operate on a code block for compilation, and therefore don’t naturally have access to the data.)

Ok, well, thanks.

I guess I disagree with the outcome — users do, in fact, understand that the stan model program is separate from the specific data-set, and forcing those two things together as “build” breaks that mental model. The stan program is 100% independent of the values (but not the form) of the data, which is why stanc doesn’t need it.

In practice, with caching, this isn’t that big a deal in many use cases (but not all: the aforementioned Jupyter %stan magic just makes more sense with separate compilation and instantiation with data).

Moreover, I think it is not always a good pattern for a wrapper to ignore the implementation details, since at the very least it’s a good idea for users to understand what is going on in case things go wrong…

But this ship appears to have sailed.

Just chiming in to point out that the {model + data} framework also kinda misfits in the context of SBC, which in turn is becoming more common/recommended for workflows.

CmdStanPy most certainly accepts filenames for data, and does provide its own data-writing function - it’s in both the doc and the very first “hello world” example: “Hello, World” — CmdStanPy 0.9.76 documentation

The data can be specified either as a filepath or a Python dict; in this example, we use the example datafile bernoulli.data.json:

here’s the doc for the sample inference method; the methods for optimization, variational inference, and standalone generated quantities do the same:

data (Optional [ Union [ Dict, str ]] ) – Values for all data variables in the model,
specified either as a dictionary with entries matching the data variables,
or as the path of a data file in JSON or Rdump format.
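For illustration, a minimal sketch of both call styles (the file and dict contents are taken from the bernoulli "Hello, World" example; treat the exact values as assumptions):

from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="bernoulli.stan")

# data as the path of a JSON data file
fit_from_file = model.sample(data="bernoulli.data.json")

# data as a dictionary with entries matching the data variables
fit_from_dict = model.sample(data={"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]})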

if the goal is to understand Stan and use existing models, then perhaps it would be better to only teach rstanarm and/or brms. If you’re teaching Stan in order to help people develop their own models, then they will most likely struggle with compilation, and getting a model to compile will be a nice milestone.

agreed, we need much more in the way of documentation. there is a neglected subfolder of jupyter notebooks. another way to go would be to translate existing case studies to CmdStanPy and I would be more than happy to help folks do this.

thanks for the wonderful and extensive feedback!


+100 as the emoji goes

@robertgrant : would love to hear how the installation process went. we know it’s a huge pain point - any insights would be much appreciated

also, does the choice of Python or R correspond to any difference in overall level of programming ability, specifically the ability to manage input and output data files, organize work, and write general-purpose functions? or, among both R and Python users, are there lots of students who want to get into this data science thing without being able to work from the shell and navigate the filesystem?


Those examples seem to me to assume that bernoulli.data.json already exists.

I can’t comment on the internal data-file-writing functionality as it’s not in the API. It appears to call Python’s json module, because it throws an error on numpy arrays in just the same way. It would be nice to have it available to all users. (I don’t know if “expose” is strictly the right term for that.) And if it can’t deal with numpy, then it would be handy to at least give users some advice, and better still to provide a simple writing function that takes everything via strings. (I find myself pining for C++ here, and that does not happen often.)

Absolutely fine in all interfaces in my experience, which is all on Mac or Linux.

I don’t have enough data to say. It’s an important question though. What I can say with some confidence is that, across various training, coaching and consulting jobs, my prejudices about R vs Python users are constantly proven wrong.

“…you guys think I don’t give you straight answers. You have to talk to these statisticians. They will not give you a direct answer on anything. We don’t know. See, when did you start with the first two options, it’s either straight up and straight down, right? Or a total V, or maybe it’s up with a plateau and we’re somewhere on the plateau. They don’t know.” — 5 April 2020, Governor Andrew Cuomo COVID-19 Briefing April 5: Hospitalizations Dropping in NY - Rev, 20:35.


this is a good idea. Filed issue: need utility function "stan_json" · Issue #410 · stan-dev/cmdstanpy · GitHub

if you could please add an example of data that causes json errors, that would be great, as the existing utility function jsondump has logic that’s supposed to handle this.


Thanks for prioritising this @mitzimorris. I see that write_stan_json is now in the 0.9.77 release. I don’t have the exact setup that caused problems anymore; I’d have to spin up another server with it, so I will test it out for sure in the fullness of time.
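For anyone following along, a minimal usage sketch of the new helper (I’m assuming it is importable from the top-level cmdstanpy namespace; it may instead live in cmdstanpy.utils):

import numpy as np
from cmdstanpy import write_stan_json

# numpy arrays should end up as plain JSON lists in the written file
data = {"N": 3, "y": np.array([0, 1, 0])}
write_stan_json("bernoulli.data.json", data)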


I thought we had a solution to this: we use cmdstan’s argument names. What’s left to do? Are there changes that need to be made in pystan?

rstan does its own thing, at least as of this summer.

names for CmdStanX wrappers for the number of warmup and sampling iterations are iter_warmup and iter_sampling, respectively.

this was to make it clear that these are counts of iterations. because it’s possible to thin the reported draws, the number of draws returned is not necessarily the same as the number of iterations; folks sometimes got confused by what was meant by num_samples, and iter_sampling is (perhaps) clearer.

the difference between rstan and cmdstan et al. remains: in rstan, iter is the total number of iterations and warmup is the portion of that total spent on warmup, so the number of sampling iterations is iter - warmup; if warmup isn’t specified, the warmup and sampling iterations are each iter / 2 (see the sketch below).
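For example, a sketch of the two conventions side by side (assuming a compiled CmdStanPy model object named model; the rstan equivalents are noted in the comments):

# CmdStanPy / CmdStanR convention: warmup and sampling iterations are given separately
fit = model.sample(iter_warmup=1000, iter_sampling=1000)

# the same run in rstan would be iter = 2000, warmup = 1000;
# if warmup were omitted, rstan would default it to iter / 2 = 1000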

This may sound pedantic, but I mention it just because you say the rationale for iter_* is that it’s the number of iterations… it’s not. It’s the number of accepted proposals. There are more iterations (not counting leapfrogging) that lead to proposals that are rejected. It’s like a while loop instead of a for loop.


No, this is not true. See:

library(rstan)
#> Loading required package: StanHeaders
#> rstan (Version 2.26.0.9000, GitRev: 2e1f913d3ca3)
#> For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores()).
#> To avoid recompilation of unchanged Stan programs, we recommend calling
#> rstan_options(auto_write = TRUE)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:rstan':
#> 
#>     extract
res <- stan(
  model_code = 'data {} parameters{real x;} model{x ~ std_normal();}',
  chains = 1,
  control = list(adapt_delta = 0.5)
)
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 9e-06 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 0.007 seconds (Warm-up)
#> Chain 1:                0.006 seconds (Sampling)
#> Chain 1:                0.013 seconds (Total)
#> Chain 1:
#> Warning: There were 78 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems
#> Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
#> Running the chains for more iterations may help. See
#> https://mc-stan.org/misc/warnings.html#bulk-ess
#> Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
#> Running the chains for more iterations may help. See
#> https://mc-stan.org/misc/warnings.html#tail-ess
samples <- rstan::extract(res, pars = 'x', permuted = FALSE)

# if these were all 'accepted' then this would be a vector of length 1000,
# as each 'sample' would be an 'accepted' proposal (which would
# necessarily differ from the original point), and thus
samples %>%
  magrittr::extract(1 : 1000, ,) %>% 
  round(digits = 8) %>% # floating point error will do funny things otherwise
  unique() %>% 
  length()
#> [1] 230

Created on 2021-10-07 by the reprex package (v2.0.1)

adapt_delta is the target acceptance rate, so you can make unique(samples) differ arbitrarily from 1000 (well, between 1 and 1000) by decreasing its value.
