CmdStanPy - ready for beta testing!

CmdStanPy is a lightweight, BSD-licensed wrapper for CmdStan, based on Marmaduke Woodman’s excellent PyCmdStan.

jupyter notebook here:

branch is:

there’s an open PR - happy to get feedback on the current implementation - coming to Python from Java and C++ means that it’s probably not very Pythonic, for better or worse ;-)


I just ran the bernoulli model following the tutorial notebook. I usually use CmdStan now, but may need to go back to PyStan for a specific job. I’m not sure what the original goal of CmdStanPy was, but to me it would be useful to have Python organize/process the data (easy with PyStan), run Stan inference (CmdStan allows better control), and have it back in Python for plotting, diagnostics, etc. (it takes some I/O to read it back from CmdStan output, but that’s kind of the easy part). CmdStanPy looks like it could be the best of both worlds.

So a few comments/questions:

  • is it possible to use a CmdStan-compiled model, and are all CmdStan arguments accessible (e.g. engaged=0)?

  • I see that there are temporary .csv files; is there a way of saving (multiple chains) into the data folder? Alternatively, since the CmdStan files are there, is it possible to extract values like the mass matrix within Python?

  • I haven’t used it in a while, but I think PyStan had an extract option that returned a multi-dimensional numpy.ndarray instead of a pandas.DataFrame. Not a big deal, anyway.

yes - construct a Model object, specifying both the stan_file and exe_file arguments

damn! the answer to that is supposed to be yes, but looks like engaged got overlooked. will file issue…

filing two more feature requests for you -

  • yes, there should be a function which lets you specify a directory into which the per-chain csv files are saved
  • yes, it would be possible to write a function to extract the step size and mass matrix from a given chain.
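Until such a function exists, the adapted values can be scraped from the “#”-prefixed comment lines of a CmdStan output CSV. A minimal sketch (assuming the usual “# Step size = …” / “# Diagonal elements of inverse mass matrix:” comment format; the function name is made up for illustration):

```python
def parse_adaptation(csv_lines):
    """Pull the adapted step size and diagonal inverse mass matrix
    out of the '#'-prefixed comment lines of a CmdStan output CSV."""
    step_size, inv_metric = None, None
    for i, line in enumerate(csv_lines):
        if line.startswith("# Step size"):
            step_size = float(line.split("=")[1])
        elif "inverse mass matrix" in line:
            # the diagonal appears on the next comment line, comma-separated
            inv_metric = [float(v) for v in csv_lines[i + 1].lstrip("# ").split(",")]
    return step_size, inv_metric

# example: an adaptation block as it appears in bernoulli output
demo = [
    "# Adaptation terminated",
    "# Step size = 0.809818",
    "# Diagonal elements of inverse mass matrix:",
    "# 0.529567",
]
print(parse_adaptation(demo))  # (0.809818, [0.529567])
```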

The reason that CmdStanPy’s extract function returns a pandas.DataFrame is so that you can specify a list of parameter names and get back a view on the DataFrame - at least that’s the intent - please test and try to break it!
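As a quick illustration of the intended usage (a toy DataFrame stands in for real sampler output here; the column names are made up):

```python
import pandas as pd

# toy stand-in for extracted sampler output; column names are illustrative
df = pd.DataFrame({
    "lp__": [-7.3, -7.1, -7.5],
    "theta": [0.21, 0.25, 0.19],
    "mu": [1.1, 0.9, 1.0],
})

# selecting with a list of parameter names returns a sub-DataFrame
# restricted to just those columns
params = df[["theta", "mu"]]
print(params.shape)  # (3, 2)
```

(note that plain `df[["theta", "mu"]]` in pandas actually returns a copy rather than a view, which may be worth keeping in mind when testing that intent.)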

1 Like

Thanks. I will soon be running some serious analyses and maybe I will have some more constructive feedback. Some things like array vs. data frame are mostly convenience anyway, and this looks great already.

Hey, if it is possible to save csv files, then they can be read with ArviZ’s az.from_cmdstan.

I think I could implement from_cmdstanpy at some point (after we have a good idea what will be the default output).

the PosteriorSample object has attributes column_names and sample - the latter is the ndarray containing the entire sample, stored column-major (Fortran order) as (draws × chains × parameters), and the former holds the column labels.
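In other words (a toy example with made-up dimensions, assuming the (draws × chains × parameters) layout described above):

```python
import numpy as np

draws, chains, n_params = 100, 4, 3
# column-major (Fortran-order) array, matching the layout described above
sample = np.zeros((draws, chains, n_params), order="F")
column_names = ("lp__", "theta[1]", "theta[2]")  # illustrative labels

# all draws of one parameter, pooled across chains
theta1 = sample[:, :, column_names.index("theta[1]")].ravel()
print(theta1.shape)  # (400,)
```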

for ArviZ, can you use the convert_to_inference_data function?
https://arviz-devs.github.io/arviz/generated/arviz.convert_to_inference_data.html#arviz.convert_to_inference_data

We first need to implement from_cmdstanpy, or pass it a dict.

I think I can implement from_cmdstanpy with PosteriorSample.

This enables one to call az.plot_trace(fit).
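ArviZ’s dict-based converters expect arrays shaped (chain, draw, …), so the (draws × chains × parameters) sample described above needs its first two axes transposed. A sketch with toy data (the ArviZ call itself is shown as a comment, assuming arviz is installed):

```python
import numpy as np

# toy sample in the (draws, chains, parameters) layout described above
draws, chains = 50, 2
sample = np.random.default_rng(0).normal(size=(draws, chains, 2))
column_names = ("lp__", "theta")

# ArviZ's dict converters expect (chain, draw, ...) arrays,
# so transpose the first two axes for each parameter
posterior = {
    name: sample[:, :, i].T  # -> (chains, draws)
    for i, name in enumerate(column_names)
}
print(posterior["theta"].shape)  # (2, 50)

# then, assuming arviz is installed:
#   import arviz as az
#   idata = az.convert_to_inference_data(posterior)
#   az.plot_trace(idata)
```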


filed a bunch of issues -




the CmdStanPy equivalent of CmdStan’s engaged argument was renamed to do_adaptation. renamed by me, and perhaps this was a bad decision. in any case, do_adaptation is a bool: do_adaptation=True corresponds to engaged=1, and do_adaptation=False to engaged=0.

I’m adding args for init_buffer, window, and term_buffer which correspond to the initial fast adaptation interval, slow adaptation interval, and final fast adaptation interval.

I agree that do_adaptation is more informative than engaged. In RStan this is called adapt_engaged. It seems it would be useful to have a bit more coordination between interfaces and a bit more discussion in the future about the option (and function) names. Related to that, there is this recent nice blog post https://djnavarro.net/post/be-some-other-name/


agreed that adapt_engaged is best, and since it’s early days, I will change it!

we are serving two masters: the devil we know and the devil we don’t. the devil we know are folks using Stan to do HMC sampling via one of the existing interfaces. for them, we should keep the names as similar as possible. the devil we don’t know are all the folks using Python but not using Stan, PyStan, or PyMC3, but who should be. for them, we want argument names that make as much sense as possible. if the existing names make as much sense as possible, no problem - do they?

to make this concrete, a spreadsheet of what each interface calls all the controls on the HMC sampler would help. I’ll try and get one together - or maybe one exists, in which case, please point me to it.

pystan 3 uses cmdstan argument/parameter names and defaults. I went so far as to write a script to parse the output of cmdstan’s help in order to make sure they match exactly.

great - where’s the script?

I agree, and we should be ready to change the argument names if we realize that they don’t make sense. In particular, the variational inference is labeled as experimental, so no one should rely on it being stable, even in option names…

Yes, that would be helpful, and even if it would not change anything now, it would help when adding new interfaces, new options, or when planning Stan 3.

Here it is:

Here’s a file with the output (likely from CmdStan 2.17): https://github.com/stan-dev/httpstan/blob/master/httpstan/cmdstan-help-all.json


Yes, this is exactly the design and intention.

Yes.

Yes, but they’re going to be reorganized and flattened.

Yes.

It will be.

Please don’t use an engaged argument. If the number of warmup iterations is zero, there will be no adaptation. I don’t think we need a no-adaptation option that also throws away some set of initial burn-in draws. The only time we won’t want to adapt is when we’re restarting.

I think the skeleton commands in the CmdStan doc are what you want. That’s the set of options you have available from CmdStan.

The issue here is that for large models some of the chains may not adapt the sampler parameters equally well, so the workaround I found was to run long burn-in chains and manually select a step size and mass matrix from those. The following run would then use them without any further adaptation.
I understand that for most cases adaptation is required, but the option should be accessible in case something like this is needed (or the possibly rare case where the model is parameterized for a unit mass matrix).

What I was suggesting is that we don’t need an adapt-engaged argument. If the number of warmup iterations is set to zero, there won’t be any adaptation.

You didn’t say so explicitly, but it sounds like you want to run without adaptation of the mass matrix or step size, but still throw away some initial sequence of draws. That is, you want a BUGS- or JAGS-style “burn in” option. My inclination would be to not support that in our interface and let users prune draws after the fact manually. The downside is that it would require more memory to save everything and then throw the first half out.
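Pruning after the fact is a one-liner on the array layout described earlier (toy example with made-up dimensions):

```python
import numpy as np

# toy sample: 1000 draws, 4 chains, 3 parameters
sample = np.arange(1000 * 4 * 3, dtype=float).reshape(1000, 4, 3)

# drop the first 500 draws of every chain as "burn-in"
kept = sample[500:]
print(kept.shape)  # (500, 4, 3)
```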

One approach may be to write separate functions. That’s how everything is broken out at the C++ level. There’s a Euclidean diagonal metric version of NUTS with adaptation and a completely separate function without, and they both allow you to throw away some initial set of draws. We may go that way if we really do need to support this edge case. I’d really like to avoid the kind of tangled set of argument dependencies we’d get with an adapt-engaged argument.

Thanks for commenting, by the way. We really appreciate hearing about how people are using Stan. It’s super helpful for design going forward.

We never supported the full cross-product of possibilities for sampling. We do allow unit metrics with adapted step sizes, but we don’t allow other constant mass matrices with adapted step sizes.

I’ve also been lobbying for getting rid of the fixed unit metrics in favor of just turning adaptation off. That will lose the unit metric plus adapt step size option we currently provide.

Right, that makes sense, but I think warmup=0 wouldn’t work (at least in previous versions of PyStan, I believe); not sure why, but it wouldn’t actually sample (maybe the step size/mass matrix were too far off, but I think it was something else that broke). That (plus not being able to supply a step size and mass matrix) is the reason I started using CmdStan, in fact.

Not really. I just would like to be able to turn off adaptation and provide a fixed step size and mass matrix. I don’t think you have to support the entire cross-product, but it looked like some things were only possible with CmdStan.
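For reference, that is roughly what the raw CmdStan invocation looks like (file names here are placeholders, and the exact argument spellings may differ between CmdStan versions):

```shell
# run with no warmup/adaptation, supplying a fixed step size and a
# diagonal inverse mass matrix saved from an earlier run
./bernoulli sample num_samples=1000 num_warmup=0 \
    adapt engaged=0 \
    algorithm=hmc metric=diag_e metric_file=inv_metric.data.R stepsize=0.8 \
    data file=bernoulli.data.R \
    output file=output.csv
```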

I think it’s important to point out the importance of adaptation (like you did) and not make it “too easy” to run Stan without it (as opposed to MH, where there is no conceptual difference between the burn-in and sampling phases if the scale is tuned throughout). Most times CmdStan will do the job, but R/Python data frames are a lot more convenient for data processing than shell scripts, so I guess most people, myself included, will be happy to keep bash to a minimum.