CmdStanPy - ready for beta testing!

mitzimorris · May 11, 2019, 1:15pm

CmdStanPy is a lightweight wrapper to CmdStan with a BSD license which is based on Mamaduke Woodman’s excellent PyCmdStan.

juypyter notebook here:

github.com

stan-dev/cmdstanpy/blob/refactor/api-per-func-spec/cmdstanpy_tutorial.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CmdStanPy Tutorial\n",
    "\n",
    "\n",
    "### Workflow Outline\n",
    "\n",
    "Given a dataset and a model specification written as a Stan program, the CmdStanPy workflow is:\n",
    "\n",
    "#### Assemble input data as either:\n",
    "  + a Python `dict` object consisting of key-value pairs where the key corresponds\n",
    " to Stan data variables and the value is of the correct type and shape.\n",
    "  + an existing data file on disk in either JSON or Rdump format.\n",
    "  + CmdStanPy provides methods for converting former to latter.\n",
    "\n",
    "\n",

This file has been truncated. show original

branch is:

there’s an open PR - happy to get feedback on the current implementation - coming to Python from Java and C++ means that it’s probably not very Pythonic, for better or worse ;-)

caesoma · May 16, 2019, 6:33pm

I just ran the bernoulli model following the tutorial notebook. I usually use CmdStan now, but may need to go back to PyStan for a specific job. I’m not sure what the original goal of CmdStanPy was, but to me it would be useful to have Python organize/process the data (easy with Pystan), run Stan inference (CmdStan allows better control), and have it back in Python for plotting, diagnostics, etc (takes some io to read it back from CmdStan output, but that’s kind of the easy part). CmdStanPy looks like it could be the best of both worlds.

So a few comments/questions:

is it possible to use a CmdStan-compiled model, and are all CmdStan arguments accessible (e.g engaged=0)?
I see that there are temporary .csv files, is there a way of saving (multiple chains) into the data folder; alternatively, since the CmdStan files are there, is it possible to extract values like the mass matrix within Python?
I haven’t used it in a while, but I think PyStan had an extract option that returned a multi-dimensional numpy.ndarray instead of a pandas.DataFrame. Not a big deal, anyway.

mitzimorris · May 16, 2019, 7:12pm

yes - construct a Model object specifying both arguments stan_file and exe_file

damn! the answer to that is supposed to be yes, but looks like engaged got overlooked. will file issue…

filing two more features requests for you -

yes, there should be a function which lets you specify a directory into which the per-chain csv files are saves
yes, it would be possible to write a function to extract the step size and mass matrix from a given chain.

The reason that CmdStanPy’s extract function returns a pandas.DataFrame is so that you can specify a list of parameter names and it will return a view on the DataFrame - at least that’s the intent - please test to break!

caesoma · May 16, 2019, 7:24pm

Thanks. I will soon be running some serious analyses and maybe I will have some more constructive feedback, some things like array vs data frame are mostly convenience, anyway, and this looks great, already.

ahartikainen · May 16, 2019, 7:31pm

Hey, if it is possible to save csv files, then they can be read with ArviZ az.from_cmdstan.

I think I could implement from_cmdstanpy at some point (after we have a good idea what will be the default output).

mitzimorris · May 16, 2019, 8:08pm

the PosteriorSample object has attributes column_names and sample - the latter is the ndarray that contains the entire sample stored column_major (F format) as ( draws X chains X parameters ) and the former are the column labels.

for Arviz, can you use the convert_to_inference_data function?
https://arviz-devs.github.io/arviz/generated/arviz.convert_to_inference_data.html#arviz.convert_to_inference_data

ahartikainen · May 16, 2019, 8:14pm

We need to first implement from_cmdstanpy or input a dict in it.

I think I can implement from_cmdstanpy with PosteriorSample.

This enables one to call az.plot_trace(fit).

mitzimorris · May 16, 2019, 8:26pm

filed a bunch of issues -

mitzimorris · May 17, 2019, 12:21pm

the CmdStanPy equivalent of CmdStan’s engaged argument was renamed to do_adaptation. renamed by me and perhaps this was a bad decision. in any case, do_adaptation is a bool and so do_adaptation = True == engaged=1, and False == 0.

I’m adding args for init_buffer, window, and term_buffer which correspond to the initial fast adaptation interval, slow adaptation interval, and final fast adaptation interval.

avehtari · May 17, 2019, 12:48pm

I agree that do_adaptation is more informative than engage. In RStan this is called adapt_engaged. It seems it would be useful to have a bit more coordination between interfaces and bit more discussion in the future about the option (and function names). Related to that, there is this recent nice blog post https://djnavarro.net/post/be-some-other-name/

mitzimorris · May 17, 2019, 1:22pm

agreed that adapt_engaged is best, and since it’s early days, I will change it!

we are serving two masters: the devil we know and the devil we don’t. the devil we know are folks using Stan to do HMC sampling via one of the existing interfaces. for them, we should keep the names as similar as possible. the devil we don’t know are all the folks using Python but not using Stan, PyStan, or PyMC3, but who should be. for them, we want argument names that make as much sense as possible. if the existing names make as much sense as possible, no problem - do they?

to make this concrete, a spreadsheet of what each interface calls all the controls on the HMC sampler would help. I’ll try and get one together - or maybe one exists, in which case, please point me to it.

ariddell · May 17, 2019, 1:36pm

pystan 3 uses cmdstan argument/parameter names and defaults. I went so
far as to write a script to parse the output of cmdstan’s help in order
to make sure they match exactly.

mitzimorris · May 17, 2019, 1:52pm

great - where’s the script?

avehtari · May 17, 2019, 2:23pm

I agree, and we should be ready to change the argument names also if we realize that they are not making sense. Especially, the variational inference is labeled as experimental, so no-one should rely it to be stable even in option names…

Yes, that would be helpful and even if would not change something now, it would help when adding new interfaces, new options, or when planning Stan 3.

ariddell · May 17, 2019, 4:25pm

Here it is:

github.com

stan-dev/httpstan/blob/master/scripts/parse_cmdstan_help.py

"""Parse CmdStan 'help-all' output into machine readable JSON."""
import argparse
import collections
import json
import re

METHODS = {"sample", "optimize", "variational", "diagnose"}
OPTIMIZE_ALGORITHMS = {"bfgs", "lbfgs", "newton"}

parser = argparse.ArgumentParser(description="Parse CmdStan 'help-all' text.")
parser.add_argument("input_filename", help="Filename containing 'help-all' output.")


def _split_cmdstan_help(text):
    """Break CmdStan help text into logical parts."""
    parts = {"method": collections.defaultdict(dict), "output": ""}
    for chunk in re.split(r"\n  (?=\w+\n)", text):
        if chunk.lstrip().startswith("method"):
            for subchunk in re.split(r"\n    (?=\w+\n)", chunk):
                header = subchunk.lstrip().split()[0]

This file has been truncated. show original

Here’s a file with the output (likely from CmdStan 2.17): https://github.com/stan-dev/httpstan/blob/master/httpstan/cmdstan-help-all.json

Bob_Carpenter · May 20, 2019, 7:50pm

Yes, this is exactly the design and intention.

Yes.

Yes, but they’re going to be reorganized and flattened.

Yes.

It will be.

Please don’t use an engaged argument. If the number of warmup iterations is zero, there will be no adaptation. I don’t think we need a no-adaptation option that also throws away some set of initial burn-in draws. The only time we won’t want to adapt is when we’re restarting.

I think the skeleton commands in the CmdStan doc are what you want. That’s the set of options you have available from CmdStan.

caesoma · May 20, 2019, 8:05pm

The issue here is that for large models some of the chains may not optimize the sampler options equally well, so the option I found was to run long burn-in chains and manually select a step size and mass matrix from those. The following run would use them without any further optimization optimization.
I understand that for most cases adaptation is required, but the option should be accessible in case something like this is needed (or the possibly rare case where the model is parameterized for a unit mass matrix).

Bob_Carpenter · May 20, 2019, 8:19pm

What I was suggesting is that we don’t need an adapt-engaged argument. If the number of warmup iterations is set to zero, there won’t be any adaptation.

You didn’t say so explicitly, but it sounds like you want to run without adaptation of warmup or step size, but still throw away some initial sequence of draws. That is, you want a BUGS or JAGS “burn in” option. My inclination would be to not support that in our interface and let users prune draws after the fact manually. The downside is that it’d require more memory for you to save everyting then throw the first half out.

One approach may be to write separate functions. That’s how everything is broken out at the C++ level. There’s a Euclidean diagonal metric version of NUTS with adaptation and a completely separate function without, and they both allow you to throw away some initial set of draws. We may go that way if we really do need to support this edge case. I’d really like to avoid the kind of tangled set of argument dependencies we’d get with an adapt-engaged argument.

Bob_Carpenter · May 20, 2019, 8:23pm

Thanks for commenting, by the way. We really appreciate hearing about how people are using Stan. It’s super helpful for design going forward.

We never supported the full cross-product of possibilities for sampling. We do allow unit metrics with adapted step sizes, but we don’t allow other constant mass matrices with adapted step sizes.

I’ve also been lobbying for getting rid of the fixed unit metrics in favor of just turning adaptation off. That will lose the unit metric plus adapt step size option we currently provide.

caesoma · May 20, 2019, 9:09pm

Right, that makes sense, but I think warmup=0 wouldn’t work (at least in previous versions of I think PyStan), not sure why but it wouldn’t actually sample (maybe the step/mass matrix were too far off, but I think it was something else that broke). That (plus not being able to supply a step size and mass matrix) is the reason I started using CmdStan, in fact.

Not really. I just would like to be able to turn off adaptation and provide a fixed step size and mass matrix. I don’t think you have to support the entire cross-product, but it looked like some things were only possible with CmdStan.

I think it’s important to point out the importance of adaptation (like you did) and not make it “too easy” to run Stan without it (as opposed to MH, for there is no conceptual difference between burn-in and sampling phase if scale is tuned throughout). Most times CmdStan will do the job, but R/Python data frames are a lot more convenient for data processing than shell scripts, so I guess most people, myself included, will be happy to restrict bash to a minimum.

Topic		Replies	Views
Various observations on rstan, cmdstanr, pystan and cmdstanpy after teaching with all in parallel Interfaces pystan , rstan , cmdstanr , cmdstanpy	26	3560	October 12, 2021
Prototype: Simplified CLI to Stan General cmdstan	28	1558	September 17, 2020
CmdStanPy 1.0.0rc1 Interfaces	1	889	October 25, 2021
How to use your own stan model -- full tutorial Modeling	15	3838	December 12, 2019
CmdStan slower than PyStan? Interfaces cmdstan , pystan	26	2605	April 17, 2019

Related topics