New to PyStan, looking for pointers for where to go

Hi,

I’m a software developer, not a statistician or data scientist, but I’ve been working with a statistician to help them use PyStan, and I’m interested in possibly contributing.

My relevant expertise is on the Python and C++ side. I’m interested in:

  • Helping finish PyStan 3 (as I understand it, this will move PyStan off the GPL, in addition to other API improvements)
  • Improving the UX of the boundary between C++ and Python (cleaner error messages, simpler compilation/bindings, safer portability of models across runtimes and machines)
  • Building better diagnostics/debugging tools

I’m curious to learn where I should look for the latest PyStan development (I’m guessing pystan-next?) and if there are any good starter tasks to learn some of the codebase.

Thanks! I’ll be around PyData NYC tomorrow if anyone wants to say hi!


Welcome! This is the right place to ask. I’m ccing the PyStan devs: @ariddell @ahartikainen. I’m sure there are more that contribute.

I know a bit about the C++ boundary. So do several others, including @mitzimorris, @seantalts, and @sakrejda.

Hi Leif. Glad to hear you’re interested in helping out. The PyStan 3
effort could certainly use help. PyStan 3 is (somewhat) usable right
now. There’s a wheel in the Releases of the stan-dev/pystan-next
repository which you can install. (You might also want to read


)

If you’re interested in the C++/Python interface you can look at
stan-dev/httpstan. The biggest issue here is performance. There are
IO/threading issues which make things about 10x slower than they should
be. I’m working on that now.

Hi Leif! I’ll be there later today for your talk :)

We’re (mostly @mitzimorris right now) also looking into developing, in parallel, a Python wrapper around CmdStan that will be much more lightweight in code, license, performance, install, and interface, and should still satisfy something like 95% of our users. We’re talking about that in our meeting right now, but I don’t think anyone’s written up a roadmap or skeleton yet. CC’ing @Bob_Carpenter

Wonderful. I’ll read up on the current plans and catch up with @mitzimorris next week. Thanks!

The new PyStan 3/PyStan Next that Ari and Allen are working on will be ISC licensed (not copyleft).

The goal with something like PyCmdStan (an existing package) is to cut the dependency on C++ to make it more accessible on different platforms.

Is there a roadmap for what PyCmdStan should do?

Is it just a wrapper for CmdStan (via subprocess) with an externally installed CmdStan?

Should it also read the output and do other processing, or just call stansummary etc.?

If it is just calling CmdStan, then could ArviZ handle the output and plotting?

Can someone create a GitHub repository for the project?
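To make the "reading output" question concrete, here is a minimal sketch of what stdlib-only output reading could look like. It assumes the usual CmdStan CSV layout (comment lines starting with `#`, one header row, then one row per draw). The function names `read_stan_csv` and `posterior_means` are hypothetical, not part of any existing package.

```python
import csv


def read_stan_csv(text):
    """Parse CmdStan-style CSV text into {column name: list of floats}.

    Assumption: lines starting with '#' are comments (config, adaptation
    info, timing) and every remaining row is either the header or a draw.
    """
    rows = [
        line
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    reader = csv.DictReader(rows)
    draws = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            draws[name].append(float(value))
    return draws


def posterior_means(draws):
    """Return the mean of each column; a stand-in for richer summaries."""
    return {name: sum(values) / len(values) for name, values in draws.items()}
```

A wrapper that stops here and hands the resulting dictionary to a separate analysis package would keep the interface itself dependency-free. Anything beyond simple means (effective sample size, R-hat) would belong in stansummary or in the analysis package.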

Is the plan to write PyCmdStan in OCaml? I recall someone mentioning
this. Seems like an interesting idea.

I need to meet with everyone to discuss plans and a roadmap for Stan. Here are my thoughts for Python for the moment, but it’s very much parallel to R.

First, I think we should be moving to building a posterior analysis package that’s separate from PyStan and that PyStan does not depend on. We could collaborate on this with everyone else in the Python world—Pyro, TensorFlow Probability, Emcee, PyMCX (PyMC4 is being written in TensorFlow Probability). In R, Jonah broke out the plotting package, but most of the posterior analysis is still in RStan.

Second, I think we should have a lightweight, zero-dependencies Python interface to interact with Stan via file I/O and system calls. PyCmdStan doesn’t quite work in general as is, but something in that vein. The author of PyCmdStan, Michael Woodman, has said he’d be willing to license what he has so we could use it in Stan and help us integrate it.
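As a rough illustration of the "file I/O and system calls" idea, the sketch below writes the data to a file, invokes an already compiled model executable, and returns the path of the output file. It is not PyCmdStan's API: the names `build_sample_command` and `run_sampler` are hypothetical, and it assumes a CmdStan version that accepts JSON data files (older releases expect the R dump format).

```python
import json
import subprocess
from pathlib import Path


def build_sample_command(exe, data_file, output_file):
    """Build the argument list for a compiled CmdStan model executable."""
    return [
        str(exe),
        "sample",
        "data", f"file={data_file}",
        "output", f"file={output_file}",
    ]


def run_sampler(exe, data, workdir, runner=subprocess.run):
    """Write `data` as JSON, run the sampler, return the output CSV path.

    `runner` defaults to subprocess.run and is only a parameter so the
    call can be replaced in tests (an illustrative design choice).
    """
    workdir = Path(workdir)
    data_file = workdir / "data.json"
    output_file = workdir / "output.csv"
    data_file.write_text(json.dumps(data))
    # check=True raises CalledProcessError if the executable exits non-zero.
    runner(build_sample_command(exe, data_file, output_file), check=True)
    return output_file
```

Usage would look something like `run_sampler("./bernoulli", {"N": 2, "y": [0, 1]}, "/tmp/run1")`, where `./bernoulli` is a model that has already been compiled with CmdStan. Only the standard library is needed, which is the point of the proposal.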

Third, we need to make our own binary installers for the back-end system calls that include all of the dependencies installed for a local user (rather than at system level, which is very hard for a lot of people, both on clusters and under strict IT departments).

Then the install process would be something like:

  1. install the Stan binary for your platform
  2. install Py2Stan from a public repo (e.g., PyPI)
  3. install posterior analysis tools

My goals line up with the deliverables:

  1. relieve the user from having to install or manage C++ and to avoid C++ version or configuration or makefile incompatibilities

  2. have the Py2Stan install with near zero chance of failure due to limited external dependencies (ideally just numpy)

  3. make the posterior analysis modular so that other sampling projects can use it and, hopefully, we can get additional help building it. I suspect that, given our position in the stats world and our focus, we’ll still be leading a lot of these efforts (by “we”, I mean the statisticians who understand all this better than me!).

@mitzimorris has cycles to work on (2) and at least collect resources for (3), and we’re going to be hiring DevOps people through NumFOCUS to work on (1).
