Hey all,
We’ve had a lot of discussion on this and I’ve updated it accordingly. Last time I didn’t do a good job explaining how this came about and what its goals were, so I’ll start with that this time :)
Roadmap: Context
One responsibility the Stan Governing Body placed on the Technical Working Group Director (currently me) was to come up with a roadmap for Stan development. Ideally such a document can be used for several purposes:
- To let our users know what Stan devs are thinking about and working on
- To possibly attract funding for specific items or subprojects
- To draw attention and solicit help both for existing projects and new projects and areas where we think we could use a helping hand or leadership.
To that end, I gathered a bunch of the tech leads for various interfaces together over a weekend of meetings to discuss what everyone had in mind for the coming year. A couple of tech leads weren’t able to attend, so this first resulting roadmap doc avoids certain areas like Stan Math entirely. I think we got a ton of alignment just from discussing what everyone was working on, and this is my attempt to record all of that. I hope in the future we’ll hold another meeting with those who couldn’t attend the first and round out our coverage.
This is hopefully the last draft of this document before it goes out for a full vote by the electorate. This will be the last opportunity to have your voice heard and propose any edits for this doc, which should live for about a year. I’d also like to solicit all of the tech leads to indicate areas of their own project where they’d be open to contributions from others - hopefully this provides a nice entry-point for folks to join the project.
Again, we’re really hoping to get voices heard on this before publishing it, especially if you have resources to commit to your proposals :)
Stan 3 Roadmap Update - Interface Edition
This roadmap document is partial, describing only what we could get rough consensus on in a meeting with most of the tech leads at the end of June 2019. A few less controversial topics have also been added in the ensuing three months of online discussion. This roadmap doc should not be taken to be a complete list of things that will happen in Stan over the next year, nor is it meant to convey any information about items not discussed. The subject matter is mostly around standardization of the interfaces, as well as some developments on the language side of things as the Math lead hasn’t had time for much planning discussion.
Cross-interface Standardization
We’d like to unify the naming conventions and structure of typical *Stan interfaces so that we can share doc, pedagogy, and research easily across interface languages. We’d also like to remove a lot of the confusing knobs and twiddly bits that most users do not need and create a more streamlined experience that lends itself to more easily doing the right thing.
Unifying names
We’d like to make the following name changes, first in the Stan services and CmdStan and then propagating to all of the interfaces. This will necessitate a major version number bump. Anything not mentioned here or in the next section will stay the same. Between these new names and the names that are staying the same (because they’re the same between RStan, PyStan, and CmdStan) we’ll create what we think a minimal interface is.
- num_warmup -> num_warmup_iters
- num_samples -> num_main_iters
- iters -> removed as a concept; replaced by num_warmup_iters and num_main_iters
- init_r -> init_range
- refresh -> update_interval
- adapt_delta -> stepsize_adapt_goal
- lp__ -> target_lpd
- lp_accum -> target_lpd_accum (generated code mostly)
- For diagnostic files and unconstrained parameter name outputs, we’d like to prepend “unconstrained_” to all parameter names to drive home the difference.
Removing some parameters from all interfaces
We’ll get rid of jitter, kappa, gamma, and t0, as we haven’t seen any use cases for these parameters and think maintenance and pedagogy are easier without them.
Functions and methods
This list will serve to specify the minimum required functionality to be a full Stan interface. All names should line up with those in Stan services, which should be aligned to those in this document first.
- model = compile_model(stan_src)
- model.add_data / model$add_data
- model.run_hmc / model$run_hmc
- model.continue_hmc / model$continue_hmc
- model.maximize / model$maximize
- model.generate_quantities / model$generate_quantities
Returning results
A StanFit
object (the object returned by asking Stan to fit a model with data) will have a few ways to access results. The first way is via methods for each of the Stan parameters that maintains the Stan dimensions and merges chains together. So that means that 1000 draws of vector[2] theta[3];
from 4 chains can be accessed with fit$theta
, which will be a 1000x3x2 nested array.
The second mechanism for extracting draws will be directly as an array accessor on fit
; fit[i]
will return the first iteration of all of the Stan parameters unflattened. So if we had theta[3]
and phi[4]
, fit[1,]
is a list of length 2, and fit[1,1]
is a list of length 3.
The third mechanism for extracting draws will be available through asMatrix
methods, which will create separate columns for each of the Stan dimensions as we currently do (e.g. theta.1.2
) and have one row per iteration, as well as a column indicating the chain id.
We’ve had some requests for an additional apply
method that would take a function such as mean
or median
and correctly apply it across iterations and chains such that fit$apply(‘theta’, mean)
on theta above returns a 3x2 array of means. There was some disagreement on this but I would guess if someone wants to submit a PR for it we’d welcome it.
New Lightweight Interfaces
CmdStan
CmdStan has been playing a dual role as a command-line interface for users as well as for interfaces that don’t want to dynamically link against a compiled Stan model (for good reason!). We formally recognize that CmdStan’s interface will focus on command-line use by humans. This just means that all the defaults will be centered around human interaction - e.g. text-based formats for input and output will be the defaults even as we add support for binary formats (PRs welcome!).
CmdStanPy and CmdStanR
These interfaces will be architected to call CmdStan as a separate process for all Stan functionality and will provide a simple reference implementation for other languages to follow when creating a new interface. They should be fairly minimal pass-throughs directly to the underlying CmdStan that are very easy to update when a new version of CmdStan/Stan/Math are released.
Interlude: What’s nice about not doing an in-memory interface?
- Quicker releases - updating to a new version of Core Stan from within your LangStan version should be doable without even upgrading your version of LangStan - you could run a function to download or switch to a new version in-place.
- Robustness - crashing Stan wouldn’t bring down your e.g. R session.
- C compiler agnostic - you wouldn’t need to download the specific version of a C++ toolchain that matched the one your language was built with. Many users already have installed C++ toolchains that don’t match that version and getting those systems to pick up the correct version has historically been a big pain point. We’d also be able to bump the C++ compiler version requirements on the Math library arbitrarily as needed by Math devs.
- Easier and stricter interface - over the years, some of the language interfaces have started to depend on source code within the Stan project rather than using only the exposed APIs, which causes a ton of coordination overhead between interface devs and C++ devs as well as wasting a tremendous amount of interface dev time when some change goes in uncoordinated. Switching to a more black-box CmdStan or ServerStan based interface would prevent all of that.
ServerStan (Needs a tech lead! Could be you!)
For more robust programmatic consumption, we’ll create a separate ServerStan that is conceptually similar to Allen’s HTTPStan (underlying PyStan). The benefit to ServerStan over a CmdStan-based interface is essentially just that you can have access to fast log_prob evaluations (as well as a few other, somewhat more niche features of a similar flavor). Allen has mentioned that he’d be interested in switching PyStan over to use such an interface once it’s available.
The key differences between Allen’s HTTPStan and ServerStan:
- ServerStan will be compiled into a statically linked binary, one per Mac, Linux, and Windows, that should run on any version. Eventually, we’ll link against LLVM and Clang so that users won’t even need a C++ toolchain installed.
- ServerStan might not use HTTP - we’ll look for something that works easily on R, Python, Julia, and a variety of languages but includes support for binary-encoded data messages, since processing text formats can take a ton of time. grpc seems to be a reasonable candidate with a long track record of good performance and bindings for many languages.
- We could even have the new compiler automatically generate .proto files for a given model to make really efficient interfaces.
Stan C++ Services
We’ll add diagnostic services to the C++ and simplify the code to get rid of things that aren’t needed to support interfaces (e.g. the ~16 flavors of HMC). For the new diagnostic service API calls, we’ll provide functions on std::vector<double *>
with compute_
, check_
, and format_*_error
variants that all take the same inputs and return either the computed value, a boolean indicating if it passed the check, or a formatted string error message if the check didn’t pass, respectively. We’d also like to create example code showing how to call each of these services given a fit (we should be able to create a reasonable version of that when we create the tests for them). These functions will then be wrapped up and exposed as methods StanFit object in the downstream language-specific Stan interface - you’ll be able to call each of them or all of them at once on a specific set of parameters or on all of them at once. Here are the various quantities that will have this functionality:
- effective_sample_size
- rhat
- treedepth
- energy
- quantile
- mean
- variance
For example, compute_rhat will return the rhat statistic, check_rhat will compute it and check that it is below a certain threshold that the Stan team has empirically found useful, and format_rhat will call check_rhat and, if there is an error, return a formatted string presentable to a user, something like “rhat of 1.2 was above safe threshold of 1.1.”
Model Class Augmentations
The model class was written with the principles of encapsulation and modularity in mind and does not expose the information required for increasingly complex interfaces and use-cases built around Stan. Luckily we can remedy most of this fairly easily by adding additional functionality without breaking anything, and eventually we can tweak some of the existing signatures (to e.g. get rid of all of the in/out arguments to functions that force the language interfaces to manage their own memory, and instead just take const inputs and return outputs as values) whenever the interfaces are ready to consume that change.
Faster compile times
Template parameter defaults
We’ll provide sensible defaults (and adjust names to be consistent across all of the C++ code) for all template parameters,
- T__=double
- propto__=false
- jacobian__=false
New methods for ease of use
It’s easier for the R and Python interfaces to call methods on a fit object rather than hooking up a templated functional to a model. See e.g. https://github.com/stan-dev/stan/pull/2701
- hessian
- hessian_grad_product
Parameter metadata
We’ve decided toinclude two new methods that expose sized parameter types on both the constrained and unconstrained scale as a JSON string. In general, exposing metadata like this in an easily-consumable format should add a lot of power to what these interfaces can do, and it solves the first use case on the list - more use-cases forthcoming. These methods should be coming up soon in the new compiler:
string get_unconstrained_types();
string get_constrained_types();
An example response for the constrained type for cov_matrix[N] m[K, J];
given N=27, K=3, J=4:
[{"name":"m",
"type":{
"name":"array",
"size":3,
"type":{
"name":"array",
"size":4,
"type":{
"name":"matrix",
"rows":27,
"cols":27}}}}]
We’ll also need a method that returns a dictionary of parameter name to transform applied.
Stan API
The only functions and header files that will remain stable are those in the Stan services and the model class. Everything else can change at whim - names, file locations, directory structures, etc. This is an especially important consideration at a time with RStan is using some of the Stan source code that isn’t in services, but we’re switching to a new compiler that will end up being in a different language and uninclude-able.
Of course, before anyone could reasonably expect this to happen, we need to augment the services and model class to add APIs that will satisfy the needs of the minimal interfaces above. Please file tickets on the stanc3 repo for anything you think is missing!
Stan language features
- Closures and lambdas - these should make specifying functions for parallel map and various solvers much easier.
- Ragged arrays
- Sparse Matrices
- Support for avoiding unnecessary transfers to and from OpenCL devices such as GPUs. This should lead to really extreme speedups over what’s currently possible, and is something really difficult to implement in the Math library.
Stan in Stan
One of @seantalts goals will be to add language features such that users can define most functions that they would need in the Stan language itself. This should allow users to package up and share additional Stan functionality much more easily than requiring them to submit a PR to the Math library in C++. Functions defined in Stan will then become more efficient than ones defined in C++ because we’ll be able to do whole-program optimization through that code, eliminating redundant checks, pulling out normalization constants to outside of the HMC loop, etc.
For this, we’d need the following additional features (and we’re looking for volunteers to spearhead them!)
- User-defined derivatives.
- An “auto-propto” optimization that can take arbitrary Stan code and analyze which statements only need to execute conditional on needing gradients for their input variables. E.g. automatically inserting if wrappers like these
- Optionally, an analysis phase that pulls out data-only computation to outside of the HMC loop (i.e. into the constructor of the Stan model). @mgorinova’s SlicStan paper has an algorithm for doing this.
Other compiler deliverables that need volunteers
- A proper ‘extern’ keyword that automates much of the work required to hook up a custom C++ function with gradients.
- Named blocks or annotations for things like “save these variables to output.”
- Blockless - there were some strong arguments for preserving most of the blocked structure, but no one seemed like they’d miss
transformed parameters
, and most people seemed to think that the data and transformed data block could use a name more directly indicating that those blocks are for input and pre-processing. - Getting rid of arrays of real numbers from the language entirely to avoid confusion with vectors.
- Folks who’d like to be able to access the gradient of Stan or User Defined Functions in the model block.
Stan Parallelism
We’re moving into an age where folks have even more cores available to them and need to architect for this across the C++ codebase. To that end, we’ll either use something like the TBB or create our own system that provides load-balanced scheduling of tasks across pool of workers.
Miscellaneous ideas
- Writing samples out in some format that it would be easy to run column-wise or other queries over the data as stored on disk. Something like sqlite might not even be crazy here.
- With threading, we could now run multiple chains of warmup in parallel and pool information among the chains for faster adaptation.
New Compiler Integration
The new Stan compiler written in OCaml (“stanc3”) is ready for beta testing! We’re targeting replacing the old compiler completely for the 2.21.0 release on October 18th, 2019.
Future Roadmap Topics
- Serialization of input data and output samples
- General I/O framework
- Updating Autodiff for parallelism
- Integration of a C++ compiler
Order of operations
We had talked at the meeting about implementing all of these ideas in CmdStan first and letting the interfaces use that as their reference. However, we don’t currently have anyone actively maintaining CmdStan as their primary job and would love volunteers! And due to that I don’t think the other language interfaces should wait on CmdStan to update names and go to Stan 3.
As always, please comment if I got things wrong or if you have ideas! And if any of these projects seem like places you might be able to help out please get in touch.
Thanks all!