Interface roadmap - last draft before ratification vote

seantalts · September 12, 2019, 6:25pm

Hey all,

We’ve had a lot of discussion on this and I’ve updated it accordingly. Last time I didn’t do a good job explaining how this came about and what its goals were, so I’ll start with that this time :)

Roadmap: Context

One responsibility the Stan Governing Body placed on the Technical Working Group Director (currently me) was to come up with a roadmap for Stan development. Ideally such a document can be used for several purposes:

To let our users know what Stan devs are thinking about and working on
To possibly attract funding for specific items or subprojects
To draw attention and solicit help both for existing projects and new projects and areas where we think we could use a helping hand or leadership.

To that end, I gathered a bunch of the tech leads for various interfaces together over a weekend of meetings to discuss what everyone had in mind for the coming year. A couple of tech leads weren’t able to attend, so this first resulting roadmap doc avoids certain areas like Stan Math entirely. I think we got a ton of alignment just from discussing what everyone was working on, and this is my attempt to record all of that. I hope in the future we’ll hold another meeting with those who couldn’t attend the first and round out our coverage.

This is hopefully the last draft of this document before it goes out for a full vote by the electorate. This will be the last opportunity to have your voice heard and propose any edits for this doc, which should live for about a year. I’d also like to solicit all of the tech leads to indicate areas of their own project where they’d be open to contributions from others - hopefully this provides a nice entry-point for folks to join the project.

Again, we’re really hoping to get voices heard on this before publishing it, especially if you have resources to commit to your proposals :)

Stan 3 Roadmap Update - Interface Edition

This roadmap document is partial, describing only what we could get rough consensus on in a meeting with most of the tech leads at the end of June 2019. A few less controversial topics have also been added in the ensuing three months of online discussion. This roadmap doc should not be taken to be a complete list of things that will happen in Stan over the next year, nor is it meant to convey any information about items not discussed. The subject matter is mostly around standardization of the interfaces, as well as some developments on the language side of things as the Math lead hasn’t had time for much planning discussion.

Cross-interface Standardization

We’d like to unify the naming conventions and structure of typical *Stan interfaces so that we can share doc, pedagogy, and research easily across interface languages. We’d also like to remove a lot of the confusing knobs and twiddly bits that most users do not need and create a more streamlined experience that lends itself to more easily doing the right thing.

Unifying names

We’d like to make the following name changes, first in the Stan services and CmdStan and then propagating to all of the interfaces. This will necessitate a major version number bump. Anything not mentioned here or in the next section will stay the same. Between these new names and the names that are staying the same (because they’re the same between RStan, PyStan, and CmdStan) we’ll create what we think a minimal interface is.

num_warmup -> num_warmup_iters
num_samples -> num_main_iters
iters -> removed as a concept; replaced by num_warmup_iters and num_main_iters
init_r -> init_range
refresh -> update_interval
adapt_delta -> stepsize_adapt_goal
lp__ -> target_lpd
lp_accum -> target_lpd_accum (generated code mostly)
For diagnostic files and unconstrained parameter name outputs, we’d like to prepend “unconstrained_” to all parameter names to drive home the difference.

Removing some parameters from all interfaces

We’ll get rid of jitter, kappa, gamma, and t0, as we haven’t seen any use cases for these parameters and think maintenance and pedagogy are easier without them.

Functions and methods

This list will serve to specify the minimum required functionality to be a full Stan interface. All names should line up with those in Stan services, which should be aligned to those in this document first.

model = compile_model(stan_src)
model.add_data / model$add_data
model.run_hmc / model$run_hmc
model.continue_hmc / model$continue_hmc
model.maximize / model$maximize
model.generate_quantities / model$generate_quantities

Returning results

A StanFit object (the object returned by asking Stan to fit a model with data) will have a few ways to access results. The first way is via methods for each of the Stan parameters that maintains the Stan dimensions and merges chains together. So that means that 1000 draws of vector[2] theta[3]; from 4 chains can be accessed with fit$theta, which will be a 1000x3x2 nested array.

The second mechanism for extracting draws will be directly as an array accessor on fit; fit[i] will return the first iteration of all of the Stan parameters unflattened. So if we had theta[3] and phi[4], fit[1,] is a list of length 2, and fit[1,1] is a list of length 3.

The third mechanism for extracting draws will be available through asMatrix methods, which will create separate columns for each of the Stan dimensions as we currently do (e.g. theta.1.2) and have one row per iteration, as well as a column indicating the chain id.

We’ve had some requests for an additional apply method that would take a function such as mean or median and correctly apply it across iterations and chains such that fit$apply(‘theta’, mean) on theta above returns a 3x2 array of means. There was some disagreement on this but I would guess if someone wants to submit a PR for it we’d welcome it.

New Lightweight Interfaces

CmdStan

CmdStan has been playing a dual role as a command-line interface for users as well as for interfaces that don’t want to dynamically link against a compiled Stan model (for good reason!). We formally recognize that CmdStan’s interface will focus on command-line use by humans. This just means that all the defaults will be centered around human interaction - e.g. text-based formats for input and output will be the defaults even as we add support for binary formats (PRs welcome!).

CmdStanPy and CmdStanR

These interfaces will be architected to call CmdStan as a separate process for all Stan functionality and will provide a simple reference implementation for other languages to follow when creating a new interface. They should be fairly minimal pass-throughs directly to the underlying CmdStan that are very easy to update when a new version of CmdStan/Stan/Math are released.

Interlude: What’s nice about not doing an in-memory interface?

Quicker releases - updating to a new version of Core Stan from within your LangStan version should be doable without even upgrading your version of LangStan - you could run a function to download or switch to a new version in-place.
Robustness - crashing Stan wouldn’t bring down your e.g. R session.
C compiler agnostic - you wouldn’t need to download the specific version of a C++ toolchain that matched the one your language was built with. Many users already have installed C++ toolchains that don’t match that version and getting those systems to pick up the correct version has historically been a big pain point. We’d also be able to bump the C++ compiler version requirements on the Math library arbitrarily as needed by Math devs.
Easier and stricter interface - over the years, some of the language interfaces have started to depend on source code within the Stan project rather than using only the exposed APIs, which causes a ton of coordination overhead between interface devs and C++ devs as well as wasting a tremendous amount of interface dev time when some change goes in uncoordinated. Switching to a more black-box CmdStan or ServerStan based interface would prevent all of that.

ServerStan (Needs a tech lead! Could be you!)

For more robust programmatic consumption, we’ll create a separate ServerStan that is conceptually similar to Allen’s HTTPStan (underlying PyStan). The benefit to ServerStan over a CmdStan-based interface is essentially just that you can have access to fast log_prob evaluations (as well as a few other, somewhat more niche features of a similar flavor). Allen has mentioned that he’d be interested in switching PyStan over to use such an interface once it’s available.

The key differences between Allen’s HTTPStan and ServerStan:

ServerStan will be compiled into a statically linked binary, one per Mac, Linux, and Windows, that should run on any version. Eventually, we’ll link against LLVM and Clang so that users won’t even need a C++ toolchain installed.
ServerStan might not use HTTP - we’ll look for something that works easily on R, Python, Julia, and a variety of languages but includes support for binary-encoded data messages, since processing text formats can take a ton of time. grpc seems to be a reasonable candidate with a long track record of good performance and bindings for many languages.
We could even have the new compiler automatically generate .proto files for a given model to make really efficient interfaces.

Stan C++ Services

We’ll add diagnostic services to the C++ and simplify the code to get rid of things that aren’t needed to support interfaces (e.g. the ~16 flavors of HMC). For the new diagnostic service API calls, we’ll provide functions on std::vector<double *> with compute_, check_, and format_*_error variants that all take the same inputs and return either the computed value, a boolean indicating if it passed the check, or a formatted string error message if the check didn’t pass, respectively. We’d also like to create example code showing how to call each of these services given a fit (we should be able to create a reasonable version of that when we create the tests for them). These functions will then be wrapped up and exposed as methods StanFit object in the downstream language-specific Stan interface - you’ll be able to call each of them or all of them at once on a specific set of parameters or on all of them at once. Here are the various quantities that will have this functionality:

effective_sample_size
rhat
treedepth
energy
quantile
mean
variance

For example, compute_rhat will return the rhat statistic, check_rhat will compute it and check that it is below a certain threshold that the Stan team has empirically found useful, and format_rhat will call check_rhat and, if there is an error, return a formatted string presentable to a user, something like “rhat of 1.2 was above safe threshold of 1.1.”

Model Class Augmentations

The model class was written with the principles of encapsulation and modularity in mind and does not expose the information required for increasingly complex interfaces and use-cases built around Stan. Luckily we can remedy most of this fairly easily by adding additional functionality without breaking anything, and eventually we can tweak some of the existing signatures (to e.g. get rid of all of the in/out arguments to functions that force the language interfaces to manage their own memory, and instead just take const inputs and return outputs as values) whenever the interfaces are ready to consume that change.

Faster compile times

This already came to pass!

Template parameter defaults

We’ll provide sensible defaults (and adjust names to be consistent across all of the C++ code) for all template parameters,

T__=double
propto__=false
jacobian__=false

New methods for ease of use

It’s easier for the R and Python interfaces to call methods on a fit object rather than hooking up a templated functional to a model. See e.g. https://github.com/stan-dev/stan/pull/2701

hessian
hessian_grad_product

Parameter metadata

We’ve decided toinclude two new methods that expose sized parameter types on both the constrained and unconstrained scale as a JSON string. In general, exposing metadata like this in an easily-consumable format should add a lot of power to what these interfaces can do, and it solves the first use case on the list - more use-cases forthcoming. These methods should be coming up soon in the new compiler:

string get_unconstrained_types();
string get_constrained_types();

An example response for the constrained type for cov_matrix[N] m[K, J]; given N=27, K=3, J=4:

[{"name":"m",
  "type":{
     "name":"array",
     "size":3,
     "type":{
        "name":"array",
        "size":4,
        "type":{
           "name":"matrix",
           "rows":27,
           "cols":27}}}}]

We’ll also need a method that returns a dictionary of parameter name to transform applied.

Stan API

The only functions and header files that will remain stable are those in the Stan services and the model class. Everything else can change at whim - names, file locations, directory structures, etc. This is an especially important consideration at a time with RStan is using some of the Stan source code that isn’t in services, but we’re switching to a new compiler that will end up being in a different language and uninclude-able.

Of course, before anyone could reasonably expect this to happen, we need to augment the services and model class to add APIs that will satisfy the needs of the minimal interfaces above. Please file tickets on the stanc3 repo for anything you think is missing!

Stan language features

Closures and lambdas - these should make specifying functions for parallel map and various solvers much easier.
Ragged arrays
Sparse Matrices
Support for avoiding unnecessary transfers to and from OpenCL devices such as GPUs. This should lead to really extreme speedups over what’s currently possible, and is something really difficult to implement in the Math library.

Stan in Stan

One of @seantalts goals will be to add language features such that users can define most functions that they would need in the Stan language itself. This should allow users to package up and share additional Stan functionality much more easily than requiring them to submit a PR to the Math library in C++. Functions defined in Stan will then become more efficient than ones defined in C++ because we’ll be able to do whole-program optimization through that code, eliminating redundant checks, pulling out normalization constants to outside of the HMC loop, etc.

For this, we’d need the following additional features (and we’re looking for volunteers to spearhead them!)

User-defined derivatives.
An “auto-propto” optimization that can take arbitrary Stan code and analyze which statements only need to execute conditional on needing gradients for their input variables. E.g. automatically inserting if wrappers like these
Optionally, an analysis phase that pulls out data-only computation to outside of the HMC loop (i.e. into the constructor of the Stan model). @mgorinova’s SlicStan paper has an algorithm for doing this.

Other compiler deliverables that need volunteers

A proper ‘extern’ keyword that automates much of the work required to hook up a custom C++ function with gradients.
Named blocks or annotations for things like “save these variables to output.”
Blockless - there were some strong arguments for preserving most of the blocked structure, but no one seemed like they’d miss transformed parameters, and most people seemed to think that the data and transformed data block could use a name more directly indicating that those blocks are for input and pre-processing.
Getting rid of arrays of real numbers from the language entirely to avoid confusion with vectors.
Folks who’d like to be able to access the gradient of Stan or User Defined Functions in the model block.

Stan Parallelism

We’re moving into an age where folks have even more cores available to them and need to architect for this across the C++ codebase. To that end, we’ll either use something like the TBB or create our own system that provides load-balanced scheduling of tasks across pool of workers.

Miscellaneous ideas

Writing samples out in some format that it would be easy to run column-wise or other queries over the data as stored on disk. Something like sqlite might not even be crazy here.
With threading, we could now run multiple chains of warmup in parallel and pool information among the chains for faster adaptation.

New Compiler Integration

The new Stan compiler written in OCaml (“stanc3”) is ready for beta testing! We’re targeting replacing the old compiler completely for the 2.21.0 release on October 18th, 2019.

Future Roadmap Topics

Serialization of input data and output samples
General I/O framework
Updating Autodiff for parallelism
Integration of a C++ compiler

Order of operations

We had talked at the meeting about implementing all of these ideas in CmdStan first and letting the interfaces use that as their reference. However, we don’t currently have anyone actively maintaining CmdStan as their primary job and would love volunteers! And due to that I don’t think the other language interfaces should wait on CmdStan to update names and go to Stan 3.

As always, please comment if I got things wrong or if you have ideas! And if any of these projects seem like places you might be able to help out please get in touch.

Thanks all!

bgoodri · September 12, 2019, 6:34pm

I don’t think this is what anyone said and if so, I don’t think it is a good idea. I think it should be {StanDimensions,Chains,Iterations} or which in this case would be 3x2x4x1000 for convergence purposes or for inference purposes where you can ignore the distinction between chains, 3x2x4000. Putting the StanDimensions first allows you distinguish between chains for convergence or not distinguish between chains for inference without permuting dimensions or reallocating memory.

bgoodri · September 12, 2019, 6:37pm

This should also apply to Stan Math. I think for things under prim/ the default template parameter should be double, var for things under rev/ , and fvar<var> for things under fwd/.

bgoodri · September 12, 2019, 6:40pm

I guess JSON is fine, although RStan would have to rely on a package that does JSON parsing. But the name in the last level should be "cov_matrix" rather than "matrix". Otherwise, we wouldn’t know that symmetric elements can be ignored for summarizing purposes.

seantalts · September 12, 2019, 6:45pm

Ahh, I tried so hard to get people to be precise on that thread about what they wanted. @andrewgelman asked for that here: here and here and I didn’t see any objections or clarifications. It also seemed to match up with the storage table, where iterations are changing most slowly. Does that mean the table representing the internal storage is also not what you wanted?

How do you propose having fit$theta return 3x2x4x1000 or 3x2x4000 depending on whether someone intends to use it for convergence or inference?

@andrewgelman, would you be happy with the ordering Ben is proposing?

seantalts · September 12, 2019, 6:49pm

I first suggested S-expressions and @jonah said you guys would prefer JSON. I can change it to S-expressions if you want. I wasn’t planning to include what transformation was applied - we’ve compiled that away by the time we get to code generation. I ran that past you guys back when and it seemed okay at the time as long as you have both the constrained and unconstrained types with dimensions. What kind of summarizing do you need to do that needs to know about a variable’s transformations?

seantalts · September 12, 2019, 6:50pm

I didn’t want to include Stan Math in this meeting because the Math lead hasn’t been available for any roadmap meetings yet.

bgoodri · September 12, 2019, 6:52pm

I think either JSON or something binary is fine, perhaps both. One of the complaints that we get when the summary of a covariance matrix, Sigma, is printed is that you see the same summary for Sigma[1,2] as for Sigma[2,1] because it is a symmetric matrix. So, a better summary would only show that once. Furthermore, there is no need to summarize the fact that the diagonal of a correlation matrix is always 1 or that the upper triangle of a Cholesky factor is always zero. But we need to know what type of matrix the thing is to do that.

bgoodri · September 12, 2019, 6:54pm

I understand. I was just saying that the same arguments for (I don’t know of any against) having default template arguments in generated C++ code would apply to C++ code in the Math library.

seantalts · September 12, 2019, 6:55pm

Damn. We can include another method that exposes a variable’s transformation, then. I’ll make an issue on the stanc3 tracker.

bgoodri · September 12, 2019, 6:57pm

OK. I thought that was already in the works, but if not it is needed also for doing visualizations (parcoord, pairs, etc.) in the unconstrained space.

avehtari · September 12, 2019, 7:09pm

Here is an example which would kill Bob’s computer (and probably mine, too) without thin

thin is rarely needed, but there are cases where if thin is removed, things get very difficult.

How about advi diagnostic? (there is currently experimental code in R for advi in rstan and rstanarm, and for optimizing in some cases in rstanarm)

jonah · September 12, 2019, 7:26pm

Sorry for any confusion. I do prefer JSON and think it would be easy for us to use, but it doesn’t have to be JSON. I guess Ben is right that we would probably use an external R package to parse the JSON, but I think that’s ok (those packages are solid).

jonah · September 12, 2019, 7:31pm

That would be great. In the short term just having the sizes would be a big improvement but exposing the transformations eventually would be preferable.

betanalpha · September 12, 2019, 8:31pm

Not having everything in memory also makes the interface more scalable to longer chains or the wider chains from more complex models with lots of parameters and generated quantities.

andrewgelman · September 13, 2019, 1:36am

Hi, just to clarify, the example I gave was the following:
alpha is a scalar, beta is a vector of length 10, and theta is a 2 x 3 x 4 array.

Let’s assume also that we have 4 chains with 1000 simulations saved each.

Then there are several things I’d like to access.

We already have as.matrix(fit) which in this case I think will return a 4000 x 35 matrix of simulations.
There also must be a way to access the 1000 x 4 x 35 array of simulations so as to get different chains. As a user I typically won’t want this, but we certainly need it for monitor, shinystan, etc. We’ll also need a version for chains of unequal length, i.e. a sort of ragged array.
I want to be able to access one named parameter at a time: thus, a vector of length 4000 for alpha, a 4000 x 10 array for beta, and a 4000 x 2 x 3 x 4 array for theta. Currently we can do this via extract(fit)$alpha, extract(fit)$beta, extract(fit)$theta.
I’d like to be able to extract one draw. For example, extract_one_draw(fit, 82) would return a list with three elements corresponding to the 82nd simulation draw, thus a scalar for alpha, a vector of length 10 for beta, and a 2 x 3 x 4 array for theta.

It’s ok for me to be able to access this by iteration number (between 1 and 4000, corresponding to the 4 chains strung together), i.e., 82 = 1st chain, iteration 82, or 2311 = 3rd chain, iteration 311.

I’d like to be able to access summaries. For example, foo(fit, median) would return a list with three elements corresponding to the median of alpha, the pointwise median of beta (i.e., a vector of length 10), and the pointwise median of theta (i.e., a 2 x 3 x 4 array).

I’d like those last two things because I have recently been tying myself in knots with code trying to work with simulation draws and estimates, for example graphing fitted models.

Andrew

bgoodri · September 13, 2019, 1:42am

I understand you want to access results in various ways but do you have a substantive and / or technical reason why the first dimension is either iterations or chains \times iterations, as opposed to putting the dimensions from the Stan program first, followed by chains, and iterations? In other words, in your example something declared in Stan as real theta[2,3,4]; would be 2x3x4x(4x1000) in the output.

andrewgelman · September 13, 2019, 1:53am

OK, two things:

When we get the big matrix or array, we’ve always done iterations and chains first, then the dimensions from the model after. I’m just used to that. For example, in my books when I illustrate the matrix of simulations, each row is a simulation and each column is a scalar parameter. So if we keep it that way, it will be consistent with my books, including Regression and Other Stories, where the first index is the simulation draw. We could switch, but it would break some backward compatibility. I guess if there’s a good enough reason to switch, we could do so?
I also want to access objects that are the same size and shape as in the Stan program; I don’t want to be stripping out individual draws or computing means,medians,etc. directly as then I have to fumble with indexes, and the code gets particularly ugly when we have parameters that are arrays of differing dimensions.

bgoodri · September 13, 2019, 2:04am

(1) is what is being discussed on this thread. If you put the dimensions of the thing in the Stan program, followed by iterations and chains, then you can do more iterations for the same chains and append the new draws to the “back” of the old draws. If you put iterations and chains first, followed by the dimensions of the thing in the Stan program, I don’t know of any non-convoluted ways of adding more draws. Also, putting the Stan dimensions first aligns the Stan code with the post-estimation interface code, in the sense that theta[1,2] in Stan means essentially the same thing as theta[1,2] in R, except you have a bunch of draws of that random variable.

andrewgelman · September 13, 2019, 2:15am

That makes sense. If we’re breaking backward compatibility in that way (and I assume also with rstanarm), then I’d like to do it sooner rather than later so I can fix Regression and Other Stories to be consistent. We’ll also have to update appendix C of BDA3 and various other things such as case studies, but I guess these are less of a pressing concern.
I’d also like us to do item 2!

Topic		Replies	Views
Stan roadmap Developers	3	917	February 11, 2018
Posting more updates in discourse Developers	1	722	October 11, 2018
Planning the 2.33 release Developers	7	817	August 30, 2023
Planning the 2.37 release Developers	7	423	June 4, 2025
Stan General Meeting Thursday September 26, 2019 11am EDT Meetings	17	1470	September 30, 2019