Stan Algorithm API

If a new sampler (or is it just a new name for the existing one?) is being introduced during the next development phase, it’s really unclear why a somewhat-breaking change should be rushed in right now.


that said, I still think we need to set up an algorithms API, so I hope the work in this thread keeps going. I like @anon75146577’s idea to set this up as a design doc so we can do line by line commenting :)


I like that formulation of what an API doc is about. I just like to think of it all as following from the Golden Rule, which that guide neatly pointed out arises from empathy. I keep meaning to make a post on that.

I’m not convinced developer empathy is maximized in open source. We had a much easier time managing things at SpeechWorks with a clear chain of command than we’ve had with Stan, and the SpeechWorks project had more full-time equivalents working on it.

A full API spec would document function, class, etc., signatures. Wikipedia sums this up neatly:

An application programming interface (API) is an interface or communication protocol between a client and a server intended to simplify the building of client-side software. It has been described as a “contract” between the client and the server, such that if the client makes a request in a specific format, it will always get a response in a specific format or initiate a defined action.

Here, client/server are being used generically as user/provider.

In an idealized test-driven development setting, the API is laid down before development begins. In any setting, I think the doc, API, and test plan need to be developed together, because they’re all going to need to agree in the end. The doc should document the API, and the tests should test the API as documented.

I am aware of the formal definition of an API, but we cannot define a productive formal API in this case given the lack of guarantees that we can provide for the return objects that users have come to expect.

Probabilistic estimation ultimately returns estimators of expectation values with respect to a specified target distribution. Already this conflicts with standard API definitions because the server can’t guarantee explicit, uniform behavior of these estimators to the client.

In order to guarantee anything uniformly we would first have to drastically limit the scope of the modeling language to something even less expressive than lm, as well as the functions whose expectations could be considered and the size of the data. Even then we wouldn’t be able to guarantee anything outside of the assumption that the specified model contains the true data generating process.

It’s even worse with stochastic estimators like MCMC because the weak guarantees that we do have under limited circumstances are not deterministic. Moreover users have become accustomed to receiving samples, and not the MCMC estimators themselves, but those samples cannot be used arbitrarily. They carry with them weak, stochastic guarantees only when used properly, and it is up to the client to use them properly in the context of the specified target distribution (model + data, if any), functions of interest, and meta information, such as diagnostics, provided.
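To make “used properly” concrete, here is a minimal sketch in Python (not Stan’s actual implementation; the function names are made up for illustration) of what the weak, stochastic guarantee looks like in practice: an MCMC estimate of an expectation is only meaningful together with a Monte Carlo standard error, which itself rests on an effective sample size estimate.

```python
import numpy as np

def ess(x):
    """Crude effective sample size from the initial positive sequence of
    autocorrelations (a simplified stand-in for what Stan computes)."""
    n = len(x)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for rho in acf[1:]:          # sum lags until the first non-positive one
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

def estimate(draws, f):
    """MCMC estimator of E[f] with a Monte Carlo standard error.
    The guarantee is probabilistic and approximate: the error is only
    roughly normal at this scale, and only if the chain is well behaved."""
    fx = f(draws)
    return fx.mean(), fx.std(ddof=1) / np.sqrt(ess(fx))

# Sanity check with iid draws from N(1, 2^2), so the exact answer is known
rng = np.random.default_rng(0)
draws = rng.normal(loc=1.0, scale=2.0, size=10_000)
mean, mcse = estimate(draws, lambda t: t)
print(f"E[theta] ~ {mean:.3f} +/- {mcse:.3f}")
```

Everything probabilistic here is approximate and chain-dependent, which is exactly why it can’t be promised as a uniform API contract.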

This document defines the contract between Stan and the end user (whether an interface or an actual person): it specifies as rigorously as possible how the intermediate quantities should be used, and it makes explicit that the service routes guarantee nothing about that use beyond the basic shape/type of the return objects.

Reading over this conversation I was a bit confused. If we think of the parts of the API broken up with something like

  1. User facing service layer
  2. Developer facing layer
  3. Contract on return object structure for (1) and (2)
  4. Contract on statistical properties of return object for (1) and (2)

Most of this conversation is around (4)? It would be nice to also have 1:3 defined.
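For (3), a contract on the return object structure could be as simple as documented shapes and types with no statistical claims attached. A hypothetical sketch in Python (the class and field names here are invented for illustration, not an actual Stan interface):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SampleResult:
    """Hypothetical layer-(3) contract: shape/type guarantees only,
    with no statistical guarantees implied about the draws."""
    draws: np.ndarray        # shape (num_chains, num_draws, num_params)
    param_names: list[str]   # length num_params
    diagnostics: dict        # e.g. divergences, tree depths, step sizes

    def __post_init__(self):
        num_chains, num_draws, num_params = self.draws.shape
        assert len(self.param_names) == num_params

result = SampleResult(
    draws=np.zeros((4, 1000, 2)),
    param_names=["mu", "sigma"],
    diagnostics={"divergent": np.zeros((4, 1000), dtype=bool)},
)
print(result.draws.shape)
```

The point is that this layer is easy to pin down precisely; the controversy is entirely about what, if anything, layer (4) can promise about the numbers inside.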

While we can’t make guarantees for arbitrary models, would it be useful to define tests with something like SBC (simulation-based calibration)? In my mind the only way to deal with (4) is to have some unit testing scheme on known models and on models we know we will do badly at.
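As a rough illustration of the SBC idea (a hypothetical toy, using a conjugate normal model so the “sampler” is exact and the expected rank uniformity is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(1)

def sbc_ranks(num_sims=200, num_data=10, num_post=99):
    """Toy SBC loop on mu ~ N(0, 1), y_i ~ N(mu, 1): draw from the
    prior, simulate data, draw from the (here exact) posterior, and
    record the rank of the prior draw among the posterior draws."""
    ranks = []
    for _ in range(num_sims):
        mu = rng.normal(0, 1)                       # draw from the prior
        y = rng.normal(mu, 1, size=num_data)        # simulate data
        post_var = 1 / (1 + num_data)               # exact conjugate posterior
        post_mean = post_var * y.sum()
        post = rng.normal(post_mean, np.sqrt(post_var), size=num_post)
        ranks.append(int((post < mu).sum()))
    return np.array(ranks)

ranks = sbc_ranks()
# Under a correct sampler the ranks are uniform on {0, ..., 99};
# gross non-uniformity flags a computational problem.
print(f"mean rank: {ranks.mean():.1f} (uniform target ~ 49.5)")
```

With a real model the posterior draws would come from the sampler under test rather than a closed-form posterior, which is what makes the check informative.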

Part of the issue that this all came from is that the algorithms are hard and important, and we need some nice way to check that changes make sense. If we don’t do that then pretty much any change to the algorithms is going to be stalled like this. Maybe this would even be a nice place for the Bayesian posterior database; we could use it to define a formal testing scheme for the level of algorithmic soundness we feel comfortable promising.



Definitely, it’s never really been a point of contention so far and hence not often discussed.

It’s also a bit tricky because the structure provided by the C++ API and the interface APIs is completely different. The hope was to unify this a bit for Stan3 but there hasn’t been much roadmap discussion on that point.

The problem with this kind of approach is that statistical computation is fragile. Empirical testing is great for finding universal bugs and identifying potential pathologies, but guarantees for a finite set of models have little extrapolation to the full ensemble of models. Seemingly inconsequential changes to the priors, observational model, data, or even looking at different expectation values (i.e. changing generated quantities) can break the computation hard.

This is why we have so much emphasis on bespoke diagnostics and workflows – for any kind of robustness users have to check the validity of their computation in the specific context of their model. Because we return the raw samples to which users have grown accustomed we cannot compel them to follow a robust diagnostic workflow, but we can document the dangers as well as possible. Hence the overview.
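To illustrate what one of those bespoke diagnostics looks like, here is a simplified (non-split) version of the R-hat statistic in Python. Stan’s actual diagnostics are more refined, so this is only a sketch of the idea: compare between-chain and within-chain variance in the specific context of the model actually being fit.

```python
import numpy as np

def rhat(chains):
    """Basic (non-split) potential scale reduction factor.
    chains: array of shape (num_chains, num_draws)."""
    num_chains, num_draws = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = num_draws * chain_means.var(ddof=1)      # between-chain variance
    var_plus = (num_draws - 1) / num_draws * W + B / num_draws
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 1000))                    # chains agree
bad = good + np.array([[0.0], [0.0], [0.0], [3.0]])  # one chain is off
print(rhat(good))  # near 1.0
print(rhat(bad))   # well above 1.0
```

A value near 1 is consistent with the chains having mixed; a large value flags trouble for this particular fit, which is exactly the per-model, per-run character of the guarantees discussed above.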

For example, one limitation of SBC is that it can test only models that contain the true data generating process. It cannot test against misspecification at all, let alone the infinite ways misspecification can manifest in practice. In other words SBC is really more of a self-consistency check. That check can be really powerful in finding problems that manifest due to, for example, poor experimental design. Computation with a model rigorously tested against SBC, however, can rapidly fall apart when the model is conditioned on data even slightly perturbed from that self-consistent context.

Outside of SBC we require known expectation values to be able to test anything, which typically limits tests to very simple models that do not characterize the models that arise in more realistic applied settings. Even posteriordb relies on running HMC to get baselines for most of its entries, which becomes circular if we don’t first assume that HMC is valid on that specific model.

Things get even more complicated if you want to try to “test” pathologies. We have very specific probabilistic (and technically approximate) guarantees when things are working, but once things break we have no mathematical guarantee on how that breakage will manifest, and hence no way to well-define baselines against which we can test. In particular, heuristic tests that try to recover empirical pathological behavior are just implicit regression tests and don’t actually test any useful computational behavior.


I don’t get this objection. Why can’t the API be probabilistic or based on a seed?

Agreed. That’s what I was thinking of as defining the bulk of the API in terms of inputs and outputs.

Because we can’t guarantee the behavior of the outputs.

If you want to define the API in terms of a structured array of numbers, i.e. samples without any guarantee on their statistical behavior, then pseudo random output is no problem at all. That would be appropriate for 1:3 for which there really hasn’t been any controversy.

The problem is when you want to try to define those samples as having any guaranteed statistical behavior, such as “if you compute an empirical average of a function then it will be close to the exact expectation value”. These guarantees are more in line with what users are expecting – they don’t just want a bag of numbers; they want numbers that characterize their model in some way.
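A toy illustration in Python (purely for illustration) of why that stronger guarantee can’t be uniform, even for perfect iid draws: if the target distribution has no finite mean, the empirical average never settles down, no matter how many draws the service returns.

```python
import numpy as np

rng = np.random.default_rng(0)

def running_mean(draws):
    """Empirical average after 1, 2, ..., n draws."""
    return np.cumsum(draws) / np.arange(1, len(draws) + 1)

# Standard normal: the mean exists, so the running average converges.
normal_path = running_mean(rng.normal(size=100_000))

# Standard Cauchy: the mean does not exist, so the running average
# keeps jumping around forever despite the draws being exact iid.
cauchy_path = running_mean(rng.standard_cauchy(size=100_000))

print(abs(normal_path[-1]))  # small: the law of large numbers applies
print(abs(cauchy_path[-1]))  # need not be small: no law of large numbers
```

So the shape of the output can be guaranteed uniformly, but the statistical behavior of quantities computed from it depends on the target distribution and the function of interest, which is the client’s side of the contract.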

Got it. I was just thinking of 1:3.