Stan Algorithm API

Following up on @seantalts’s request, I put together a summary of what the algorithms in Stan can and can’t guarantee, https://github.com/stan-dev/stan/wiki/The-Stan-Algorithm-API. This defines the boundaries of well-defined tests and is consistent with our current testing frameworks, as well as more elaborate testing frameworks that have been discussed.

Regarding the latter see for example the very old branch https://github.com/stan-dev/stan/blob/feature/stat_valid_test/src/test/stat-validity/stat_valid_test_fixture.hpp which demonstrates how to use ensembles of p-values to test for an MCMC CLT with controllable false positive rates. If we ever refactor the I/O code (specs are available if anyone is interested in digging into a substantial project!) and enable fully in-memory C++ callbacks then we can integrate this into explicit integration tests.
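The linked fixture is C++, but the idea is easy to sketch. Below is a minimal Python illustration (my own sketch, not the linked code), using iid draws as a stand-in for a well-behaved chain: each replication yields a p-value from a z-test of the MCMC estimator against the known mean, and a single uniformity test on the ensemble of p-values gives a check whose false positive rate is exactly the chosen significance level.

```python
# Illustrative sketch only -- not the fixture linked above. Uses iid draws
# as a stand-in for a geometrically ergodic Markov chain whose CLT we trust.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_replications = 200   # size of the p-value ensemble
n_draws = 1000         # "chain length" per replication
true_mean = 0.0
true_sd = 1.0

p_values = []
for _ in range(n_replications):
    draws = rng.normal(true_mean, true_sd, size=n_draws)  # stand-in for MCMC output
    estimate = draws.mean()
    # For iid draws ESS == n_draws; a real test would plug in an ESS estimate.
    mcmc_se = draws.std(ddof=1) / np.sqrt(n_draws)
    z = (estimate - true_mean) / mcmc_se
    # Under a valid CLT, z is approximately standard normal, so this p-value
    # is approximately uniform on [0, 1].
    p_values.append(2 * stats.norm.sf(abs(z)))

# Test the ensemble of p-values for uniformity. The significance level alpha
# of this single test is the controllable false positive rate of the whole check.
alpha = 0.01
ks_stat, ks_p = stats.kstest(p_values, "uniform")
print(f"KS p-value: {ks_p:.3f} -> {'FAIL' if ks_p < alpha else 'PASS'}")
```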

In any case, if anyone has any questions or comments let me know.

2 Likes

Thank you! We need an API to do proper semantic versioning (and answer, at the policy-level, questions about doc and testing) and this doc looks like an awesome starting point.

I want to tag a bunch of people who might care about this. My goal here is to quickly see if we can all mostly agree on this and use it to evaluate the testing of the recent PR. If it becomes clear that there’s a lot of contention on parts of the API relating to the PR then I will pull it out for 2.21 and we can continue the policy-level API discussion without that time pressure. So think of this as a quick gut check from the folks who care a lot about the algorithms API (and obviously feel free to comment and tag folks I’m momentarily forgetting).

/cc @Bob_Carpenter @andrewgelman @avehtari @bbbales2 @nhuurre @charlesm93 @yizhang @wds15

The two statements in the wiki (highlighted by me):

If Stan-provided diagnostics do not fail then Markov chain Monte Carlo estimators of square integrable functions will satisfy a central limit theorem for sufficiently long chains.

In particular, if the effective sample size is estimated accurately then for any real-valued function f

Are those statements consistent? The first one applies to Markov chains with geometric ergodicity + detailed balance, while the second one is valid for an almost surely bounded Borel function f with a stronger assumption on the Markov chain’s ergodicity (polynomial ergodicity, IIRC).
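For reference (a paraphrase of the quoted statements, not a verbatim copy of the wiki), the CLT in question is the usual Markov chain Monte Carlo CLT: for a square integrable function $f$ and a sufficiently long chain $\theta_1, \dots, \theta_N$,

$$
\hat{f}_N = \frac{1}{N} \sum_{n=1}^{N} f(\theta_n)
\;\overset{\text{approx}}{\sim}\;
\mathcal{N}\!\left( \mathbb{E}_\pi[f],\; \frac{\operatorname{Var}_\pi[f]}{\mathrm{ESS}} \right),
\qquad
\text{MCMC-SE} = \sqrt{\frac{\operatorname{Var}_\pi[f]}{\mathrm{ESS}}},
$$

so the estimator and its reported standard error are only meaningful when $\operatorname{Var}_\pi[f]$ exists and the ESS estimate is accurate.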

2 Likes

Thanks for asking, but can we wait on this until we get to a decision regarding that? I know it was proposed as a solution to that here, but I think that it misses the issue (explanation here).

Good call! The CLT requires square integrability (technically (2 + \epsilon)-integrability in some cases) for the variance, and hence the MCMC-SE, to be well-behaved. Technically geometric ergodicity is not a necessary condition for a CLT, but the exceptions are, well, quite exceptional. In particular, the polynomial ergodicity results typically aren’t of general applied utility.

I updated the second statement to explicitly include the square integrability condition. Thanks!

2 Likes

Looks good but I don’t think this would have been sufficient to preempt the incident.
The question there was stability between releases. If a user has a working application built on Stan 2.21, when can they expect it to continue to work if they upgrade to Stan 2.22? I guess the expectation is that any model that works with the default settings will continue to work with the defaults. But what if a model requires a lot of tuning? How much re-tuning is needed? How do you test that?

I think that issue is orthogonal to this one. This cannot wait because there’s a release coming soon. We need to decide if the PR has compromised the integrity of the code base before that. Questions about the integrity of the developer process can wait.

2 Likes

Wouldn’t this be better as a design-doc (in that repo) so people can collaborate and comment on it in an easier way? (iirc github wikis have history but don’t have proper collaboration and commenting)

Is this a “ride or die” change? Or will Stan 2.21 be fine without it? If it’s not a ride or die change, the most conservative thing would be to say that it did not clearly/cleanly/uncontroversially pass review by the deadline and so it shouldn’t be in 2.21?

3 Likes

Good question. I don’t think it came up during the review. @betanalpha, I think you have some opinions on this. Please share.

Thanks for getting the ball rolling on this.

I agree with @nhuurre. I think we do need something like this and it looks good, but I doubt that having it before would have avoided the issue over the recent PR.

It seems to me that there’s a difference between what should be part of the official unit tests and what is reasonable to ask to see during a PR review. That is, even if it doesn’t make sense to add a certain test to the unit tests that will always be run (because it’s not part of the guaranteed API), that doesn’t imply that it’s unreasonable to ask for certain checks to be done that are relevant to the consistency of the user experience.

2 Likes

My understanding from the bit of the thread before it derailed: it seems to give a 5-10% speedup (measured in ESS/gradient) for some models, with basically the same behaviour for others. The best speedup is for things like iid or lightly correlated normals (or Student-t’s with high degrees of freedom). The cost of this is more divergences, caused by the coarser resolution of the symplectic integrator at the larger step size. Summary here
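For anyone who wants to reproduce that kind of comparison, here is a minimal sketch of the ESS/gradient metric (the quantity the 5-10% figure refers to). The ESS estimator below is a simplified autocorrelation-based version, not Stan’s exact one, and the draws and leapfrog counts are hypothetical placeholders for the output of the two samplers being compared.

```python
# Minimal sketch of the ESS-per-gradient comparison; the ESS estimator is a
# simplified autocorrelation-based version, not Stan's exact implementation.
import numpy as np

def ess(draws):
    """Crude effective sample size via a truncated autocorrelation sum."""
    x = np.asarray(draws, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    # Sum autocorrelations until the first negative value.
    tau = 1.0
    for t in range(1, n):
        if rho[t] < 0:
            break
        tau += 2.0 * rho[t]
    return n / tau

def ess_per_gradient(draws, n_leapfrog_total):
    """ESS divided by the total number of gradient (leapfrog) evaluations."""
    return ess(draws) / n_leapfrog_total

# Hypothetical example: draws_a/draws_b are posterior draws of one parameter,
# leapfrog_a/leapfrog_b the total leapfrog steps reported by each sampler.
rng = np.random.default_rng(0)
draws_a, leapfrog_a = rng.normal(size=4000), 60_000
draws_b, leapfrog_b = rng.normal(size=4000), 55_000

eff_a = ess_per_gradient(draws_a, leapfrog_a)
eff_b = ess_per_gradient(draws_b, leapfrog_b)
print(f"relative change in ESS/gradient: {100 * (eff_b / eff_a - 1):+.1f}%")
```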

1 Like

I’m not sure what would count as “ride or die”. Do you just mean “absolutely urgent”? Is there anything other than a bug fix that would qualify?

That’s not up to me to define. But I’d say that there is a release date coming, there is a controversial feature, and the idea of the freeze was to make it very, very hard for something to slip through the net. So does this PR violate the principles underlying the code freeze? And if so, is it worth taking a risk on?

If you want my specific view, I’d say that it should not be in 2.21. I’m basing this on the principle that no one should be able to approve their own PR or decide when it has been reviewed enough (this is materially different from deciding that on someone else’s PR). If Sean had done a secondary review before merging, it might be different, but he didn’t. I also think it’s awkward that the review didn’t happen on the PR, which probably made it easier for this to happen.

2 Likes

To me a feature/enhancement PR should rarely be “ride or die”, and ample discussion and proper resolution take priority over overriding procedure.

3 Likes

This is exactly the point of defining the API more carefully – it establishes what we guarantee to users and what we don’t, which also defines what is testable. This is particularly challenging given the breadth of models a user can specify, and hence the breadth of ways the algorithms can interact with that model – all we can say is that if the tuned fit doesn’t flag any diagnostics then they’re probably okay, and if the tuning no longer avoids diagnostic failures then they will have to retune.

Design documents at the moment are used for new features. Since this is just formalizing what is already defined by the established testing burden I went with a Wiki and this Discourse thread for discussion.

No code is ride or die. I would very strongly prefer to have this in so that 2.21 can define the new sampler going forwards and allow us to start reducing the confusion about what algorithms are being run under Stan’s hood.

Definitely. Given the challenge of integration testing the algorithms, we have relied on tests like that for some time. The bug a few years ago was discovered by a user who reported odd behavior, but only for variances and only for effective sample sizes of order O(10,000), which far exceeded what we had been testing with. Similarly, a small bug was found in the additional checks PR from last month due to informal tests run by Ben that again exceeded what we had been testing (pushing up to one to ten million effective samples). For the stepwise adaptation PR, Aki also requested an informal test that was within the scope of the algorithm library, which I ran. These requests were all welcomed and acknowledged.
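As an illustration of the kind of informal long-chain check being described (a sketch of the idea, not the actual tests Ben or Aki ran), one can push the effective sample size into the 10^6-10^7 range on a target with known moments and verify that the variance estimate stays within a few Monte Carlo standard errors of the truth:

```python
# Sketch of an informal long-chain check on a target with known moments;
# iid normal draws stand in for a very long, well-mixed chain.
import numpy as np

rng = np.random.default_rng(2020)
n = 10_000_000                    # effective sample size in the 10^7 range
true_var = 1.0

draws = rng.normal(0.0, np.sqrt(true_var), size=n)
var_hat = draws.var(ddof=1)

# For a normal target, Var(s^2) ~= 2 * sigma^4 / n, so the Monte Carlo
# standard error of the variance estimate is sqrt(2 / n) * sigma^2.
mc_se = np.sqrt(2.0 / n) * true_var

z = (var_hat - true_var) / mc_se
print(f"variance estimate: {var_hat:.6f}, z-score vs truth: {z:+.2f}")
assert abs(z) < 4, "variance estimate drifted far outside its Monte Carlo error"
```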

The difference here is that the requested test did not probe well-defined behavior, nor did it provide any guarantee that the generic user experience would be unchanged. See my long explanation in the other post.

Ok. How about a markdown doc in the repo, so that the comments and discussion are kept with the final product? It would make it easier to go through line by line, which it really needs: all kinds of stuff in there isn’t really true.
Example:
If Stan-provided diagnostics do not fail then Markov chain Monte Carlo estimators of square integrable functions will satisfy a central limit theorem for sufficiently long chains.

The Stan diagnostics are not nearly powerful enough to support that sort of statement.

1 Like

Correct, but it’s the soft guarantee that we give our users. Technically we can’t guarantee any behavior aside from that of target distributions we have already run or analyzed theoretically. We do, however, tell our users to expect a CLT if no diagnostics are flagged, and we are then responsible for violations thereof.

The API is a technical design document not a marketing pitch.

2 Likes

Perhaps the correct version is

If Stan-provided diagnostics do not fail then Markov chain Monte Carlo estimators are as reliable as we can make them for chains of the length the user has run.

I also strongly disagree that we don’t need to ensure a) that the diagnostics aren’t flagged in error and b) that everything doesn’t go wild after they’re flagged. Skipping b) would be fine if the diagnostic were infallible, but it isn’t.
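To make a) concrete, here is a hedged sketch of how one might estimate a diagnostic’s false positive rate: run many replications of a model where the CLT certainly holds (iid normal chains here), compute split R-hat, and count how often it fires. The formula and the 1.01 threshold follow the standard published definition, but this is an illustration, not Stan’s test suite.

```python
# Sketch: estimate the false-positive rate of a split R-hat > 1.01 flag on
# chains where the CLT is known to hold (iid standard normal draws).
import numpy as np

def split_rhat(chains):
    """Basic split R-hat on an array of shape (n_chains, n_draws)."""
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    # Split each chain in half, giving 2 * n_chains sequences.
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    m, n = splits.shape
    chain_means = splits.mean(axis=1)
    chain_vars = splits.var(axis=1, ddof=1)
    between = n * chain_means.var(ddof=1)
    within = chain_vars.mean()
    var_plus = (n - 1) / n * within + between / n
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(7)
n_reps, threshold = 1000, 1.01
flags = 0
for _ in range(n_reps):
    chains = rng.normal(size=(4, 1000))   # 4 iid "chains"; the CLT certainly holds
    if split_rhat(chains) > threshold:
        flags += 1

print(f"empirical false-positive rate: {flags / n_reps:.3f}")
```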

1 Like

The design doc page is based on the Rust Request for Comments (RFC) scheme, which is made for more than features. IMO it’s a much nicer place for this sort of thing, since people can comment line by line and comments get more structured context. We can also tag all relevant parties as reviewers, who can approve, decline with comments, or abstain.

Regarding APIs and good API design, I think there is a lot of good stuff here:
https://apiguide.readthedocs.io/en/latest/preface.html

User empathy is focused on ergonomics at the point of delivery. Developer empathy is focused on painless systems integration. Good API design is still fundamentally about ergonomics, but the context is different. As a user, successful design has the quality of affordance (obviousness, “don’t make me think”). For developers, good API design also incorporates this quality (through idioms and good documentation). Further, good API design allows developers to integrate their applications with your systems in a decoupled manner; they are able to reuse the resources exposed by your API in previously unimagined ways. Lastly, good APIs are stable and adhere to the ‘principle of least surprise’: developers are able to rely on your API to behave in a predictable manner.

The gold standard in developer empathy is found in thriving open source projects. Collaboration, peer review and a responsive community are the hallmarks of such projects. Organisations that do well in this space nurture their relationship with developers as though their survival depends on it (quite often because it does). This does not describe Australian Government IT of today, but hopefully describes the “government as a platform” of tomorrow. Some things are certain: APIs are important, they can generate additional and significant value, developer empathy is critical to success and we need all the help we can get.

(emphasis mine)

Michael has written a nice wiki page, but this is not an API, as the term is used in software engineering. I would call it a requirements spec. It can be used as the basis of a test plan, but it is no substitute for comprehensive tests. If reviewers have asked for tests that document how changes to the algorithm affect the system behavior and end user experience, the collaborative response is to write them.

5 Likes