Following up with @seantalts’s request I put together a summary of what the algorithms in Stan can and can’t guarantee, https://github.com/stan-dev/stan/wiki/The-Stan-Algorithm-API. This defines the boundaries of well-defined tests and is consistent with our current testing frameworks, as well as more elaborate testing frameworks that have been discussed.
Thank you! We need an API to do proper semantic versioning (and answer, at the policy-level, questions about doc and testing) and this doc looks like an awesome starting point.
I want to tag a bunch of people who might care about this. My goal here is to quickly see if we can all mostly agree on this and use it to evaluate the testing of the recent PR. If it becomes clear that there’s a lot of contention on parts of the API relating to the PR then I will pull it out for 2.21 and we can continue the policy-level API discussion without that time pressure. So think of this as a quick gut check from the folks who care a lot about the algorithms API (and obviously feel free to comment and tag folks I’m momentarily forgetting).
If Stan-provided diagnostics do not fail then Markov chain Monte Carlo estimators of square integrable functions will satisfy a central limit theorem for sufficiently long chains.
In particular, if the effective sample size is estimated accurately then for any real-valued function f
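The quoted statement is cut off above; for reference, a standard form of the MCMC CLT under these conditions (my own notation, not a verbatim quote from the wiki) reads

$$
\hat{f}_N = \frac{1}{N} \sum_{n=1}^{N} f(\theta_n),
\qquad
\sqrt{N_{\mathrm{eff}}}\,\bigl(\hat{f}_N - \mathbb{E}_\pi[f]\bigr)
\;\xrightarrow{d}\;
\mathcal{N}\bigl(0,\ \mathrm{Var}_\pi[f]\bigr),
$$

so the MCMC standard error behaves as $\sqrt{\mathrm{Var}_\pi[f] / N_{\mathrm{eff}}}$, which is why the variance (and hence square integrability of $f$) has to be well-defined.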
Are those statements consistent? The first one applies to a Markov chain with geometric ergodicity plus detailed balance, while the second one is valid for almost surely bounded Borel functions f under a stronger assumption on the Markov chain’s ergodicity (polynomial ergodicity, IIRC).
Good call! The CLT requires square integrability (technically (2 + \epsilon)-integrability in some cases) for the variance, and hence the MCMC-SE, to be well-behaved. Technically geometric ergodicity is not a necessary condition for a CLT, but the exceptions are, well, quite exceptional. In particular the polynomial ergodicity results typically aren’t of general applied utility.
I updated the second statement to explicitly include the square integrability condition. Thanks!
Looks good but I don’t think this would have been sufficient to preempt the incident.
The question there was stability between releases. If a user has a working application built on Stan 2.21, when can they expect it to continue to work if they upgrade to Stan 2.22? I guess the expectation is that any model that works with the default settings will continue to work with the defaults. But what if a model requires a lot of tuning? How much re-tuning is needed? How do you test that?
I think that issue is orthogonal to this one. This cannot wait because there’s a release coming soon. We need to decide if the PR has compromised the integrity of the code base before that. Questions about the integrity of the developer process can wait.
Wouldn’t this be better as a design-doc (in that repo) so people can collaborate and comment on it in an easier way? (iirc github wikis have history but don’t have proper collaboration and commenting)
Is this a “ride or die” change? Or will Stan 2.21 be fine without it? If it’s not a ride or die change, wouldn’t the most conservative thing be to say that it did not clearly/cleanly/uncontroversially pass review by the deadline, and so it shouldn’t be in 2.21?
I agree with @nhuurre. I think we do need something like this and it looks good, but I doubt that having it before would have avoided the issue over the recent PR.
It seems to me that there’s a difference between what should be part of the official unit tests and what is reasonable to ask to see during a PR review. That is, even if it doesn’t make sense to add a certain test to the unit tests that will always be run (because it’s not part of the guaranteed API), that doesn’t imply that it’s unreasonable to ask for certain checks to be done that are relevant to the consistency of the user experience.
My understanding from the bit of the thread before it derailed: it seems to give a 5-10% speedup (measured in ESS/gradient) for some models, with basically the same behaviour for others. The best speedup is for things like iid or lightly correlated normals (or Student-t distributions with high degrees of freedom). The cost of this is more divergences, caused by the worse resolution of the symplectic integrator at a higher step size. Summary here
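To make the ESS/gradient metric concrete, here is a toy sketch of how one might compare two sampler configurations on that basis. The chains are simulated AR(1) processes standing in for real sampler output, and the gradient counts are hypothetical numbers, not measurements from the PR:

```python
import numpy as np

def ess(chain, max_lag=200):
    """Crude effective sample size: N / (1 + 2 * sum of positive autocorrelations)."""
    x = chain - chain.mean()
    n = len(x)
    var = x.var()
    tau = 1.0
    for lag in range(1, max_lag):
        rho = np.dot(x[:-lag], x[lag:]) / ((n - lag) * var)
        if rho <= 0:  # truncate at the first non-positive autocorrelation estimate
            break
        tau += 2 * rho
    return n / tau

rng = np.random.default_rng(0)

def ar1(phi, n=20000):
    """Stand-in for sampler output: AR(1) chain with autocorrelation phi."""
    x = np.empty(n)
    x[0] = 0.0
    for i in range(1, n):
        x[i] = phi * x[i - 1] + rng.normal()
    return x

# Hypothetical gradient costs per draw; purely illustrative, not from the PR.
grads_per_draw = {"baseline": 15, "new_adaptation": 12}
for name, phi in [("baseline", 0.9), ("new_adaptation", 0.8)]:
    chain = ar1(phi)
    print(name, ess(chain) / (len(chain) * grads_per_draw[name]))
```

The point is that ESS/gradient folds both chain autocorrelation and per-draw integrator cost into one efficiency number, which is why a higher step size can win overall despite extra divergences.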
That’s not up to me to define. But I’d say that there is a release date coming, there is a controversial feature, and the idea of the freeze was to make it very, very hard for something to slip through the net. So: does this PR violate the principles underlying the code freeze? And if so, is this worth taking a risk on?
If you want my specific view, I’d say that it should not be in 2.21. I’m basing this on the principle no one should be able to approve their own PR or decide when it has been reviewed enough (this is materially different from deciding this on someone else’s PR). If Sean had done a secondary review before merging, it might be different, but he didn’t. I also think it’s awkward that the review didn’t happen on the PR, which probably made it easier for this to happen.
This is exactly the point of defining the API more carefully: it establishes what we guarantee to users and what we don’t, which in turn defines what is testable. This is particularly challenging given the breadth of models a user can specify, and hence the breadth of ways the algorithms can interact with a model. All we can say is that if the tuned fit doesn’t flag any diagnostics then the user is probably okay, and if the tuning no longer avoids diagnostic failures then they will have to retune.
Design documents at the moment are used for new features. Since this is just formalizing what is already defined by the established testing burden I went with a Wiki and this Discourse thread for discussion.
No code is ride or die. I would very strongly prefer to have this in so that 2.21 can define the new sampler going forwards and allow us to start reducing the confusion about what algorithms are being run under Stan’s hood.
Definitely. Given the challenge of integration testing the algorithms, we have relied on tests like that for some time. The bug a few years ago was discovered by a user who reported odd behavior, but only for variances and only at effective sample sizes of order O(10,000), which far exceeded what we had been testing with. Similarly, a small bug was found in the additional-checks PR from last month through informal tests run by Ben that again exceeded what we had been testing (pushing up to a million to ten million effective samples). For the stepwise adaptation PR, Aki also requested an informal test within the scope of the algorithm library, which I ran. These requests were all welcomed and acknowledged.
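As an illustration of the kind of large-ESS check described above (a sketch of the idea, not the actual test Ben ran), one can validate an estimator against a known target and require the error to stay within a few standard errors, at effective sample sizes large enough to expose small biases:

```python
import numpy as np

rng = np.random.default_rng(1)

def check_variance(n_eff, true_var=1.0, z=4.0):
    """Does the sample variance agree with the known target within z standard errors?

    Idealized: draws are treated as iid, so n_eff is just the sample size.
    """
    draws = rng.normal(scale=np.sqrt(true_var), size=n_eff)
    est = draws.var(ddof=1)
    # Variance of the sample variance for Gaussian draws: 2*sigma^4 / (n - 1).
    se = np.sqrt(2 * true_var**2 / (n_eff - 1))
    return abs(est - true_var) < z * se

# Small biases invisible at n_eff ~ 10^4 can fail the check at n_eff ~ 10^6-10^7.
for n_eff in [10_000, 1_000_000, 10_000_000]:
    assert check_variance(n_eff), f"variance check failed at n_eff={n_eff}"
print("variance checks passed")
```

The standard error shrinks like 1/sqrt(n_eff), so pushing the effective sample size up by two orders of magnitude tightens the tolerance tenfold, which is exactly why the variance bug only surfaced at large ESS.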
The difference here is that the requested test did not probe well-defined behavior, nor did it provide any guarantee that the generic user experience would be unchanged. See my long explanation in the other post.
Ok. How about a markdown doc on the repo so that the comments and discussion are kept with the final product. It would make it easier to go through line by line. It really needs it. All kinds of stuff isn’t really true.
Example: If Stan-provided diagnostics do not fail then Markov chain Monte Carlo estimators of square integrable functions will satisfy a central limit theorem for sufficiently long chains.
The Stan diagnostics are not nearly powerful enough to support that sort of statement.
Correct, but it’s the soft guarantee that we give our users. Technically we can’t guarantee any behavior aside from that of target distributions we have already run or analyzed theoretically. We do, however, tell our users to expect a CLT if no diagnostics flag and are then responsible for violations thereof.
If Stan-provided diagnostics do not fail then Markov chain Monte Carlo estimators are as reliable as we can make them for chains of the length the user has run.
I also strongly disagree that we don’t need to ensure that a) the diagnostics aren’t flagged in error and b) everything doesn’t go wild after they’re flagged. Skipping b) would be fine if the diagnostic were infallible, but it isn’t.
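Point a) is at least partially testable by calibration. Here is a toy sketch (my own simplified split-R-hat, not Stan’s actual implementation) that estimates how often the diagnostic flags perfectly well-behaved iid chains:

```python
import numpy as np

def split_rhat(chains):
    """Simplified split-R-hat for an array of shape (m, n)."""
    m, n = chains.shape
    half = n // 2
    sub = chains[:, :2 * half].reshape(2 * m, half)  # split each chain in two
    means = sub.mean(axis=1)
    W = sub.var(axis=1, ddof=1).mean()    # within-chain variance
    B = half * means.var(ddof=1)          # between-chain variance
    var_hat = (half - 1) / half * W + B / half
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)
# Calibration: how often does the diagnostic flag iid (i.e. ideal) chains?
n_reps = 200
flags = sum(split_rhat(rng.normal(size=(4, 1000))) > 1.01 for _ in range(n_reps))
print(f"false-positive rate at threshold 1.01: {flags / n_reps:.3f}")
```

A complementary check for b) would run the same diagnostic on chains that are genuinely inconsistent (e.g. shifted means) and verify both that it flags them and that downstream estimates degrade in the expected way rather than silently.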
The design-doc page is based on the Rust Request for Comments (RFC) scheme, which is designed for more than just new features. I think it’s a much nicer place for this sort of thing: people can comment line by line, comments get more structured context, and we can tag all relevant parties as reviewers who can approve, decline with comments, or abstain.
User empathy is focused on ergonomics at the point of delivery. Developer empathy is focused on painless systems integration. Good API design is still fundamentally about ergonomics, but the context is different. As a user, successful design has the quality of affordance (obviousness, “don’t make me think”). For developers, good API design also incorporates this quality (through idioms and good documentation). Further, good API design allows developers to integrate their applications with your systems in a decoupled manner; they are able to reuse the resources exposed by your API in previously unimagined ways. Lastly, good APIs are stable and adhere to the ‘principle of least surprise’: developers are able to rely on your API to behave in a predictable manner.
The gold standard in developer empathy is found in thriving open source projects. Collaboration, peer review and a responsive community are the hallmarks of such projects. Organisations that do well in this space nurture their relationship with developers as though their survival depends on it (quite often because it does). This does not describe Australian Government IT of today, but hopefully describes the “government as a platform” of tomorrow. Some things are certain: APIs are important, they can generate additional and significant value, developer empathy is critical to success, and we need all the help we can get.
Michael has written a nice wiki page, but this is not an API, as the term is used in software engineering. I would call it a requirements spec. It can be used as the basis of a test plan, but it is no substitute for comprehensive tests. If reviewers have asked for tests that document how changes to the algorithm affect the system behavior and end user experience, the collaborative response is to write them.