Request for Volunteers to Test Adaptation Tweak

User here! Cool work @betanalpha. :)

I think what would be useful for me (and I might just be completely repeating @avehtari here) would be a mapping from old adapt delta to new adapt delta in terms of step size. Essentially if I upgrade rstan and find a model that fit fine with no divergences with old adapt_delta=.90 and now I have to use new adapt_delta=.9999, I’d worry that I’d stuffed something up. It may not be possible to do that completely but even an - in the order of estimate would be super helpful.

4 Likes

tl;dr It’s not possible to estimate that directly. All one can do is increase adapt_delta until one recovers the same step size as before, in which case the sampler behavior will be the same (up to expected MCMC variation). Even that, however, might not give the behavior that people have come to expect due to confusion in what exactly to expect.

High Fidelity

One of the conflicts here is exactly what to expect from an algorithm, or perhaps better what guarantees our algorithms provide.

The HMC sampler in Stan makes only one guarantee, and it’s a soft one at that: if there are no diagnostic failures then the MCMC estimators of the expectation values of square integrator functions maybe follow a central limit theorem with standard error given by \sqrt{ \frac{ \text{Var}[f] }{ \text{ESS}[f] } }. For now let’s be positive and elevate that maybe to a will and see what the consequences are.

Even under ideal circumstances the sampler does not provide bitwise reproducibility. Seemingly irrelevant or insignificant changes will alter the exact states given at each iteration of the Markov chain, yielding completely different outputs. Still, the hope is that the MCMC estimators themselves remain reasonable.

Even under ideal circumstances the MCMC estimators follow a central limit theorem which means the estimator +/- standard error intervals cover the true value only with probability. You run enough chains and those intervals won’t capture the true values eventually.

When the diagnostics fail? Well then all bets are off. The exact consequences of a model failure depend on the intimate details of the model and the mode of failure, and without being able to accurately estimate expectation values we don’t have any way to faithfully quantify what the consequences might be.

These points are all extremely important for both testing and setting user expectations. We can only test what the algorithms guarantee and we want users to expect only what the algorithms can guarantee. Let’s go deeper into that with regard to the step size adaptation and divergences.

You Better Watch Your Step

It is a great miracle in MCMC that for some algorithms we can construct (approximately) universal optimality criteria. In other words there is a function of some tuning parameters for which a specific value should yield nearly optimal performance. For Metropolis algorithms, or Metropolis-adjacent algorithms, these often take the form of expected acceptance probabilities.

So even if we don’t know what the optimal step size will be we can find the step size that approximately optimizes performance by tuning until that optimally criteria achieves the desired value. The relationship between the optimality criteria output and the step size? That is highly nonlinear and depends on the specific structure of the model – about the only thing we know is that it’s monotonic. So even under ideal circumstances there’s no way to predict what will happen to the step size when the adaptation targets a different value of the optimality criterion. In Stan parlance, even if you know what epsilon is needed to achieve adapt_delta = 0.8 you can’t say anything about what step size is needed to achieve adapt_delta = 0.9 other than it will be smaller.

One condition of these criteria is that they presume the same ideal circumstances that we presume to get those central limit theorems. In other words any adaptation algorithm that targets these criteria will be valid only under those ideal circumstances, and hence only if we don’t see any diagnostic failures. Lots of things go wrong when the sampling breaks down.

First and foremost the sampler is no longer (asymptotically) unbiased and so the definition of performance has to change, which means that the nominal target value, like adapt_delta(epsilon) = 0.8 will no longer be relevant.
Secondly the exact functional dependence of the optimality criterion on the step size will change, so even if you could map out the relationship under ideal circumstances that relationship wouldn’t persist to non-ideal circumstances. And finally even if none of those were a concern we have to recognize that the optimality criterion, again adapt_delta in Stan, is an expectation value that we estimate. When the diagnostics start to fail we know we can’t trust our estimates, so even if the optimal value was still adapt_delta = 0.8 then we wouldn’t be able to achieve that in practice because we wouldn’t have accurate estimates of adapt_delta itself.

Phew. Again, once the diagnostics fail the behaviors can be…beyond belief.

(What’s So Funny 'Bout) Peace, Love, and Understanding

Let’s bring this all together and discuss the consequences for divergences and the adaptation procedure.

When you see divergences you know something weird is going on, and you have no way to quantify what the consequences of that weirdness will be on your MCMC estimation. As discussed above once the adaptation target itself changes, the optimal value is no longer well-posed, and, and we can’t estimate the adaptation statistics all that well anyways.

At this point everything about the adaptation criteria changes interpretation; we are no longer trying to achieve a certain value of the acceptance statistic. At this point we just want to reduce the step size in a desperate attempt to maybe sort of give the sampler enough resolution to explore without diverging, and for that we just need to rely on the monotonic relationship between the optimality criterion and the step size which fortunately is about the only thing that persists when divergences arise. Again, once divergences arise the meaning of adapt_delta changes and the goal for setting epsilon changes.

Will this help? Sometimes. Maybe.

By reducing the step size we make the numerical integration more accurate, which might bring the integrator into a region of stability where divergences no longer occur. Or it might let the sampler explore a pathological region even better and increase the number of divergences. Everything will depend on the exact structure of your model.

Now how can we tell what is happening and whether we can trust the results? Well the only signal we have is the residual divergences, but those divergences are an empirical measure – the sampler divergences only if it happens to explore near a pathological region. The smaller the pathological region the less likely we are to sample it, and hence the less likely we are to see a divergence even if one is there. Reducing the step size might shrink a pathological region but at the same time it might make it harder to see.

Yes you might have fewer divergences for the default 4000 iterations with the same seed, but that’s no guarantee that you won’t see divergences if you run longer or use a different seed (it’s always fun to see a divergence pop up in a course exercise, but only on one person’s computer).

Ultimately this is why no divergences will never be a sufficient condition for identify ideal circumstances for HMC estimation. Empirically they are very sensitive, but at some point you can no longer tell if you’re not seeing any because the pathological regions are tiny or because the pathological regions don’t exist (and at the moment we don’t have good enough theory to say whether small enough pathological regions lead to negligible error).

Anyways, if you see divergences and increase adapt_delta a little bit to achieve a smaller epsilon and the divergences go away then quickly then you’re probably okay, especially if you don’t need more effective samples than what you recovered in the fit. That general behavior will persist with the new adaptation, but it’s impossible to quantify how the numbers will change.

If you push adapt_delta up higher and you keep getting a few divergences so you go even higher, and then higher, and so on. Well in that case you should be more worried. Not only will the sampling start becoming more expensive, it’s less likely that you’re actually fixing anything in a robust way.

Watching the Detectives

Let me try to summarize the state of things to put the discussion of the pull request into context.

The sampler in Stan makes only a few actual guarantees, and weak guarantee at that. We basically just have that

  • If there are no diagnostic failures then you have some evidence that the MCMC CLT holds and the MCMC standard errors will have the expected statistical coverage.

  • If you see divergences then all bets are off, but you might be able to fix things by decreasing the step size. The step size can be decreased by increasing the stepsize adaptation target above 0.8.

That’s it. In particular how much the stepsize adaptation target has to be increased is not guaranteed to be any given number of even constant. The only contract we provide to the client (you, the user) is that adapt_delta ad epsilon have some monotonic relationship, so you can shift one systematically by moving the other in the opposite direction.

None of those guarantees change with this pull request. Under ideal circumstances the MCMC estimators and standard errors will still have the expected coverage, and if divergences appear then they might be able to be moderated by decreasing epsilon which can be done by increasing adapt_delta.

I know that these guarantees are not particularly comforting, but that’s what we have (and we should be lucky to even have those!). At the same time I know that users are prone to building up empirical expectations; if we accommodate these then we freeze the sampler in an arbitrary state. We have to be clear about what we guarantee and try to temper any inaccurate expectations that might arise.

Anyways, I hope that clears things up a bit.

5 Likes

Thanks Mike for the very nice text, but that did not answer the question. That text is great addition to explain the complexities for those who didn’t already know them, but I think it would be good to quantify the order of magnitude of the change. I guess I could pick that from that many, many comparisons pdf, but I assume you have the numbers somewhere and thus I assume it would be easier for you to estimate that summary. It might be sufficient to give just one value, e.g. stepsizes tend to be 1.4 times bigger over different posteriors and adpat_delta values (no need to discuss how that would affect potential divergences). I thought showing that also for some different adapt_delta’s would better illustrate the points you make verbally in your text and that there is no simple functional relationship how the change affects stepsizes. Maybe show the result even for couple posterior to illustrate that this depends on posterior and thus there is no guaranteed behavior for other posteriors.

Checking the thread backwards, I found @bbbales2 results, which could summarized as “1-1.5 times increase in step size leading usually to a smaller increase in ESS/s, and rarely makes difference whether there are observed divergences”

2 Likes

I’m confused because these seem to be contradictory statements.

Because the relationship between step size and adapt delta is nonlinear and model-dependent, and because we don’t have a “standard model”, there is no quantification that we can quote in general. And that’s in the ideal circumstances where we have an MCMC CLT. You can see the large variation in different adapted step sizes in the original tests that I ran.

@bbbales2’s last message was regarding what happens when divergences arise, so outside of those ideal circumstances. There we have even fewer guarantees and even more potential for variation.

Sorry if my English is confusing.

When users are going to ask about the change I’m going to refer to your long detailed text, but I’m also going to give a short answer (I guess others would give some short answer, too). I know you have difficulties writing anything short, but I’m asking your help to write an elevator pitch for this, which would be short and as informative as possible. Currently my choices are

  • “1-X times increase in step size leading usually to a smaller increase in ESS/s, but rarely makes difference whether there are observed divergences”
  • “Based on Mike’s detailed discussion, Mike’s prior for the expected increase in step size, ESS/s and effect to the probability of seeing divergences is uniform”

I hope you could help writing a better summary (max four sentences, which are not just pointing to a four hundred sentence description). It’s fine to have YMMV disclaimer.

For most users, especially those using RStan and PyStan where the sampler parameters like step size and adaptation stat aren’t immediately accessible, all they will see is a small performance boost. So the general statement would just be

If you don’t see any diagnostic warnings then your performance will improve by a small but nontrivial amount.

The change will be pretty opaque to most users.

For those who know more about the sampler you can say

The previous step size adaptation was too conservative, but this new change makes the adaptation more aggressive and closer to optimal under ideal circumstances. You’ll see the step size increase for a fixed adapt_delta, and typical fewer leapfrog steps per iteration without compromising the effective sample size.

Those who have encountered divergences may need to retune.

If you have used adapt_delta to suppress divergences in the past then you may need to increase adapt_delta further to achieve the same suppression.

Or for those familiar with HMC,

The more aggressive adaptation will result in higher step sizes for a fixed adapt_delta. If you’ve used adapt_delta to reduce the integrator step size and suppress divergences in the past, then you’ll need to increase it further to achieve the same lower step size and divergence suppression.

How’s that?

4 Likes

Just pick a model and run the tests. It’s easy. Find something where divergences go away with the old code as you increase adapt_delta and repeat the exercise with the new code.

You’re telling me it shouldn’t be a problem. I believe you (and I believed you before I asked for these tests). I am only asking you to double check this. I don’t even care what the model is as long as it meets the specification above.

There are a fair number of reasons I think it’s fine that I ask for these tests. If anything the testing process I’ve put this through has been lenient. The e-mail you sent me on October 7th seems to suggest you disagree. But aside from that, as someone reviewing your pull requests, I want to see more self-reflection in the pulls you are making and less combativeness of the reviews. I also want you to respect the opinions of people other than yourself on these issues.

I don’t like that you are resisting review. Your uturn pull request from a couple months ago actually would have broken Stan: https://github.com/stan-dev/stan/pull/2800#issuecomment-525889921. And it’s okay that there was a bug. We all screw stuff up. What is important is we were able to work together and find it (those were your tests I was running, too, so I’m not taking credit here). Getting immediate pushback on some incredibly simple tests that we all expect should pass and are easy to do is really frustrating given this close history.

It is not fun when you’re doing a careful review to find problems and be told that they aren’t problems: https://github.com/stan-dev/stan/pull/2800#discussion_r314950803. I’m trying to help this stuff go smoothly and make sure there aren’t mistakes that come back to bite us later.

If you think there’s something wrong with this process, I’m happy to have this go before the SGB. If they tell me I’m being unfair in asking you to do tests, that’s fine, I’ll respect that. Again, the tests are easy, and I fully expect them to pass.

Also, is this code in a repo anywhere? This is the pull that I should be watching, right: https://github.com/stan-dev/stan/pull/2790 ? The last that was updated was August 8th and I think the Uturn stuff got finished on the 30th so I think it’s out of date.

1 Like

Ben, this isn’t about whether or not the check you requested is easy or not, it’s about whether the test is relevant to the Stan algorithm API. The responsibility of a reviewer is to review a pull request within the context of the existing testing policy, not whatever testing policy they deem fit. Reviewers do not have absolute authority within any particular PR to set that policy, let alone the authority to demand anything from submitters outside of the context of that PR.

As I have extensively written about in this thread, the particular behavior once a divergence occurs is not guaranteed by the HMC API. There is no guarantee that increasing adapt_delta will reduce the number of divergences and, if it does, by how much it will reduce the number of divergences. Empirical expectations of divergent behavior based on baseline behavior are irrelevant and do not factor into the testing burden of current code or future code. I am speaking as the algorithm lead here, however poorly the TWG leads are currently defined, not as the submitter of this PR.

Adding tests outside of API-guaranteed behavior does nothing to improve the stability of the code but absolutely puts undue burden on submitters and slows development. It also sets precedent that reviewers can impose tests of undone behavior, implicitly forcing behavior expectations without discussion or consensus. Showing that increasing adapt_delta reduces divergences on one model isn’t lenient – because that isn’t even guaranteed behavior its success would be irrelevant to other models and if anything give a false sense of security.

Within the scope of the API-guaranteed behavior is a valid MCMC CLT provided that diagnostics are not flagged (even though that can’t be fully guaranteed we assume it for the tests). So extensive CLT tests up to millions of effective samples that go beyond what any other MCMC code has been tested again? Yes, please! Welcomed and appreciated. Extensive performance tests using a definition of performance (effective sample size per gradient) valid within the CLT? Also yes, please!

Another behavior that has been built into the adaptation that should be universal is that increasing adapt_delta decreases the step size in expectation. @avehtari asked for that and I was happy to show the results for a test model to verify the analytic result that the current statistic bounds the baseline statistic.

Divergent behavior occurs in the iteration of the step size, the integrator, and a model. Not only is none of that code touched in this PR, the behavior itself has no guarantees because of the variation due to the model as I have repeated multiple times.

Speaking as a nominal tech lead and someone on the SGB, this matter is not within the scope of the SGB. The tech local tech lead, myself, sets local policy and conflicts between tech leads go up to the technical working group director.
Again I am not speaking as a PR submitter in my critique of the test you are requesting, I am speaking as the algorithm tech lead.

Exactly the problem. I can absolutely find a model which achieves the behavior for which you are looking, but that says nothing about the validity of the old code or the new code which is why it is an irrelevant test.

It’s been sitting on a new branch for the past few weeks. I created a new pull request for the branch at https://github.com/stan-dev/stan/pull/2836.

So let me go ahead and summarize this.

  1. As a reviewer, I asked for a test (Request for Volunteers to Test Adaptation Tweak)
  1. As a developer, you suggested I could do this test myself (Request for Volunteers to Test Adaptation Tweak)
  1. I asked you to do this test yourself. As a reviewer I felt I’d already done too much testing here (Request for Volunteers to Test Adaptation Tweak)
  2. As a tech lead you sent me an e-mail suggesting that the tests I was asking from developer-you were inappropriate
  3. I contested that and provided evidence why developer-you is fallible (Request for Volunteers to Test Adaptation Tweak) and said that if tech-lead-you wanted to contest this we could talk to the SGB.
  4. Tech-lead you reiterated that again this wasn’t about developer-you
  1. Regarding the SGB, SGB-you stepped in to say this also didn’t involve the SGB, and in fact the conflict between tech-lead-you and myself (not a tech lead) should be resolved by the local tech lead (you, in this case).
  1. Tech-lead-you verbally approved the pull request from developer-you and requested that it be merged by the technical working director @seantalts (https://github.com/stan-dev/stan/pull/2836). There were no external code reviews at this point. It was merged 30 minutes later.

In paragraph form, I asked, as the reviewer, you to verify the behavior of adapt_delta. This was based on input from @Bob_Carpenter in this thread, and everyone who came to the Stan meeting (which is publicly announced Request for Volunteers to Test Adaptation Tweak). In that meeting, everyone basically thought the changes sounded good (@jonah and @bgoodri in particular), and @jonah agreed we should at least verify adapt_delta behavior. @ariddell and @ahartikainen never responded. I think that addresses this complaint from tech-lead-you:

So I asked developer-you to verify the adapt_delta behavior. And I understand that divergences are an indicator that things have gone awry, but considering these changes to the timestep adaptation are most different for problems with divergent trajectories, I judged that it would be prudent to test the behavior of the new adaptation on problems where divergent transitions arise, regardless of the statistical validity of the output. Regarding the central limit theorem, I don’t believe warmup as implemented in Stan is reversible, and so I don’t know how much these arguments about CLT and whatnot apply (given that timestep adaptation is happening in warmup).

That’s what lenient means.

If the relation between adapt_delta and timestep is monotonic, then why didn’t we simply lower the default adapt_delta target with the previous implementation? If we’re only testing on models without divergences, what’s the problem? We can achieve the same difference right?

Oh okay, that’s cool (Request for Volunteers to Test Adaptation Tweak).

MODERATOR EDIT: The following section contained personal attacks which have been removed (by @martinmodrak)

So then all this difficult work you’ve done [here] is to avoid writing tests for a pull request that could be replaced by lowering the adapt_delta defaults?

I just did the tests myself in 10 minutes. They seem fine to me. I reject this pull request.

5 Likes

I’ve been following this thread and the last two weeks have been rough. Several times I felt that I should say something but I also felt that my interpersonal skills aren’t great and I couldn’t explain people’s points better than they had. Today I finally I decided to intervene and spent several hours composing a response only to discover that it was all moot as the PR had already been merged.

In case anyone cares, here is the comment I was going to post five hours ago

Just to be clear. The last line means: Don’t merge.

@seantalts: There’s no public record of the reasoning behind the merge. Can you explain your reasons here?

6 Likes

Sorry for not answering after my initial comment.

I don’t have anything meaningful to say about this matter, too busy with the PyStan (2.20, Sundials)

1 Like

Hey all, let’s remember to keep the tone civilized. Regardless of our disagreements we want this board to be welcoming and remember that at the end of the day, we’re all on the same team here and have extremely similar goals for scientific computing.

@betanalpha is the domain lead for algorithms and has the authority; he just needed me to actually press the merge button. I don’t have any comments on the contents of the PR.

If you or @bbbales2 want to challenge his decision there with the TWG Director (me for now, maybe we change the whole structure soon but let’s talk about the current process) you should feel free. I’ll take your post as essentially starting the ball rolling on that; if that’s not your intent let me know.

I’m not familiar with this code but I’m familiar with this kind of debate; it’s come up a bunch of times in other modules and in many other jobs as a software engineer - we’re essentially arguing about the API of Stan w.r.t. the main HMC sampler. For the Math library, the last time we had an argument about the API we went ahead and formally defined it (late; it had been required since the decision to adopt semantic versioning across the entire project, but never fleshed out).

Since we don’t have an HMC or algorithms API yet, I’d like to ask @betanalpha as the lead to draft one that we can get comments on in the next few days and use to decide if this PR breaks anything in that API and requires a major version bump. If it does, we can pull it out of this 2.21 release before Friday.

I think once the API is specified explicitly, it will be much more clear what kinds of tests are in-scope for any given PR and what are out-of-scope. These tough decisions must still be made but ideally not totally on a PR-by-PR basis. Judgment will always be involved, but everyone will be happier if there are policies and guidelines for these decisions so everyone knows what to expect.

The algorithms are a really tricky part of Stan because it really requires a lot of diligence and mathematical background to understand why the code works the way it does. That said, I’m a huge fan of finding ways to empirically test anything we want to promise to our users to make sure we know when it changes (and we can make those tests easy to update if need be). So maybe we can find a compromise here where we figure out what we really think is important to promise users and come up with some innovative new tests that verify that.

1 Like

Thanks for the response.

I don’t mean to challenge the decision. I don’t believe this breaks anything, and there’s time for bug fixes if anything comes up.

This discussion was going great until the deadline was on the horizon. But when @betanalpha started to push hard @bbbales2 pushed back harder and the review process failed. I was hoping you’d explain why getting this in in 2.21 release was important enough to bypass the usual review process.

I think this discussion can wait until after the release, when there’s a general discussion on whether the feature freeze worked.

No, you’re complicit in this. @nhuurre and I clearly laid out our objections yesterday. The pull request hasn’t been reversed, and Michael has made no apologies.

Let’s save the effort here. Michael clearly has abused his power in the project to force this through. I don’t think there’s any question what conclusion he will reach given the additional power to write an API governing his pull request.

That’s how tech-lead Michael frames the discussion, which is incomplete. This discussion is about how developer-Michael refused to respect a request for simple tests made by the Stan community and how he pulled the levers of power granted him as tech lead and SGB member to avoid this.

Up until two days ago Michael had the option of resolving this with a couple simple tests (that again, I have since done and verified they passed).

That’s not the situation now. I’d like to move to have Michael removed from his position as tech lead and his position on the SGB. I want that removal to be permanent.

I believe his behavior in a leadership role on this project is counterproductive to cultivating a collaborative development environment. If Michael is so willing to work so hard to fight the collective opinion of the project on this simple issue, then how is he going to treat someone new to the community? While I respect his technical contributions, I do not believe we owe him a governmental role because of these. In my opinion he should be welcome to remain a member of the community outside of a leadership role and be proud of his contributions to the project.

Even though the technical argument underlying this specific issue is simple, I believe his willingness to unilaterally ignore the testing requests made in this thread by @Bob_Carpenter, made in the Stan meeting with consultations with @jonah and @bgoodri, and subsequently made in this thread by myself, @lauren, and @avehtari is indicative of an underlying dismissive attitude towards the Stan community. If anyone I’ve listed there thinks I’ve misrepresented them, please say so. My only regret in that process is I meant to ask @paul.buerkner his opinion as well but didn’t. (Also don’t worry @ahartikainen, I know you’re busy and I just took the non-response as an indication that you and @ariddell would agree with the outcome of the review).

I do not want to let this slide because I am someone on the project with few to no personal or professional connections to Michael. He has no power over me, and so I believe it is important for me to act against someone so willing to abuse his position.

I’d like to follow up with the rest of the @Stan_Development_Team and anyone else in the community (apologies for the weekend notification :D). Do you have any complaints about Michael’s behavior as a tech lead or member of the SGB, or is Michael’s position as a tech lead or his position on the SGB making it difficult for you to contribute to the Stan community?

This issue isn’t going to be resolved tomorrow, so please enjoy the rest of your weekend and take your time responding. I don’t want to rush on a decision like this, and if the community largely feels like I’m overreacting, I’m happy to apologize to Michael and stand aside. If you don’t want to respond publicly, you can e-mail me privately at bbbales2@gmail.com.

2 Likes

That’s a question for @betanalpha - I didn’t make that decision. Tech leads can make decisions like that over their domain and then the TWG Director can overturn them if need be, and both of those events should be fairly rare. I really just pressed the button when he asked me to because we don’t permissions set up granularly enough in github to allow @betanalpha to have pressed the button himself.

I am on a break from Stan-related things right now, but just because it’s probably not clear unless someone says it, you need to contact the SGB directly if you want them to see this. (I am no longer on the board, but I was in the meeting that set up that not necessarily that well publicised email address). Their email is board@mc-stan.org.

All,
I apologize for locking this thread because it limits the ability of people to express their opinions. Please give the SGB a chance to respond. I spoke with @bbbales2 and he is ok with this. Please feel free to communicate with board@mc-stan.org about this.

We have a SGB meeting Monday, I will update after that.

I have attempted to un-flag the relevant post Request for Volunteers to Test Adaptation Tweak but failed so it remains grey.

thanks

Breck

The SGB is aware of the complaint and is forming a committee…

Breck

1 Like