Timed iteration updates and backward compatibility standards

@bgoodri is proposing to change the way Stan services report iterations from an iteration-based update to a time-based update. The proposal is to change the behavior of refresh = N to now mean every N seconds. The proposal is a default of 30s. So maybe an initial report (?) and then the next update at 30s by default. @andrewgelman has been requesting behavior like this.

There’s been ongoing discussion on the issue tracker for Ben’s PR, but I thought I’d bring it up here, which is where we should be having this discuss. So I’m going to move over my last summary from the issue; the quoted bits are from Ben.



The backwards compatibility standard you are applying is inconsistent with the standard of backward compatibility that has been applied in the past.

It’s OK for us to change standards. What we need to do is make them more clear at any given time.

I do not think the right thing to do is try to enforce bit-level release-over-release backward compatibility.

In addition to the chain ID hokey pokey, we have changed the number of digits in the print output, changed the way n_eff is calculated, and are changing the way R_hat is calculated. In each of those situations, we didn’t introduce another flag that defaults to false in order to keep doing things the old way unless the user flips it on; we forced people to do things the new way because it is better.

It might be helpful to separate out the concerns here.

I think the ID “hokey pokey” was a bug that got through due to insufficient testing.

The n_eff thing changed the estimator, not what’s being estimated. So nothing in the API changes meaning.

The R-hat replacement’s a work in progress. If it ever does come up, I will be OK deprecating R-hat support and adding a second diagnostic that we recommend. I’m also OK changing the default output in various ways.

My concerns about this particular pull request are:

  1. Backward compatibility
    refresh=N changes meaning from N iterations to N seconds

  2. Non-deterministic behavior
    With the same seed, get different number of iteration updates.

I’m OK with (2) but wanted to check with @wds15, since he’s been the one that’s been most constrained by “reproduciblity” at this level. My guess is that you (Ben) are right and moving to a non-deterministic behavior is OK.

I’m less OK with (1), which is why I think the simple expedient of allowing control by both iterations and time makes sense. Like with potential R-hat replacements, I think the responsible thing to do would be to deprecate the per-iteration method and add the timed method and then remove per-iteration in a major release.

The progress reports is even less consequential than those things. The only harm is if someone has an old script where they specify refresh = 1000 or something in order to keep the progress output minimal and then they run their old script again, which gets reinterpreted as “refresh every 1000 seconds”, then the progress output will be even more minimal than what they might have been expecting based on Stan’s past behavior. But that isn’t a big deal.

I agree the danger is minimal here. The “harm” in my mind is the perceived (and actual) instability of our interfaces.

I’d like to hear from more devs here than just the two of us on this issue before we decide how to proceed within stan-dev/stan.


The way we’ve set up project governance, @bgoodri will have the final say on the behavior of RStan. But what we’re talking about here is the stan-dev/stan services, which I believe @syclik is managing. So I’d very much like to get @syclik’s sign off on a change this big before merging.

How about having both time and the number of iterations? I would appreciate not getting the current 100 lines of output (with 4x12 progress reports) for cases where the sampling takes just a few seconds, so adding minimum time interval would be great. I would probably like something like 10s time interval, but if the sampling would take, e.g., 1000s I would get much more output (100 progress reports per chain), but I would be also happy with the current 12 lines (appearing every 100s) so I would still like to be able to define also max number of reports (which can be made with refresh now).

Can you elaborate why you would want a fixed number of progress reports (in some cases)?

To me, the progress report is fundamentally a psychological thing for someone who is working interactively. If you are running an unattended script or knitting a RMarkdown document, then you should be ignoring or hiding the progress output anyway. It just takes as long as it takes.

So, if we are talking about someone who is staring at the screen waiting for NUTS to finish, why does that person have a utility function that is maximized at N progress reports? We know that a lot of people have disutility for N progress reports when NUTS takes less than a few seconds anyway. And when NUTS takes more than a few seconds, if you know that the refreshes occur every 30 seconds or whatever, you can roughly extrapolate how much longer it is going to take (which is the real question) from how many iterations were completed in the previous 30 seconds.

This strikes me as expanding the scope of backward compatibility for the sake of preserving backward compatibility. Microsoft, for example, has a reputation for going to great length to preserve backwards compatibility so that its business customers could continue to do things the old way and those things would continue to work, without forcing those business customers to pay the time cost of learning how to do things the new way.

That is not what is at issue here. If you run old code, it still works in the sense that it produces random draws from the same posterior distribution as it did before (even if they are not the same draws). The difference is something very ancillary gets printed at different intervals.

If you add arguments so that you can control refreshing both by time lapsed and by number of iterations, then you are adding a cost on the users of having to figure out how to specify them. Also, their old Stan programs have to be recompiled.

For someone whose old script specifies, refresh = 1000, there is minor cost being put on them in that the code would reinterpret that as “refresh every 1000 seconds”, which is not what the user wanted. But that is a small cost compared to everyone automatically getting the better time-based behavior.

For the applications I am interested in the results from Stan have to be exactly reproducible. That means under the same version, seed, data, chain and machine I want to get the exact same result (numerically, not just statistically).

This has always been the case and we should at least allow this behavior (even if the default is different). From what I understand is being proposed, we should ensure that in the future we can still

  1. get a fixed number of iterations (or a minimal number of iterations the very least)
  2. the output precision should not be affected by some random thing like Neff (we really need a binary output format)

If we are only talking about when things are being printed … that is not of any concern (at least to me). Since Stan is so verbose in its output I usually anyway suppress any output and parse Stan’s output myself for things to take care of.

… if you do time-based outputs then we should probably rethink what we refer to with time given parallelism. So far we report CPU time used, but with threading enabled this does not reflect the actual wallclock time which is shorter due to the use of multiple CPUs. With MPI this is again different; then the CPU time of the root process is OK to use as a measure of time. Still, we should probably switch to wallclock time as this is what matters to users.

I would like the option to limit or completely suppress the output in R. I hate that it fills up my buffer with information that is not useful later.

That is part of the PR. If you specify refresh = 0, then it not only does not print “Iteration: …”, it also does not print “Elapsed Time: …” at the end.

But I interpret this as being consistent with my general view that most users just want less progress output (as opposed to more options) because the progress output is not very useful. For short runs, it is irrelevant and for long runs, an iteration-based refresh scheme does not pertain to the real question of “How much longer is this going to take?”

The PR is keyed off walltime.

My rstan install is broken right now (xcode is having feelings today), so I can’t check this, but would refresh=0 make the output completely silent (except for fit warnings, which should come out as R warnings not a text stream so they can be handled coherently)? Because that’s the option that I want.

Yes. The refresh = 0 eliminates most of the progress output currently and the PR would make refresh = 0 eliminate all of the output. But that is pretty ancillary to Bob’s point about adding a refresh_seconds argument or something like that.

@wds15 is the one I was most worried about in terms of having non-deterministic lines of output.

I think that’s right. This and earlier discussion makes me think we need to segregate our output into that we care about for backward compatibility and that we don’t.

They’re sort of jumbled up right now but we’re in the middle of refactoring that whole bundle of output writers and loggers with @sakrejda and I think this will be an important distinction to keep in mind. Then these decisions won’t need any thought going forward.

That’s a good point. I really don’t like the added complexity. The alternative is changing the meaning of an existing parameter. But it’s a parameter that only controls an output that we don’t care about backward compatibility for.

You need both iterations and time to estimate how much time is remaining. The reason I like to see iterations on long-running processes is to know that something hasn’t crapped out on the back end.

This should get easier with consolidated loggers, but it sounds like the PR in question also has a fix for that. That wasn’t clear to me in the PR text, but then I didn’t read the code.

Bottom Line

Thanks for bearing with the discussion of the implications of this change.

I’m OK with this change. I’d still like to hear from @syclik, but let’s wait at least until the regular week starts before merging.

I do like the idea of updating every N seconds rather than N iterations.
Andrew

Just narrow point since I was tagged: I’ll be able to outline my bit this weekend if not before. I think we should be able to specify what’s compatible and what’s not for minor version bumps w.r.t. output.

Thanks for putting in the time and effort, @bgoodri and @Bob_Carpenter.

These are my opinions:

  • The co-opting of the refresh argument is a change to backwards compatibility (I think by construction); what @Bob_Carpenter suggested in the PR for actions to take to make it backwards compatible would change this conversation.
  • We don’t have to maintain that behavior of printing every X iterations. Things that are sent to the logger should be allowed to change occasionally. (@bgoodri – RStan is the worst offender fixing Stan behavior by relying on the log and parsing it… if we’re able to change things around, I’d like to change a few more things that will break RStan; should we just expect RStan to keep up?)
  • We shouldn’t be switching this type of behavior often. Are we sure this is the way we want to do it? Are there other alternatives? (Time + iterations? Some other format of printing?) This is the time to actually think about this. We left it the way it was due to maintaining behavior
  • Personal request: can you keep snark down on the PR? If you can’t see it from the PR, I can highlight it directly. I recognize it as a joke, but it wouldn’t really be clear to those just stumbling into the repo.

My suggestion:

  • I have a slight preference for @Bob_Carpenter’s suggestion of maintaining backwards compatibility, but I’m ok with the pull request, if it works.
  • I would really like a couple things:
    • documentation. Please document the meaning of refresh.
    • tests. Please add a unit test that actually triggers the refresh. And inspect that – please make sure that outputs what you think it does. If that means designing a function so it can be tested, that’d be preferable. For example, instead of what you have going on, I could imagine changing the write_timing() function to accept the elapsed time and have that be tested separately. I couldn’t see any tests that actually triggered the writer; just changes to existing tests to make sure it didn’t trigger the output.
    • is there any other information we want at these times?
    • is there a reason refresh is an integer? Is that seconds?

This isn’t in Stan; this is in RStan! I fixed it a while ago and my change was pulled out for some reason, so I thought it was done intentionally. All you have to do is construct the writers with a prefix. It’s an easy fix. (It should probably be tested so it doesn’t disappear again.)

Can everyone stop talking about backwards compatibility as if it were an end unto itself and start talking about the cost placed on users due to changes? Microsoft, Oracle, etc. do not preserve backward compatibility for the sake of purity. They are two of the least pure companies out there. They strive to maintain backward compatibility because they have many corporate customers that often have in-house code that was written by long since departed developers. The cost of that custom code breaking or otherwise having to retrain their employees is much larger than the expected benefit of new features. Thus, corporate customers would not upgrade to the latest thing Microsoft, Oracle, etc. were selling. None of that is an issue in this situation. Microsoft, Oracle, etc. change the appearance of things all the time because that does not usually break custom code (that is using an API properly) or require retraining employees.

What is the cost to users of reinterpreting refresh in seconds rather than in iterations? No one besides me is addressing this question, but so far, I have come up with

  1. It would break any external monitoring tools that assume a fixed number of progress reports. As far as we know, no such monitoring tools exist. This is in part because we don’t really expose it as part of an API, no one cares about the progress reports enough to write something like that, and because sane monitoring tools such as the watch program poll based on time elapsed or modification time to a file.
  2. It would alter the behavior but not the validity of old code that specifies the refresh argument rather than accepting its default. But that is rare and in the cases where people are specifying the refresh argument in their scripts it is usually to set it to a much higher value than the default in order to see fewer progress reports because iteration-based progress reports are not very useful and those people are happier the fewer progress reports appear on their screen.

In addition, no one has yet been able to fill in the blank of “Iteration-based progress reports are good for _____?” That is presumably because they are not useful for the primary question users are asking, which is “How much longer will this long run take to finish?”, unless those users are sitting there with a stopwatch to figure out how much time it is taking to complete refresh iterations.

What is the cost to users of adding something like a refresh_seconds argument while keeping the existing refresh argument that pertains to the iteration interval in order to preserve backward compatibility?

  1. Existing compiled models throw an error message saying “The refresh_seconds argument is not recognized”. To those users whose previously working code now does not return any draws from the posterior distribution, we are essentially saying “Tough shit. We are backwards compatible. Ask the maintainer of that code to upload a new version so that it gets recompiled or downgrade to a previous version of rstan that does not try to pass a refresh_seconds argument to your compiled code.”
  2. (Re)compiled models that are called with the default arguments are going to continue with the iteration-based progress reports that are widely disliked.
  3. Andrew is going to object to typing out refresh_seconds = 20 every time in his books because it is distracting and burdensome to readers of those books who are trying out the examples.
  4. People have to figure out the difference between refresh (iterations) and refresh_seconds and remember that the arguments are not, for example, refresh (seconds) and refresh_iterations.
  5. When we later deprecate the refresh argument, people are going to post messages asking if their results are okay because they got a deprecation message that they don’t understand. Of course, when we ultimately remove the refresh argument, old code that specifies refresh will break for no real good reason but that is considered the responsible way to break backwards compatibility because it gives customers time to mitigate the costs imposed on them.

Stan is not akin to the Linux Kernel developers’ policy of not breaking a userspace application with changes to the kernel. That kind of an absolute policy is perhaps appropriate for upstream projects that have a substantial downstream (i.e. Linux servers). There is very little downstream of the Stan Library besides end-users and what little downstream there is consists primarily of the interfaces that we maintain.

1 Like

Not sure if this is off topic, since this thread is mostly about standards, but I thought I’d point out my own implementation of a time-based update system. Demo code attached below and vid of what the output looks like here. Since I’m not modifying rstan itself, I had to hack a system where I watch the sample csv files, which you presumably wouldn’t need to do if implementing this in rstan directly, but the actual update part where I simply use \r characters and utils::flush.console() to overwrite the display on updates should be applicable and hopefully fairly cross platform?

ezStan_progress_demo.R (883 Bytes)

2 Likes

This is an important point for users. Changes have to be big improvements to be worthwhile, which is why I think this one or something like it is going to be OK. But like @syclik I don’t want to then have to change again any time soon.

I wasn’t trying to point fingers about where the tests went missing :-)

Yes, please.

I would like to second that. Let’s all play nicely, as frustrating as this whole process can be.

I agree this is the right framing of the question.

I thought we’d already concluded that the cost here wasn’t prohibitively high and we were going to classify outputs into “logging” and “inference, etc.” and allow the former to change fluidly and require backward compatibility to the extent possible for the latter.

To refresh the discussion, so to speak, the cost here for users that used refresh or paid attention to output is in realizing behavior changed, then figuring out what the new behavior is and how to adapt their scripts to work with it. For example, imagine we go forward with the proposed 30s default update. Now imagine a user with a Stan program and data that fits in 30s or 1m or 2m. Before, they’d get lots of default output and now it looks like Stan is broken because nothing’s happening for 5s, 10s, … This is roughly forever for software.

This is the problem Google has with Hangouts, etc., and a lot of the phone app space has—the button placement and menus keep changing to no end other than to make me relearn where everything is. As another anecdote, the OS-es (maybe just Windows or Mac) used to have modes that would reorder dropdown menus based on usage, but as far as I know, everyone turned it off because it was too confusing.

Iteration-based reports are useful. The relevant question is “when are they better than timed reports?”. I don’t think anyone’s standing up for iteration-based reporting per se. I think nobody would’ve said “boo” if the iteration-based approach was deprecated in favor of a timed approach without changing the meaning of API parameters.

By changing the way things behave, a second relevant questio is added, namely “is the disruption to users worth it?”.

As I said, I’m OK with this one. I’m going to be even more OK if we can triage features so we don’t have to have this discussion again.

This is an even better way to frame this.

By “existing compiled models”, you mean ones for CmdStan?

Indeed. Defaults would be best.

I’m concerned that the book they’re writing won’t run when it’s released, much less two years down the road. In my discussions with Andrew, this hasn’t been a priority. I don’t think ARM ever had working code (in the sense that you could download it from the ARM repo and run it).

Only if they want the new behavior. The upside (or downside, if you really hate iteration-based updates) is that everything still works the same as it did before if they don’t figure that out.

That’s a good argument for not doing this.

This needs to be qualified. There are a lot of packages depending on Stan and RStan. It’s just that they don’t depend on the logging behavior to keep working properly.

This would be a whole different discussion if something substantial was being changed.

Thanks, @mike-lawrence. Not off topic. It migth be the right solution to the problem Andrew cares about the most, which is having to scroll after the updates are done.

Does anyone know how to make this approach of overwriting portable in R? We used to do it, then gave up as it was breaking the GUI or Rstudio or the terminal or Windows or something like that.

On OS X, it works as expected in both the standard R GUI app and RStudio. When using terminal.app to run R, the flush causes a newline, so there’s no overwriting, but all this means is that we’re back to the old behaviour of one line getting printed per update. I’ll fire up a couple VMs and report back on Windows and Linux behaviour…