Stan behavior changed in the last few days

Did anyone merge anything into Stan that knowingly changed behavior?

It started here (see the change log):
http://d1m1s1b1.stat.columbia.edu:8080/job/Daily%20Stan%20Performance/887/

I’m looking here at the recently closed pull requests:

@mitzimorris does the rng pull request make Stan run differently?


that PR changes the generated model’s 3-arg constructor so that it takes an unsigned int seed instead of an instantiation of an RNG.

there are issues filed on the interfaces.
I have a branch for CmdStan that fixes this.

@mitzimorris I don’t think I have any more clarity. One of the pull requests that was merged changes behavior in Stan. Something makes it not reproducible between a few days ago and now.

This is before it gets out to the interfaces. This is just within Stan.

Doesn’t the RNG PR make the seeds for the chains be off by one from before? Or are you saying that currently it’s not run-to-run reproducible?

It is run-to-run reproducible. It’s not reproducible between commit b5f6457 and what it is today ee8f23e.

I believe I found the change. The stride is different.

And that changes everything. I’m ok with change, but we should really know that it happens and have justification why it happens. Right now, things like this are caught by the performance test, which maybe we should call a regression test or something like that.

@seantalts, any reason it’s been disabled on Jenkins? It would have signaled that Stan is changing behavior. That’s something we should know. It doesn’t happen often, but when it does, we should really know why we’re doing it.

correct, the stride is different.
the reason is that the RNG used in transformed data now gets stride “0” of the seed,
and the chains get a stride corresponding to their ID - previously it corresponded to (ID - 1).

I didn’t realize this was part of any test.

sorry, there should have been more unit tests for creating the rng. Regarding the performance test: don’t worry about it. We don’t have real regression tests. It would have been really helpful if it was still active in Jenkins for this reason. Just as a sanity check until we have regression tests available.

New issue created:

That’s what I was trying to tell you.

Sorry, didn’t realize you were trying to tell me that with the off-by-one comment. When I first read that, I thought there was one more or one less call to the rng, not that the stride was changed.

No problem, just trying to save you some time.


do we need more documentation beyond the current doc for stan/services/util/create_rng.hpp?

 * Creates a pseudo random number generator from a random seed
 * and a chain id by initializing the PRNG with the seed and
 * then advancing past pow(2, 50) times the chain ID draws to
 * ensure different chains sample from different segments of the
 * pseudo random number sequence.
 *
 * Chain IDs should be kept to larger values than one to ensure
 * that the draws used to initialize transformed data are not
 * duplicated.

@syclik regarding the performance test that was part of the Stan PR build - it had been failing for a while and I think you suggested just disabling it, so I did. I left the daily performance test intact, figuring that would be enough to figure things out with our relatively low volume of pull requests. I will make it email me on a failure, since I hadn’t noticed it start failing.

I see you have a PR to update the “golden master” for the performance test after Mitzi’s RNG update (thanks!), and that Stan pull requests are now running the performance tests (that was you, right?). Is there other stuff going on or anything you think I should look at? Just trying to catch up on this topic.

Yeah. I reenabled it! It was a failure with the config, so I adjusted it to clean out the workspace before builds.

I think it’s the wrong test to be running to catch these sorts of failures, but we don’t have anything else in place. We used to run the BUGS examples years ago and we ran into other problems there too.

Either way, we just need to know when we’re changing behavior.


it would be nice to have a definition of what is/isn’t expected behavior - from there we could put in place some kind of unit tests.

+1. I’ll try to formalize it.

I think the biggest thing needed there is to test that if you run multiple chains with same seed and different chain ID, then the transformed data with rngs is the same across chains.

This change intentionally broke the seed behavior on chains. And it will also break any exact tests that were based on seed.

Having tests that are based on exact RNG seeds replicating across releases seems too strict to me. Or, as @seantalts suggested, we need a way to generate new expected behavior, in which case, all the test does is provide a flag that something’s changed. I think that may be what @syclik wants, but I’m not sure.

As far as Stan goes, there’s no reason to keep things seed-to-seed compatible across releases. We just need to preserve the statistical properties, which are far more subtle and require something like @betanalpha’s testing framework.

That would be great!

We’re in agreement.

I’m looking at this from the point of view of getting us to have faster development with more contributors. We can’t get there if things break between pull request merges. Right now, we have unit tests, but no real integration / regression tests that would let us just move forward with a little more ease.