@mitzimorris I don’t think I have any more clarity. One of the pull requests that was merged changes behavior in Stan: something makes results not reproducible between a few days ago and now.
This is before it gets out to the interfaces. This is just within Stan.
And that changes everything. I’m ok with change, but we should really know when it happens and have justification for why. Right now, things like this are caught by the performance test, which maybe we should call a regression test or something like that.
@seantalts, any reason it’s been disabled on Jenkins? It would have signaled that Stan is changing behavior. That’s something we should know. It doesn’t happen often, but when it does, we should really know why we’re doing it.
correct, the stride is different.
The reason is that the RNG used in transformed data gets stride 0 of the seed.
The chains now get the stride corresponding to their ID; it used to correspond to (ID - 1).
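To make that concrete, here’s a rough before/after sketch; the `DISCARD_STRIDE` name and exact form are my reconstruction, not the actual diff:

```cpp
// Hypothetical reconstruction of the stride change, not the actual diff.
// Each stride is a disjoint segment of 2^50 draws of one PRNG sequence.

// Before: chain 1 landed on stride 0, colliding with the RNG draws
// used for transformed data.
rng.discard(DISCARD_STRIDE * (chain_id - 1));

// After: stride 0 is reserved for transformed data; chain i gets stride i.
rng.discard(DISCARD_STRIDE * chain_id);
```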
sorry, there should have been more unit tests for creating the RNG. Regarding the performance test: don’t worry about it. We don’t have real regression tests; it would have been really helpful if it were still active in Jenkins for exactly this reason, just as a sanity check until real regression tests are available.
Sorry, I didn’t realize that’s what you were telling me with the off-by-one comment. When I first read it, I thought there was one more or one fewer call to the RNG, not that the stride had changed.
do we need more documentation beyond the current doc for stan/services/util/create_rng.hpp?
* Creates a pseudo random number generator from a random seed
* and a chain id by initializing the PRNG with the seed and
* then advancing past pow(2, 50) times the chain ID draws to
* ensure different chains sample from different segments of the
* pseudo random number sequence.
*
* Chain IDs should be kept to values larger than zero to ensure
* that the draws used to initialize transformed data are not
* duplicated.
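For reference, here is a minimal sketch of what that doc describes, assuming the boost::ecuyer1988 engine; the body is my paraphrase of the header, not a verbatim copy:

```cpp
#include <boost/random/additive_combine.hpp>  // boost::ecuyer1988

#include <cstdint>

// Sketch of stan/services/util/create_rng.hpp as described above;
// details may differ from the actual source.
inline boost::ecuyer1988 create_rng(unsigned int seed, unsigned int chain_id) {
  // pow(2, 50) draws per stride: far more than any run consumes, so
  // strides act as non-overlapping segments of one PRNG sequence.
  static constexpr std::uintmax_t DISCARD_STRIDE
      = static_cast<std::uintmax_t>(1) << 50;
  boost::ecuyer1988 rng(seed);
  // Stride 0 (chain_id == 0) is used for transformed data, so chain
  // IDs should start at 1.
  rng.discard(DISCARD_STRIDE * chain_id);
  return rng;
}
```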
@syclik regarding the performance test that was part of the Stan PR build - it had been failing for a while and I think you suggested just disabling it, so I did. I left the daily performance test intact, figuring that would be enough to catch changes given our relatively low volume of pull requests. I will make it email me on failure, since I hadn’t noticed it start failing.
I see you have a PR to update the “golden master” for the performance test after Mitzi’s RNG update (thanks!), and that Stan pull requests are now running the performance tests (that was you, right?). Is there other stuff going on or anything you think I should look at? Just trying to catch up on this topic.
Yeah, I re-enabled it! The failure was a config issue, so I adjusted the job to clean out the workspace before builds.
I think it’s the wrong test to be running to catch these sorts of failures, but we don’t have anything else in place. We used to run the BUGS examples years ago and we ran into other problems there too.
Either way, we just need to know when we’re changing behavior.
I think the biggest thing needed there is a test that, if you run multiple chains with the same seed and different chain IDs, the transformed data generated with RNGs is identical across chains (see the sketch below).
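Something like this sketch (GoogleTest-style; I’m assuming create_rng’s namespace and signature from the header quoted above):

```cpp
#include <stan/services/util/create_rng.hpp>
#include <gtest/gtest.h>

// Sketch: the stride-0 RNG that transformed data uses must produce the
// same draws regardless of which chain is running, while distinct
// chain IDs must start in distinct strides.
TEST(CreateRng, transformedDataMatchesAcrossChains) {
  unsigned int seed = 12345;
  // Transformed data always draws from stride 0 of the seed.
  auto td_rng_a = stan::services::util::create_rng(seed, 0);
  auto td_rng_b = stan::services::util::create_rng(seed, 0);
  for (int i = 0; i < 100; ++i)
    EXPECT_EQ(td_rng_a(), td_rng_b());

  // Different chain IDs should land in different segments; a first-draw
  // collision is astronomically unlikely for a healthy generator.
  auto chain1 = stan::services::util::create_rng(seed, 1);
  auto chain2 = stan::services::util::create_rng(seed, 2);
  EXPECT_NE(chain1(), chain2());
}
```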
This change intentionally broke the seed behavior on chains, and it will break any exact tests that were based on the seed.
Having tests that are based on exact RNG seeds replicating across releases seems too strict to me. Or, as @seantalts suggested, we need a way to generate new expected behavior, in which case, all the test does is provide a flag that something’s changed. I think that may be what @syclik wants, but I’m not sure.
As far as Stan goes, there’s no reason to keep things seed-to-seed compatible across releases. We just need to preserve the statistical properties, which are far more subtle and require something like @betanalpha’s testing framework.
I’m looking at this from the point of view of getting us to have faster development with more contributors. We can’t get there if things break between pull request merges. Right now, we have unit tests, but no real integration / regression tests that would let us just move forward with a little more ease.