I recently updated rstan to 2.21.1 from 2.19.3, which means that I’m using Stan 2.21.0 instead of 2.19.1. I noticed a number of changes in my outputs:
1. The differences in parameter means seem to be equivalent to changing the seed, i.e., it seems like a different sample was generated, not necessarily a better or worse one. Is this expected from the update? I see from the release notes that TBB is now used. Could that be the cause? I notice a difference even with 1 chain (there is not always a difference).
2. The effective sample sizes and rhats are a lot worse for some parameters (e.g. ESS going from 47 to 9). It doesn’t seem like this is simply due to a different sample being taken (i.e. point 1 above), since in that case we would expect a roughly even number of small increases and decreases. Have there been changes to the ESS calculation that could have caused this?
No. The sampler algorithm changed. I don’t understand why you think the TBB (Threading Building Blocks) would change sampling in any way. The TBB is only there to allow seamless execution using threads. Nothing more.
OK. After further investigation with different seeds it seems that while there is a difference in the samples, the ESS and Rhat are not consistently worse than before.
@wds15 Can you please point to the relevant entry in the release notes, section of the documentation, discourse thread, or relevant GitHub commits that discuss the sampler change? Your update to rstan caused our production models to start spitting out different answers, so we went to the release notes https://github.com/stan-dev/stan/releases/tag/v2.21.0. I don’t see any mention of a sampler change. The TBB entry was just our best guess at what caused the diffs.
The 2.19.x -> 2.21.x transition accumulated six months of changes in the libraries and more than a year’s worth of changes in the rstan interface, although I doubt TBB makes much difference unless your models were using map_rect. Also, the way in which the compilation works is very different now. The slightest difference in the binary will result in different draws than before, so that is not a surprise to me. And there are going to be even more noticeable changes in the 2.21.x -> 2.24.x transition.
If you think the draws have a different distribution than before, that is more of a concern. Unfortunately, since both RStan and PyStan were stuck on 2.19.x for a long time, now is the first opportunity for the 2.21 release to be widely stressed.
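One rough way to tell "different draws" apart from "different distribution" is to express the change in a posterior mean in units of its Monte Carlo standard error. The sketch below is only an illustration in plain Python, not anything RStan provides: `batch_means_se` and `mean_z_score` are hypothetical helper names, the standard error comes from a simple batch-means estimate, and the draws are simulated stand-ins for what you would extract from the old and new fits.

```python
import math
import random
import statistics


def batch_means_se(draws, n_batches=20):
    """Monte Carlo standard error of the mean via non-overlapping batch means."""
    size = len(draws) // n_batches
    means = [
        statistics.fmean(draws[i * size:(i + 1) * size])
        for i in range(n_batches)
    ]
    return statistics.stdev(means) / math.sqrt(n_batches)


def mean_z_score(old, new):
    """Difference in means, in units of its combined Monte Carlo standard error."""
    se = math.hypot(batch_means_se(old), batch_means_se(new))
    return (statistics.fmean(new) - statistics.fmean(old)) / se


# Simulated stand-ins (assumption): two runs from the same distribution with
# different seeds, plus one run whose distribution really has shifted.
old_rng, new_rng = random.Random(1), random.Random(2)
old = [old_rng.gauss(0.0, 1.0) for _ in range(4000)]
new = [new_rng.gauss(0.0, 1.0) for _ in range(4000)]
shifted = [x + 1.0 for x in new]

z_same = mean_z_score(old, new)         # seed-level noise: a small |z|
z_shifted = mean_z_score(old, shifted)  # genuine shift: a very large z
print(f"same distribution: z = {z_same:.2f}")
print(f"shifted by 1.0:    z = {z_shifted:.2f}")
```

A |z| of a few units or less across parameters is what a seed change alone would produce. Consistently large values would point to a change in the distribution itself. The cutoff you use is a judgment call, and with many parameters a few moderately large values are expected by chance.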
Thanks for the additional info, all! We’re satisfied that the draws appear to have the same distribution as before and that the changes we’re seeing are the same differences we’d see when changing the RNG seed, which is in line with @bgoodri’s comment that any slight change to the binary will result in different draws. My team always needs to investigate and notify clients any time we lose exact reproducibility. I wouldn’t have thought a change titled “Add additional no-u-turn checks” would have this effect.
I just assume that everything will affect the values of the realized draws, but if we think a change will affect the distribution of the realized draws, then it will be prominently advertised. For something like an additional no-u-turn check, all it takes is for that check to return true on one iteration and the whole rest of the chain will have different realizations. But really all it takes is for some calculation to return a value that differs in the 15th decimal place from what it was before, and that can cause the chain to go somewhere different.
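To make the "one check on one iteration" point concrete, here is a minimal sketch in Python. It is a toy random-walk Metropolis sampler on a standard normal, not Stan's NUTS, and the `flip_at` argument is a hypothetical stand-in for an extra check that returns a different answer on a single iteration. Both runs use the same seed.

```python
import math
import random


def metropolis(n_draws, seed, flip_at=None, step=1.0):
    """Random-walk Metropolis on a standard normal target.

    If flip_at is given, the accept/reject decision is inverted on that one
    iteration only (a stand-in for a single check returning a new answer).
    """
    rng = random.Random(seed)
    x = 0.0
    draws = []
    for i in range(n_draws):
        proposal = x + step * rng.gauss(0.0, 1.0)
        # log p(proposal) - log p(x) for a standard normal target
        log_ratio = 0.5 * x * x - 0.5 * proposal * proposal
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        accept = math.log(u) < log_ratio
        if i == flip_at:
            accept = not accept
        if accept:
            x = proposal
        draws.append(x)
    return draws


baseline = metropolis(5000, seed=1)
changed = metropolis(5000, seed=1, flip_at=100)

n_diff = sum(a != b for a, b in zip(baseline, changed))
mean_base = sum(baseline) / len(baseline)
mean_changed = sum(changed) / len(changed)
print(f"identical before the flip: {baseline[:100] == changed[:100]}")
print(f"draws that differ: {n_diff} of {len(baseline)}")
print(f"means: {mean_base:.3f} vs {mean_changed:.3f}")
```

The two runs match exactly up to iteration 100 and part ways from there, even though the seed and the target are unchanged, while the summaries stay close. That is the same pattern as swapping the seed: different realizations of what is, for practical purposes, the same distribution.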