Different results with Stan 2.21.0

I recently updated rstan from 2.19.3 to 2.21.1, which means I’m now using Stan 2.21.0 instead of 2.19.1. I noticed a number of changes in my outputs:

  1. The differences in parameter means seem to be equivalent to changing the seed, i.e., it seems like a different sample was generated, not necessarily a better or worse one. Is this expected from the update? I see from the release notes that the TBB is now used. Could that be the cause? I notice a difference even with a single chain (though not always); see the sketch after this list for one way to quantify seed-to-seed variation.

  2. The effective sample sizes and Rhats are a lot worse for some parameters (e.g., ESS going from 47 to 9). This doesn’t seem to be simply due to a different sample being drawn (i.e., point 1 above), since then we would expect a roughly even mix of small increases and decreases. Have there been changes to the ESS calculation that could have caused this?
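A minimal sketch of the kind of comparison meant in point 1, where `"model.stan"` and `stan_data` are placeholders for your own model and data: rerun the model under a few seeds on one version to see how much the means and diagnostics move from the seed alone.

```r
# Minimal sketch: "model.stan" and `stan_data` are placeholders for your setup.
library(rstan)

mod  <- stan_model("model.stan")   # compile once, reuse across seeds
fits <- lapply(c(1, 2, 3), function(s)
  sampling(mod, data = stan_data, seed = s, chains = 4, refresh = 0))

# The spread of means, n_eff, and Rhat across seeds is the baseline against
# which to judge the differences seen after the version change.
for (f in fits)
  print(summary(f)$summary[, c("mean", "n_eff", "Rhat")])
```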

The sampler has been optimized between releases.

The ESS calculations also changed, as I recall.

Thanks. So the differences in sampling are to be expected due to optimization of the sampler? Is this the TBB change?

Do you happen to know what changes were made to the ESS calculations? Or which pull request or commit contained the changes?

Links to the docs for the new Rhat are here: New R-hat and ESS. It’s more conservative (it picks up some problems the old Rhat didn’t).

Thanks, so these changes to Rhat are in Stan 2.21.0?

2.21 is from October 2019 and that post is from March 2019, so my guess is yes, but I’m not 100% sure.

You could compute your Rhats with this package: https://github.com/stan-dev/posterior which definitely has the new calculations, and see if the results match.
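For example (a sketch, assuming `fit` is an existing stanfit object from the model in question), the rank-normalized diagnostics can be computed straight from the draws:

```r
# Sketch: `fit` is assumed to be an existing stanfit object.
library(posterior)

# rstan's as.array() returns iterations x chains x parameters, which is the
# layout as_draws_array() expects.
draws <- as_draws_array(as.array(fit))

# Rank-normalized Rhat plus bulk and tail ESS for every parameter.
summarise_draws(draws, "rhat", "ess_bulk", "ess_tail")
```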

No. The sampler algorithm changed. I don’t understand why you think the TBB (aka Threading Building Blocks) would change sampling in any way. The TBB is only there to allow seamless execution using threads, nothing more.

OK. After further investigation with different seeds it seems that while there is a difference in the samples, the ESS and Rhat are not consistently worse than before.


@wds15 Can you please point to the relevant entry in the release notes, section of the documentation, Discourse thread, or relevant GitHub commits that discuss the sampler change? Your update to rstan caused our production models to start spitting out different answers, so we went to the release notes https://github.com/stan-dev/stan/releases/tag/v2.21.0. I don’t see any mention of a sampler change; the TBB entry was just the best guess as to what caused the diffs.

The 2.19.x -> 2.21.x transition accumulated six months of changes in the libraries and more than a year’s worth of changes in the rstan interface, although I doubt TBB makes much difference unless your models were using map_rect. Also, the way in which the compilation works is very different now. The slightest difference in the binary will result in different draws than before, so that is not a surprise to me. And there are going to be even more noticeable changes in the 2.21.x -> 2.24.x transition.

If you think the draws have a different distribution than before, that is more of a concern. Unfortunately, since both RStan and PyStan were stuck on 2.19.x for a long time, now is the first opportunity for the 2.21 release to be widely stressed.
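One rough way to screen for a distributional change (a sketch, assuming draws for one parameter were saved from both versions as numeric vectors `draws_old` and `draws_new`):

```r
# Screening only: MCMC draws are autocorrelated, so the p-value is approximate;
# use this to flag gross shifts, not as a formal test.
ks.test(draws_old, draws_new)

# Comparing a few quantiles side by side is often just as informative.
rbind(old = quantile(draws_old, c(0.05, 0.50, 0.95)),
      new = quantile(draws_new, c(0.05, 0.50, 0.95)))
```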


I think this went into Stan, but did not make it into the release notes (where it should have been).

Here is the respective git commit (which had a small bug that was fixed later):

Here is the PR:

And here is the long thread leading to this change:

> our update to rstan caused our production models to start spitting out different answers,

Whoa! Like the models give quite different answers? More details would be interesting… and you should ping people from the Discourse thread I linked.


Isn’t this the fourth bullet point in the “new features” section of the release notes? It was not super advertised, that is true.

Whoops! I read over it… but you are right, that is the entry that refers to it. My bad.

Thanks for the additional info, all! We’re satisfied that the draws appear to have the same distribution as before and that the changes we’re seeing are the same differences we’d see when changing the RNG seed, which is in line with @bgoodri’s comment that any slight change to the binary will result in different draws. My team always needs to investigate and notify clients any time we lose exact reproducibility. I wouldn’t have thought a change titled “Add additional no-u-turn checks” would have this effect.

I just assume that everything will affect the values of the realized draws, but if we think a change will affect the distribution of the realized draws, then it will be prominently advertised. For something like an additional no-u-turn check, all it takes is for that check to return true on one iteration, and the whole rest of the chain will have different realizations. But really, all it takes is for some calculation to return a value that differs in the 15th decimal place from what it was before, and that can cause the chain to go somewhere different.
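To illustrate that last point (plain R, nothing Stan-specific): floating-point addition is not associative, so merely reordering a sum changes the last bits, and in an iterative sampler that is enough to send the chain down a different trajectory.

```r
x <- (0.1 + 0.2) + 0.3
y <- 0.1 + (0.2 + 0.3)
x == y                      # FALSE
print(x - y, digits = 17)   # difference on the order of 1e-16
```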

A version change will probably always imply that you lose exact reproducibility. Tracking down why, each time, would be quite an undertaking.
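One practical habit (a suggestion, not official guidance) is to record the toolchain alongside every production run, since the seed alone no longer pins down the draws across versions:

```r
# The same seed only reproduces draws bit-for-bit if the versions match too.
run_info <- list(
  rstan_version = as.character(packageVersion("rstan")),
  stan_version  = rstan::stan_version(),
  r_version     = R.version.string,
  seed          = 12345  # the seed actually passed to sampling()
)
str(run_info)
```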
