I think the most important thing to test there is that if you run multiple chains with the same seed and different chain IDs, then the transformed data generated with RNGs is the same across chains.
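To make the property concrete, here's a minimal sketch of that kind of test in Python rather than Stan's C++. The seeding scheme and the names `make_rngs`, `transformed_data`, and `run_chain` are hypothetical and only for illustration; this is not how Stan actually derives its RNG streams.

```python
import random


def make_rngs(seed, chain_id):
    # Hypothetical scheme: the transformed-data stream depends on the seed
    # only, while the sampling stream depends on both seed and chain ID.
    data_rng = random.Random(f"data-{seed}")
    sampler_rng = random.Random(f"sampler-{seed}-{chain_id}")
    return data_rng, sampler_rng


def transformed_data(rng, n=5):
    # Stand-in for a transformed data block that calls RNG functions.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def run_chain(seed, chain_id):
    # Returns the simulated transformed data and a few sampler draws.
    data_rng, sampler_rng = make_rngs(seed, chain_id)
    data = transformed_data(data_rng)
    draws = [sampler_rng.random() for _ in range(5)]
    return data, draws


data_1, draws_1 = run_chain(seed=42, chain_id=1)
data_2, draws_2 = run_chain(seed=42, chain_id=2)

# Same seed, different chain IDs: transformed data must match exactly.
assert data_1 == data_2
# The sampler streams themselves should still differ between chains.
assert draws_1 != draws_2
```

Note that the test compares chains against each other within one run, so it doesn't pin down any particular RNG output and would survive a change in how seeds are used.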
This change intentionally broke the seed behavior on chains, and it will also break any exact tests that were based on a specific seed.
Having tests that are based on exact RNG seeds replicating across releases seems too strict to me. Or, as @seantalts suggested, we need a way to generate new expected behavior, in which case, all the test does is provide a flag that something's changed. I think that may be what @syclik wants, but I'm not sure.
As far as Stan goes, there's no reason to keep things seed-to-seed compatible across releases. We just need to preserve the statistical properties, which are far more subtle and require something like @betanalpha's testing framework.
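As a rough illustration of testing a statistical property instead of exact seed-to-seed output, here's a small Python sketch. It is far simpler than @betanalpha's testing framework and isn't meant to stand in for it; the helper names `mean_within_tolerance` and `draws_for_seed` and the 5-standard-error threshold are my own assumptions.

```python
import math
import random


def mean_within_tolerance(draws, expected_mean, expected_sd, z=5.0):
    # Checks that the sample mean lies within z standard errors of the
    # expected mean, assuming independent draws with known sd.
    n = len(draws)
    sample_mean = sum(draws) / n
    standard_error = expected_sd / math.sqrt(n)
    return abs(sample_mean - expected_mean) <= z * standard_error


def draws_for_seed(seed, n=10_000):
    # Stand-in for sampler output; here just standard normal draws.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# The check should hold for any seed, not only for one recorded seed.
for seed in (1, 2, 3):
    assert mean_within_tolerance(draws_for_seed(seed), 0.0, 1.0)
```

A test like this keeps passing when the RNG stream changes between releases, as long as the distribution being sampled stays the same. Real MCMC output is autocorrelated, so an actual test would need an effective sample size in place of `n`.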