I was debugging a parallel implementation and found that the BaseRNG diverges between two processes that are initialized with the same seed. Shown below are the random numbers generated in base_nuts’ transition() function call.
[0] indicates process 1 and [1] indicates process 2; there is some point-to-point communication between them. The only possibility I can come up with is that one of the two made an RNG call without the other doing the same. Could there be any other cause?
The two processes run on different cores of the same machine. The only reason I can think of is that the RNG is not being called consistently. I’m running a debug build to see if I’ve missed anything. To be continued.
The debugger showed the cause: the two processes step into different branches of the log_sum_weight_subtree > log_sum_weight check because they are not keeping their trees consistent, a bug that I introduced. This in turn causes an unequal number of RNG calls, hence the divergence.