What could cause the RNG to diverge in two processes?

I was debugging a parallel implementation and found that the BaseRNG diverged in two processes that were initialized with the same seed. The output below shows the random numbers generated inside the base_nuts::transition() function call:

[0] Iteration:  11 / 210 [  5%]  (Sampling)
[1] Iteration:  11 / 210 [  5%]  (Sampling)
[0] this->rand_uniform_: 0.331977
[1] this->rand_uniform_: 0.331977
[0] this->rand_uniform_: 0.825977
[1] this->rand_uniform_: 0.825977
[0] this->rand_uniform_: 0.539221
[1] this->rand_uniform_: 0.539221
[0] this->rand_uniform_: 0.862824
[1] this->rand_uniform_: 0.115614
[0] this->rand_uniform_: 0.237655
[1] this->rand_uniform_: 0.977498
[0] this->rand_uniform_: 0.0929974
[1] this->rand_uniform_: 0.302045

[0] denotes process 1 and [1] denotes process 2; there is some point-to-point communication between them. The first three draws agree and the streams diverge at the fourth. The only explanation I can come up with is that one of the two processes made an RNG call that the other did not. Could there be any other cause?
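To illustrate the suspected failure mode, here is a minimal sketch in plain C++. It uses std::mt19937 as a stand-in for the actual BaseRNG, and the helper names draw_stream and first_divergence are made up for this example. Two streams start from the same seed; one of them makes a single extra, discarded draw before the fourth value, which reproduces the pattern in the log above (three matching values, then disagreement).

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Draw n uniforms from a generator seeded with `seed`.
// If extra_call_at is in [0, n), one additional draw is made and discarded
// just before that index, mimicking a process that calls the RNG once more
// than its peer. (Hypothetical helper, not part of any library.)
std::vector<double> draw_stream(unsigned seed, std::size_t n,
                                std::ptrdiff_t extra_call_at = -1) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::vector<double> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    if (static_cast<std::ptrdiff_t>(i) == extra_call_at) {
      (void)unif(rng);  // the unmatched call
    }
    out.push_back(unif(rng));
  }
  return out;
}

// Index of the first position where the two streams differ,
// or the common length if they agree everywhere.
std::size_t first_divergence(const std::vector<double>& a,
                             const std::vector<double>& b) {
  const std::size_t n = std::min(a.size(), b.size());
  for (std::size_t i = 0; i < n; ++i) {
    if (a[i] != b[i]) return i;
  }
  return n;
}
```

Comparing draw_stream(42, 6) with draw_stream(42, 6, 3) should report the first mismatch at index 3. Once the streams are offset by one call they stay out of step for every later draw.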

Did you figure this out?

No, there shouldn’t be a reason for this to drift… but, are you running on the same architecture with the exact same compiler flags?

The two processes run on different cores of the same machine. The only reason I can think of is that the RNG is not being called consistently across the two processes. I’m running a debug build to see if I’ve missed anything. To be continued.

The debugger found the cause: the two processes take different branches at the log_sum_weight_subtree > log_sum_weight check because they are not keeping their trees consistent, a bug that I introduced. This in turn causes an unequal number of RNG calls, hence the divergence.
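For anyone hitting something similar, here is a simplified sketch of how a branch can change the number of RNG calls. It is not the actual sampler code. CountingUniform, accept_branchy, and accept_uniform_cost are hypothetical names, and std::mt19937 stands in for the real RNG. The second function shows one possible way to make RNG consumption independent of the branch taken. That is an assumption for illustration and not necessarily the fix proposed in the issue.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Hypothetical wrapper that counts how many uniforms have been drawn,
// which makes uneven RNG usage across processes easy to spot.
struct CountingUniform {
  std::mt19937 rng;
  std::uniform_real_distribution<double> unif{0.0, 1.0};
  std::size_t calls = 0;

  explicit CountingUniform(unsigned seed) : rng(seed) {}

  double operator()() {
    ++calls;
    return unif(rng);
  }
};

// Branch-dependent consumption: the RNG is touched only when the
// subtree weight does not exceed the current weight.
bool accept_branchy(double log_w_subtree, double log_w, CountingUniform& u) {
  if (log_w_subtree > log_w) return true;  // no RNG call on this branch
  return u() < std::exp(log_w_subtree - log_w);
}

// Branch-independent consumption: exactly one draw per call, whichever
// branch is taken, so the call count never depends on the weights.
bool accept_uniform_cost(double log_w_subtree, double log_w,
                         CountingUniform& u) {
  const double draw = u();
  if (log_w_subtree > log_w) return true;
  return draw < std::exp(log_w_subtree - log_w);
}
```

With accept_branchy, two same-seeded processes whose weights disagree end up with different call counts and their next draws no longer match. With accept_uniform_cost the counts stay equal even when the processes take different branches. Logging the calls counter on each process at every iteration is a cheap way to locate the first iteration where they drift apart.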

The root of the issue is filed here as a design flaw.


Cool. Thank you for filing that issue. Think you could help with a fix?

The fix is one line; I’ll update the issue later today. I was actually surprised that the original issue is still open.


Thanks much.

I’m closing this thread and moving to the issue for further discussion if needed.