What could cause the RNG to diverge in two processes?

#1

I was debugging a parallel implementation and found that the BaseRNG diverged in two processes initialized with the same seed. Shown below are the random numbers generated in the base_nuts transition() call:

[0] Iteration:  11 / 210 [  5%]  (Sampling)
[1] Iteration:  11 / 210 [  5%]  (Sampling)
[0] this->rand_uniform_: 0.331977
[1] this->rand_uniform_: 0.331977
[0] this->rand_uniform_: 0.825977
[1] this->rand_uniform_: 0.825977
[0] this->rand_uniform_: 0.539221
[1] this->rand_uniform_: 0.539221
[0] this->rand_uniform_: 0.862824
[1] this->rand_uniform_: 0.115614
[0] this->rand_uniform_: 0.237655
[1] this->rand_uniform_: 0.977498
[0] this->rand_uniform_: 0.0929974
[1] this->rand_uniform_: 0.302045

[0] indicates process 1 and [1] indicates process 2; there are some point-to-point communications between them. The only possibility I can come up with is that one of the two made an RNG call without the other doing the same. Could there be any other cause?
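To illustrate what I mean, here is a minimal sketch (not Stan code; two std::mt19937 generators stand in for the per-process BaseRNG) showing how a single extra draw on one process permanently desynchronizes otherwise identically seeded generators:

```cpp
#include <iostream>
#include <random>

int main() {
  // Two generators standing in for the per-process RNG, same seed.
  std::mt19937 rng0(12345), rng1(12345);
  std::uniform_real_distribution<double> unif(0.0, 1.0);

  // Both processes draw in lockstep: identical streams.
  for (int i = 0; i < 3; ++i)
    std::cout << unif(rng0) << " vs " << unif(rng1) << "\n";

  // Process 1 makes one extra draw that process 0 skips...
  unif(rng1);

  // ...and every subsequent pair of draws differs, exactly like the log above.
  for (int i = 0; i < 3; ++i)
    std::cout << unif(rng0) << " vs " << unif(rng1) << "\n";
  return 0;
}
```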


#3

Did you figure this out?

No, there shouldn’t be a reason for this to drift… but are you running on the same architecture with the exact same compiler flags?


#5

The two processes are on different cores of the same machine. The only reason I can think of is that the RNG is not being called consistently in the two processes; a way to check that is sketched below. I’m running a debug build to see if I’ve missed anything. To be continued.
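One cheap way to test the "inconsistent calls" hypothesis is to count engine invocations per iteration on each rank and compare the counters instead of the draws themselves. A sketch of such a counting wrapper (hypothetical, not part of Stan):

```cpp
#include <cstdint>
#include <random>

// Hypothetical counting wrapper: behaves like the wrapped engine but
// records how many times the engine has been invoked, so two ranks can
// compare counters after each iteration.
template <class Engine>
struct counting_rng {
  using result_type = typename Engine::result_type;
  Engine engine;
  std::uint64_t calls = 0;

  explicit counting_rng(result_type seed) : engine(seed) {}

  static constexpr result_type min() { return Engine::min(); }
  static constexpr result_type max() { return Engine::max(); }

  result_type operator()() {
    ++calls;
    return engine();
  }
};

int main() {
  counting_rng<std::mt19937> rng(2024);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  for (int i = 0; i < 5; ++i) unif(rng);
  // On each rank, print (or exchange) rng.calls after every iteration;
  // the first iteration where the counts differ is where the streams split.
  return 0;
}
```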


#6

The debugger found the cause: the two processes step into different branches of the log_sum_weight_subtree > log_sum_weight check because they are not keeping their trees consistent, a bug I introduced. This in turn leads to an uneven number of RNG calls, hence the divergence. A sketch of the mechanism is below.
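To make the mechanism concrete, here is an illustrative sketch of that kind of data-dependent branch (modeled loosely on the multinomial proposal acceptance in base_nuts; names and structure are illustrative, not the exact Stan source). The uniform draw is consumed only on one side of the comparison, so if the two ranks disagree about log_sum_weight_subtree > log_sum_weight, their draw counts immediately go out of step:

```cpp
#include <cmath>
#include <random>

// Illustrative stand-in for the proposal-acceptance step: the uniform
// draw happens only in the else branch, so any disagreement between
// ranks on the comparison shifts all later draws, as seen in the log.
bool accept_proposal(double log_sum_weight_subtree, double log_sum_weight,
                     std::mt19937& rng) {
  if (log_sum_weight_subtree > log_sum_weight) {
    return true;  // accept deterministically; no RNG draw consumed
  }
  std::uniform_real_distribution<double> rand_uniform(0.0, 1.0);
  double accept_prob = std::exp(log_sum_weight_subtree - log_sum_weight);
  return rand_uniform(rng) < accept_prob;  // one draw consumed here
}

int main() {
  std::mt19937 rng0(7), rng1(7);
  // Rank 0 takes the deterministic branch, rank 1 does not:
  accept_proposal(1.0, 0.0, rng0);   // no draw on rank 0
  accept_proposal(-1.0, 0.0, rng1);  // one draw on rank 1
  // From here on, rng0 and rng1 produce different streams.
  return 0;
}
```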


#7

The root of the issue is filed here as a design flaw.


#8

Cool. Thank you for filing that issue. Think you could help with a fix?


#9

The fix is one line; I’ll update the issue later today. I was actually surprised the original issue is still open.


#10

Thanks much.


#11

I’m closing this thread; further discussion, if needed, can move to the issue.


closed #12