Reproducibility across machines - hardware/C++ compiler dependent?

tlyim · June 26, 2019, 11:58am

I am running the same data and Stan code with CmdStan 2.19.1 on two clusters. Even with the same random seed (seed=3412), the sampling results are not exactly the same: both the mean estimates and the number of divergences differ.

Q: Is it necessary to have completely identical hardware and C++ compiler in order to exactly replicate the sampling results?

Results on the faster cluster:

#=====================================================================
# In folder: 20190625.130259_6083422 [75,200,10,1,**3412**,600,200,0.98,11,dense_e,4,J90c]"
#===================================================================== 
4 chains: each with iter=(200,200,200,200); warmup=(600,600,600,600); thin=(1,1,1,1); 3200 iterations saved.
Warmup took (11852, 12628, 11884, 13546) seconds, 13.9 hours total
Sampling took (2372, 3629, 3075, 3553) seconds, 3.51 hours total
                        Mean      MCSE    StdDev         5%        50%        95%     N_Eff   N_Eff/s     R_hat
lp__                1.87e+04  7.59e-01  1.33e+01   1.86e+04   1.87e+04   1.87e+04  3.08e+02  2.44e-02  1.01e+00
accept_stat__       9.74e-01  2.01e-03  5.47e-02   8.99e-01   9.90e-01   1.00e+00  7.42e+02  5.87e-02  9.99e-01
stepsize__          1.54e-02  1.38e-03  1.96e-03   1.32e-02   1.64e-02   1.81e-02  2.01e+00  1.59e-04  2.83e+14
treedepth__         9.63e+00  2.76e-01  4.87e-01   9.00e+00   1.00e+01   1.00e+01  3.11e+00  2.47e-04  1.68e+00
n_leapfrog__        8.87e+02  1.10e+02  2.47e+02   5.11e+02   1.02e+03   1.02e+03  5.07e+00  4.01e-04  1.31e+00
divergent__         2.50e-03      -nan  5.00e-02   0.00e+00   0.00e+00   0.00e+00      -nan      -nan  9.99e-01
energy__           -1.85e+04  1.10e+00  1.81e+01  -1.85e+04  -1.85e+04  -1.85e+04  2.72e+02  2.15e-02  1.02e+00
sd_y                8.01e-02  2.22e-05  6.05e-04   7.91e-02   8.00e-02   8.11e-02  7.40e+02  5.86e-02  9.99e-01
mu_u1               9.26e-02  2.98e-04  9.62e-03   7.68e-02   9.31e-02   1.09e-01  1.04e+03  8.24e-02  9.97e-01
mu_alpha            3.89e-02  7.01e-05  2.01e-03   3.57e-02   3.88e-02   4.23e-02  8.23e+02  6.52e-02  1.00e+00
beta                6.04e-01  2.72e-04  7.55e-03   5.91e-01   6.04e-01   6.17e-01  7.71e+02  6.11e-02  1.00e+00
theta               1.47e-01  1.16e-04  3.23e-03   1.42e-01   1.47e-01   1.52e-01  7.75e+02  6.13e-02  1.00e+00
sd_season           9.64e-02  1.41e-04  4.21e-03   8.93e-02   9.64e-02   1.03e-01  8.96e+02  7.09e-02  1.01e+00
mu_season[1]       -1.15e-01  3.02e-04  1.04e-02  -1.32e-01  -1.15e-01  -9.82e-02  1.18e+03  9.34e-02  9.96e-01
mu_season[2]       -6.00e-02  3.14e-04  1.05e-02  -7.72e-02  -5.97e-02  -4.28e-02  1.11e+03  8.83e-02  9.99e-01
mu_season[3]        1.39e-01  2.92e-04  1.03e-02   1.22e-01   1.39e-01   1.57e-01  1.24e+03  9.84e-02  1.00e+00
p[1]                6.63e-01  1.73e-03  4.72e-02   6.05e-01   6.51e-01   7.52e-01  7.42e+02  5.88e-02  1.00e+00
p[2]                6.21e-01  2.03e-04  5.94e-03   6.11e-01   6.21e-01   6.30e-01  8.55e+02  6.77e-02  9.99e-01
p[3]                6.28e-01  2.49e-03  6.78e-02   5.45e-01   6.10e-01   7.55e-01  7.41e+02  5.87e-02  1.00e+00
g[1]                8.41e-01  1.74e-03  4.87e-02   7.64e-01   8.41e-01   9.25e-01  7.84e+02  6.20e-02  9.98e-01
g[2]                3.50e-01  7.93e-04  2.27e-02   3.13e-01   3.50e-01   3.87e-01  8.22e+02  6.51e-02  1.00e+00
w[1]                7.71e-01  2.50e-03  6.83e-02   6.56e-01   7.73e-01   8.81e-01  7.46e+02  5.91e-02  9.99e-01
w[2]                1.42e-01  4.38e-04  1.25e-02   1.21e-01   1.42e-01   1.62e-01  8.10e+02  6.41e-02  9.98e-01
w[3]                6.36e-01  7.88e-04  2.24e-02   5.98e-01   6.36e-01   6.73e-01  8.11e+02  6.42e-02  1.00e+00
d[1]                6.83e-02  4.59e-04  1.15e-02   4.93e-02   6.84e-02   8.80e-02  6.23e+02  4.93e-02  1.01e+00
d[2]                6.93e-01  3.66e-04  1.03e-02   6.76e-01   6.93e-01   7.11e-01  7.85e+02  6.22e-02  9.97e-01
d[3]                2.39e-01  4.45e-04  1.15e-02   2.20e-01   2.39e-01   2.58e-01  6.70e+02  5.31e-02  9.99e-01
Warning: non-fatal error reading adapation data
Warning: non-fatal error reading adapation data
Warning: non-fatal error reading adapation data
Warning: non-fatal error reading adapation data
2 of 3200 (0.062%) transitions ended with a divergence.  These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.  Try rerunning with adapt delta set to a larger value and see if the divergences vanish.  If increasing adapt delta towards 1 does not remove the divergences then you will likely need to reparameterize your model.

Results on the slower cluster:

#========================================================================= 
# In folder: 20190625.193309_34194.solon [75,200,10,1,**3412**,600,200,0.98,11,dense_e,4,J90c]"
#=========================================================================
4 chains: each with iter=(200,200,200,200); warmup=(600,600,600,600); thin=(1,1,1,1); 3200 iterations saved.
Warmup took (20245, 21852, 23224, 24925) seconds, 25.1 hours total
Sampling took (3761, 6639, 6824, 6219) seconds, 6.51 hours total
                        Mean      MCSE    StdDev         5%        50%        95%     N_Eff   N_Eff/s     R_hat
lp__                1.87e+04  7.03e-01  1.22e+01   1.86e+04   1.87e+04   1.87e+04  3.02e+02  1.29e-02  1.01e+00
accept_stat__       9.65e-01  1.37e-02  1.13e-01   8.96e-01   9.91e-01   1.00e+00  6.80e+01  2.90e-03  1.04e+00
stepsize__          1.53e-02      -nan  2.26e-03   1.34e-02   1.54e-02   1.89e-02      -nan      -nan  3.54e+14
treedepth__         9.65e+00  3.22e-01  6.28e-01   9.00e+00   1.00e+01   1.00e+01  3.80e+00  1.62e-04  1.47e+00
n_leapfrog__        8.93e+02  1.37e+02  2.49e+02   5.11e+02   1.02e+03   1.02e+03  3.31e+00  1.41e-04  1.60e+00
divergent__         5.00e-03      -nan  7.06e-02   0.00e+00   0.00e+00   0.00e+00      -nan      -nan  1.02e+00
energy__           -1.85e+04  1.02e+00  1.75e+01  -1.85e+04  -1.85e+04  -1.85e+04  2.94e+02  1.25e-02  1.00e+00
sd_y                8.00e-02  2.32e-05  6.04e-04   7.91e-02   8.00e-02   8.12e-02  6.75e+02  2.88e-02  1.00e+00
mu_u1               9.22e-02  3.21e-04  9.02e-03   7.65e-02   9.26e-02   1.06e-01  7.90e+02  3.37e-02  1.00e+00
mu_alpha            3.88e-02  7.23e-05  2.02e-03   3.53e-02   3.89e-02   4.20e-02  7.84e+02  3.34e-02  1.00e+00
beta                6.04e-01  2.66e-04  7.59e-03   5.92e-01   6.04e-01   6.17e-01  8.12e+02  3.46e-02  9.96e-01
theta               1.47e-01  1.24e-04  3.20e-03   1.41e-01   1.47e-01   1.52e-01  6.68e+02  2.85e-02  9.96e-01
sd_season           9.64e-02  1.59e-04  4.12e-03   8.96e-02   9.63e-02   1.03e-01  6.69e+02  2.86e-02  1.00e+00
mu_season[1]       -1.15e-01  3.26e-04  1.01e-02  -1.31e-01  -1.15e-01  -9.95e-02  9.56e+02  4.08e-02  9.97e-01
mu_season[2]       -6.03e-02  3.31e-04  9.65e-03  -7.60e-02  -6.03e-02  -4.50e-02  8.50e+02  3.62e-02  9.98e-01
mu_season[3]        1.39e-01  3.88e-04  1.01e-02   1.22e-01   1.39e-01   1.55e-01  6.83e+02  2.91e-02  1.00e+00
p[1]                6.63e-01  1.84e-03  4.84e-02   6.03e-01   6.54e-01   7.57e-01  6.96e+02  2.97e-02  1.00e+00
p[2]                6.21e-01  1.90e-04  5.53e-03   6.12e-01   6.21e-01   6.31e-01  8.49e+02  3.62e-02  9.98e-01
p[3]                6.27e-01  2.61e-03  6.93e-02   5.41e-01   6.15e-01   7.61e-01  7.06e+02  3.01e-02  1.00e+00
g[1]                8.41e-01  1.68e-03  4.63e-02   7.64e-01   8.41e-01   9.20e-01  7.56e+02  3.23e-02  1.00e+00
g[2]                3.49e-01  7.99e-04  2.19e-02   3.14e-01   3.49e-01   3.86e-01  7.54e+02  3.21e-02  9.99e-01
w[1]                7.69e-01  2.50e-03  6.60e-02   6.59e-01   7.72e-01   8.74e-01  6.99e+02  2.98e-02  1.00e+00
w[2]                1.42e-01  4.30e-04  1.18e-02   1.23e-01   1.42e-01   1.61e-01  7.56e+02  3.22e-02  9.99e-01
w[3]                6.35e-01  7.87e-04  2.16e-02   5.99e-01   6.35e-01   6.70e-01  7.55e+02  3.22e-02  1.00e+00
d[1]                6.82e-02  4.20e-04  1.16e-02   5.00e-02   6.85e-02   8.65e-02  7.60e+02  3.24e-02  1.00e+00
d[2]                6.92e-01  3.53e-04  9.73e-03   6.75e-01   6.93e-01   7.08e-01  7.61e+02  3.25e-02  1.00e+00
d[3]                2.40e-01  4.02e-04  1.18e-02   2.21e-01   2.39e-01   2.59e-01  8.63e+02  3.68e-02  9.97e-01
Warning: non-fatal error reading adapation data
Warning: non-fatal error reading adapation data
Warning: non-fatal error reading adapation data
Warning: non-fatal error reading adapation data
4 of 3200 (0.12%) transitions ended with a divergence.  These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.  Try rerunning with adapt delta set to a larger value and see if the divergences vanish.  If increasing adapt delta towards 1 does not remove the divergences then you will likely need to reparameterize your model.

bgoodri · June 26, 2019, 1:53pm

At a minimum, you need a completely identical C++ compiler and compiler flags. If you compile with -march=native, you also need identical hardware; otherwise I don’t think identical hardware is necessarily required.

syclik · June 29, 2019, 12:22am

I’ll second that. I’d start by checking compiler versions. (Assuming you have the same version of the source.)

One thing that may be tricky: different versions of the Boost library may generate random numbers differently. So if your build process somehow links to a system-installed Boost, it might bypass the included source. You’d really have to check your compiler’s documentation to know whether it searches the local libraries before the included source. I’ve seen that happen before and it threw me off.
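
One way to narrow this down is to compare the raw output CSV files from the two clusters and find the first draw where they disagree. The following is a rough Python sketch, assuming the usual Stan CSV layout (comment lines starting with `#`, one header row, then one row per draw). The function names and the sample values are made up for illustration.

```python
import csv


def read_draws(text):
    """Parse Stan CSV text: skip blank and '#' comment lines, then
    return (header, rows), with every row converted to floats."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.reader(lines)
    header = next(reader)
    rows = [[float(v) for v in row] for row in reader]
    return header, rows


def first_mismatch(rows_a, rows_b):
    """Return the index of the first draw that differs, or None if all
    compared draws match (only the common length is compared)."""
    for i, (ra, rb) in enumerate(zip(rows_a, rows_b)):
        if ra != rb:
            return i
    return None


# Hypothetical miniature outputs; real files would be read from disk.
run_a = """# seed = 3412
lp__,accept_stat__,beta
-7.1,0.98,0.604
-7.3,0.97,0.605
-7.0,0.99,0.603
"""
run_b = """# seed = 3412
lp__,accept_stat__,beta
-7.1,0.98,0.604
-7.3,0.97,0.606
-7.2,0.95,0.601
"""

_, draws_a = read_draws(run_a)
_, draws_b = read_draws(run_b)
print(first_mismatch(draws_a, draws_b))   # 1
```

By default the CSV contains only post-warmup draws, so the comparison is more informative if both runs are repeated with warmup draws saved (save_warmup=1 in CmdStan). A mismatch at the very first draw might point toward the random number generator or the build, while a later mismatch might point toward accumulated floating-point differences, though neither is conclusive. The exact comparison treats any NaN entry as a mismatch.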
