I am currently facing an issue with the replicability of my estimation results. I initially estimated the parameters of my model on Windows OS, but I noticed slight differences when attempting to replicate the same estimation on macOS. I am using CmdStanR for these estimations, and I would appreciate any advice on ensuring consistent results across operating systems.
Are you getting different statistical results (e.g. mean estimates), or only different draws?
Exact numerical reproducibility (e.g. the same draws in the same order) across operating systems is not something Stan seeks to guarantee. It’s possible you could even observe differences on the same operating system on two different pieces of hardware.
But if you’re getting truly different posteriors, it would be interesting to explore why.
If you can’t share your code, can you say what you mean by “slight differences”? Are the differences in mean estimates within MCMC standard error?
MCMC is random, and results can vary across runs. Even with the same seed, floating-point arithmetic behaves differently on different hardware, under different C++ compilers and optimization levels, etc.
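To make that concrete, a fixed seed in CmdStanR pins down the RNG stream but not the floating-point arithmetic. A minimal sketch (the model file and `data_list` are placeholders, not your actual setup):

```r
library(cmdstanr)

# Hypothetical model and data; the point is the seed argument.
mod <- cmdstan_model("model.stan")
fit <- mod$sample(
  data = data_list,
  seed = 1234,   # identical seed on both machines
  chains = 4
)
```

Even with the identical seed on both machines, the compiled C++ can round differently on Windows and macOS, so the leapfrog trajectories, and hence the draws, can diverge.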
Unfortunately I can’t share my code. But yes, I observe differences in the statistics of the posterior distribution of the parameters, including mean estimates.
How large are the differences, and how large are the Monte Carlo standard errors?
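If you have both sets of draws loaded, the posterior package makes this comparison easy. A sketch, assuming `draws_win` and `draws_mac` are the draws objects from the two machines (hypothetical names):

```r
library(posterior)

# Posterior mean and its Monte Carlo standard error, per parameter.
s_win <- summarise_draws(draws_win, mean, mcse_mean)
s_mac <- summarise_draws(draws_mac, mean, mcse_mean)

# Difference in means relative to the combined MCSE; values well
# above ~2-3 would suggest more than ordinary Monte Carlo noise.
abs(s_win$mean - s_mac$mean) / sqrt(s_win$mcse_mean^2 + s_mac$mcse_mean^2)
```

If those ratios are small, the “slight differences” are just sampling noise.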
Here is one check that might be clarifying:
Run a sizeable number of chains (say six or ten) on each OS. Then copy the output CSVs from one computer to the other, load both sets of CSVs on one machine, and compute convergence diagnostics three times:

1. over the CSVs from one OS,
2. over the CSVs from the other OS,
3. over all the CSVs combined into one object.

If R-hat and the other diagnostics are not much worse in (3) than in (1) or (2), the two sets of chains are consistent with having sampled the same posterior, and it’s likely that nothing unexpected is happening.
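With cmdstanr and posterior, that check might look something like the following (the directory paths are placeholders for wherever you collect the CSVs):

```r
library(cmdstanr)
library(posterior)

# CSVs from both machines, copied onto one computer (hypothetical paths).
csvs_win <- list.files("output/windows", full.names = TRUE)
csvs_mac <- list.files("output/macos", full.names = TRUE)

d_win <- read_cmdstan_csv(csvs_win)$post_warmup_draws
d_mac <- read_cmdstan_csv(csvs_mac)$post_warmup_draws

# Treat the two sets of chains as one fit.
d_all <- bind_draws(d_win, d_mac, along = "chain")

summarise_draws(d_win, rhat, ess_bulk)  # (1) one OS
summarise_draws(d_mac, rhat, ess_bulk)  # (2) other OS
summarise_draws(d_all, rhat, ess_bulk)  # (3) combined
```

If the R-hat values in the combined summary stay close to 1, the chains from the two operating systems are mixing with each other, which is what you'd expect if both target the same posterior.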