Because it’s sampling-based estimation. For example, if I repeatedly take 1000 draws from a standard normal distribution, I don’t expect each batch to have a mean of exactly 0.00000. Instead, it looks like this:
> for (n in 1:10) print(mean(rnorm(1000)))
[1] -0.008647283
[1] 0.004369155
[1] -0.02758793
[1] -0.07020262
[1] -0.03062315
[1] 0.02008949
[1] 0.009072343
[1] 0.04534396
[1] -0.02572225
[1] 0.02673742
And that variation isn’t a convergence problem; these are just 1000 independent draws each time.
What you want to look at in the Stan output is the MCMC standard error (MCSE). That tells you how accurate you can expect your mean estimates to be. It’s defined as the posterior standard deviation divided by the square root of the effective sample size (ESS). For example, with the 1000 draws from a standard normal, the sd is 1 and the ESS is 1000, so we expect errors distributed roughly as normal(0, 1/sqrt(1000)). For instance,
> for (n in 1:10) print(rnorm(1, 0, 1/sqrt(1000)))
[1] 0.02822878
[1] 0.03832842
[1] 0.02723462
[1] 0.03650974
[1] -0.007586212
[1] -0.03132589
[1] -5.446237e-05
[1] -0.006155
[1] 0.04067246
[1] -0.04534127
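To make that definition concrete, here’s the same quantity computed directly from a set of draws. This is just a sketch for the independent-draw case, where the ESS is simply the number of draws; with real MCMC output you’d plug in the ESS that Stan reports.

draws <- rnorm(1000)                      # 1000 independent standard-normal draws
mcse <- sd(draws) / sqrt(length(draws))   # MCSE = posterior sd / sqrt(ESS); here ESS = 1000
print(mcse)                               # roughly 1 / sqrt(1000), i.e., about 0.032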
With more draws, the estimated means cluster more tightly around zero. For instance, if I take 100,000 draws from a standard normal, the error is 1/10th of what it is with 1000 draws:
> for (n in 1:10) print(mean(rnorm(1e5)))
[1] 0.005991892
[1] 0.003730479
[1] 0.003209092
[1] -0.006674568
[1] -0.0009652304
[1] 0.001986868
[1] 0.003473231
[1] 0.0003379722
[1] -0.002987224
[1] 0.001434846
Basically, you need N^2 times as many draws to reduce the error by a factor of N.
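As a rough empirical check of that scaling (a sketch in base R, nothing Stan-specific), the spread of the estimated means shrinks by a factor of 10 when the number of draws grows by a factor of 100:

for (n in c(1e3, 1e5)) {
  means <- replicate(100, mean(rnorm(n)))   # 100 independent estimates of the mean
  cat("n =", n, " sd of estimates =", sd(means),
      " expected 1/sqrt(n) =", 1 / sqrt(n), "\n")   # the two should track each other
}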
Welcome to Monte Carlo-based estimation.