Different sampling results for the same model when using the same data and seed


#1

I expected that minor modifications to how a model is implemented in Stan should not influence the sampling results, but it seems to me that they do.

I don’t think I am making an obvious mistake: I am using the same data for both models and I am setting the seed when sampling from the models in rstan.
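In case it matters, the calls look roughly like this (the model and data object names are placeholders for my actual objects):

# same data and same seed for both fits
fit_original <- sampling(original_model, data = standata, seed = 1234, chains = 4)
fit_modified <- sampling(modified_model, data = standata, seed = 1234, chains = 4)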

The two models are identical except for a few lines in the model block, where I replace a long vector, in which a few parameters are repeated many times, with a short vector in which each parameter appears only once. The code comes from a hierarchical model in which the over-dispersion parameter varies between groups. brms implements such a model by generating a vector phi of over-dispersion parameters with as many entries as there are observations (N).
Here is the relevant section of the original model:

vector[N] phi = temp_phi_Intercept + rep_vector(0, N); // one entry per observation
for (n in 1:N) {
    phi[n] += r_2_phi_1[J_2[n]] * Z_2_phi_1[n]; // add the group-level effect
    phi[n] = exp(phi[n]);                       // apply the inverse link
}
for (n in 1:N) {
    target += beta_binomial2_lpmf(Y[n] | mu[n], phi[n], trials[n]);
}

Because the number of groups (N_1) is much smaller than the number of observations (N), and we have a vector J_1 that indicates to which group each observation belongs, the same model can also be implemented as:

vector[N_1] phi = temp_phi_Intercept + rep_vector(0, N_1); // one entry per group
for (n in 1:N_1) {
    phi[n] += r_2_phi_1[n] * Z_2_phi_1[n]; // add the group-level effect
    phi[n] = exp(phi[n]);                  // apply the inverse link
}
for (n in 1:N) {
    target += beta_binomial2_lpmf(Y[n] | mu[n], phi[J_1[n]], trials[n]); // look up the group's phi
}

So the only difference between the original and the modified version is how the N_1 over-dispersion parameters are stored in the vector phi.

I would have expected that sampling from the original and the modified model with the same data and seed would give identical results (i.e. identical chains), but I get chains that are very close yet not the same.

Should one not expect identical results from these two models?


#2

No. Recall that floating-point arithmetic on a computer is not even associative. You change how terms are accumulated here, and that leads to minor numerical differences, which get amplified considerably by the NUTS sampler.
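A quick illustration in R (nothing Stan-specific, just IEEE-754 doubles):

(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)    # FALSE: the two sums differ in the last bits
((0.1 + 0.2) + 0.3) - (0.1 + (0.2 + 0.3)) # roughly 1.1e-16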


#3

Thanks. Identical results would have been a convenient way to check that I don’t break a model when I try to optimize it.
But I can still check whether the results of the modified model are within one MCSE of the results of the original model.


#4

Yes, the only consistency guaranteed is that the expectation (mean) estimates are within the respective Monte Carlo standard errors of each other. In particular,

\frac{ \hat{f}_{1} - \hat{f}_{2} }{ \sqrt{ \text{MCSE}_{1}^{2} + \text{MCSE}_{2}^{2} } } \sim \mathcal{N}(0, 1)

at least approximately.

So you can compute the left-hand side and require that it be within ±2 as a quick check, or, to be more thorough, you can run both models with multiple seeds and histogram the corresponding scaled differences to get a feel for the distributional behavior.
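A minimal sketch of the quick check in R with rstan, assuming fit_original and fit_modified are the two stanfit objects and picking temp_phi_Intercept as the parameter to compare:

# scaled difference of the posterior means of one parameter between the two fits;
# se_mean in the rstan summary is the Monte Carlo standard error of the mean
s1 <- summary(fit_original, pars = "temp_phi_Intercept")$summary
s2 <- summary(fit_modified, pars = "temp_phi_Intercept")$summary
z  <- (s1[, "mean"] - s2[, "mean"]) / sqrt(s1[, "se_mean"]^2 + s2[, "se_mean"]^2)
abs(z) < 2  # TRUE suggests the two fits agree within Monte Carlo error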