Different sampling results for identical model when using same data and seed

I expected that minor modifications of how a model is implemented in Stan should not influence sampling results, but it seems to me that they do.

I don’t think that I am making an obvious mistake: I am using the same data for both models and I am setting the seed when sampling from each model in rstan.

The two models are identical, except for a few lines in the model block, where I replace a long vector containing repetitions of a few parameters with a short vector in which each parameter appears only once. The code comes from a hierarchical model in which the over-dispersion parameter varies between groups. brms implements such a model by generating a vector phi of over-dispersion parameters with as many entries as there are observations (N).
Here is the relevant section of the original model:

vector[N] phi = temp_phi_Intercept + rep_vector(0, N);
for (n in 1:N) {
    phi[n] += r_2_phi_1[J_2[n]] * Z_2_phi_1[n];
    phi[n] = exp(phi[n]);
}
for (n in 1:N) {
    target += beta_binomial2_lpmf(Y[n] | mu[n], phi[n], trials[n]);
}

Because the number of groups (N_1) is much smaller than the number of observations (N) and we have a vector J_1 which indicates to which group each observation belongs, the same model can also be implemented as

vector[N_1] phi = temp_phi_Intercept + rep_vector(0, N_1);
for (n in 1:N_1) {
    phi[n] += r_2_phi_1[n] * Z_2_phi_1[n];
    phi[n] = exp(phi[n]);
}
for (n in 1:N) {
    target += beta_binomial2_lpmf(Y[n] | mu[n], phi[J_1[n]], trials[n]);
}

So the only difference between the original and the modified version is how the N_1 over-dispersion parameters are stored in the vector phi.
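This equivalence can be checked outside Stan. Below is a minimal NumPy sketch (toy sizes and hypothetical names, with the design values Z_2_phi_1 taken to be 1) showing that the two storage schemes yield the same per-observation value of phi:

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_1 = 10, 3                    # toy numbers of observations and groups
J = rng.integers(0, N_1, size=N)  # group membership of each observation
r = rng.normal(size=N_1)          # group-level deviations
intercept = 0.5                   # stand-in for temp_phi_Intercept

# Version 1: length-N vector, each group's parameter repeated per observation
phi_long = np.exp(intercept + r[J])

# Version 2: length-N_1 vector, indexed into at the point of use
phi_short = np.exp(intercept + r)

# Both schemes give the same value for every observation
print(np.array_equal(phi_long, phi_short[J]))  # True
```

The values agree exactly; as discussed below, what differs between the two Stan programs is the order of floating-point operations during evaluation, not the mathematical model.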

I would have expected that sampling from the original and the modified model with the same data and seed should give identical results (i.e. identical chains), but I get different chains (very close, but not the same).

Should one not expect identical results from these two models?

No. Recall that floating-point arithmetic on a computer is not even associative. You change how things are accumulated here, and that leads to minor numerical differences which the NUTS sampler then amplifies considerably.
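A tiny illustration of the non-associativity, using plain Python floats (IEEE 754 doubles):

```python
a, b, c = 0.1, 0.2, 0.3

# Summing the same three doubles in two different orders
left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)       # False
print(abs(left - right))   # tiny, but nonzero
```

In a Hamiltonian trajectory these tiny discrepancies accumulate over many leapfrog steps, so two runs that start identically can eventually make different accept/reject decisions and diverge entirely.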

Identical results would have been convenient as an easy check that I don’t mess up a model when I try to optimize it.
But I can still check whether the results from the modified model are within one MCSE of the original model.

Yes, the only consistency guaranteed is that the expectation (mean) estimates are within the relative Monte Carlo standard errors of each other. In particular,

\frac{ \hat{f}_{1} - \hat{f}_{2} }{ \sqrt{ \text{MCSE}_{1}^{2} + \text{MCSE}_{2}^{2} } } \sim \mathcal{N}(0, 1)

at least approximately.

So you can compute the left-hand side and require that it be within ±2 as a quick check, or, to be more thorough, you can run both models with multiple seeds and histogram the corresponding scaled differences to get a feel for the distributional behavior.
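A sketch of that quick check in Python (hypothetical helper; it uses naive standard errors under an independent-draws assumption — for real MCMC output you should substitute ESS-based MCSEs, e.g. from rstan's summary or the posterior package):

```python
import numpy as np

def scaled_difference(draws1, draws2):
    """(f1_hat - f2_hat) / sqrt(MCSE1^2 + MCSE2^2).

    Naive SEs assuming independent draws; for correlated MCMC
    draws, replace these with ESS-based Monte Carlo SEs."""
    f1, f2 = draws1.mean(), draws2.mean()
    se1 = draws1.std(ddof=1) / np.sqrt(draws1.size)
    se2 = draws2.std(ddof=1) / np.sqrt(draws2.size)
    return (f1 - f2) / np.sqrt(se1 ** 2 + se2 ** 2)

# Stand-ins for draws of the same posterior quantity from the two fits
rng = np.random.default_rng(1)
z = scaled_difference(rng.normal(size=4000), rng.normal(size=4000))
print(z)  # compare against the +/-2 threshold
```

If the two fits target the same posterior, z should behave approximately like a standard normal draw.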