Just wanted to emphasize a few points that @maxbiostat and @jsocolar have made.
In Stan each Markov chain is initialized and effectively evolves independently* from all other chains.
This means that increasing the number of Markov chains doesn’t change the behavior of any previously generated Markov chains**. Consequently the behavior encountered here almost always indicates that the new Markov chains were able to explore parts of the model configuration space that the previous Markov chains had largely ignored. In this case the only way to move forward is to investigate this new exploration, identify whether it’s meaningful or pathological, and then follow up based on whether or not you want to incorporate this exploration into your posterior (i.e. if it’s meaningful behavior) or exclude it (i.e. if it’s pathological and inconsistent with domain expertise).
* Pedantry: Stan achieves this independence by partitioning the states of a pseudo-random number generator, as discussed for example in Rumble in the Ensemble. Technically if one uses a lot of very long Markov chains then the pseudo-random number generator states might start to overlap and induce subtle correlations between the individual Markov chains, but this is a very extreme circumstance.
** More Pedantry: If you fix the `seed` argument and run on the same machine twice, once with `n_chains=4` and once with `n_chains=8`, then the first four Markov chains will be identical. If you’re not fixing the `seed` argument then running with an increased `n_chains` will not preserve any of the previously generated Markov chains. In this case running again with a larger `n_chains` argument effectively generates an entirely new set of Markov chains which may or may not have an opportunity to explore behavior not encountered by any previous Markov chains.
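To make both footnotes concrete, here’s a small numpy sketch of the general pattern (this is an illustration, not Stan’s actual implementation): a single seed is expanded into independent per-chain PRNG streams, so with a fixed seed the first four of eight streams coincide with a four-stream run.

```python
import numpy as np

def run_chains(seed, n_chains, n_draws=5):
    # Spawn one independent PRNG stream per chain from a single seed,
    # mimicking how a single seed can be partitioned across chains.
    streams = np.random.SeedSequence(seed).spawn(n_chains)
    return [np.random.default_rng(s).standard_normal(n_draws) for s in streams]

four = run_chains(seed=2024, n_chains=4)
eight = run_chains(seed=2024, n_chains=8)

# With a fixed seed the first four streams are identical across the two runs.
for a, b in zip(four, eight[:4]):
    assert np.allclose(a, b)
```

Changing the seed, of course, replaces every stream with a new one.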
Effective sample size is a quantity defined in the context of a Markov chain Monte Carlo central limit theorem. If such a central limit theorem doesn’t hold for an expectand then there is no corresponding effective sample size. In this case one can still construct effective sample size estimators, which is what Stan does, but those estimators don’t correspond to any meaningful property. The split \hat{R} diagnostic is useful precisely because it is sensitive to failures of a Markov chain Monte Carlo central limit theorem: \hat{R} \gg 1 is inconsistent with a central limit theorem holding, in which case the effective sample size estimators are meaningless.
In other words the key issue here isn’t the effective sample size estimator plummeting as more Markov chains are added but rather \hat{R} suddenly increasing from 1. Once you see that, the effective sample size estimator isn’t worth considering. This is made all the more confusing by the fact that Stan’s effective sample size estimator incorporates a completely heuristic modification that pushes the estimator to zero if \hat{R} is large. Again, in this case it’s not that the effective sample size is small but rather that it’s meaningless.
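For intuition, here is a minimal numpy sketch of the basic split-\hat{R} computation (Stan’s current diagnostic adds rank normalization and folding on top of this); the well-mixed and “stuck” chains below are simulated purely for illustration.

```python
import numpy as np

def split_rhat(chains):
    """Basic split-R-hat; Stan's diagnostic adds rank normalization."""
    chains = np.asarray(chains)
    m, n = chains.shape
    half = n // 2
    # Split each chain in half so within-chain nonstationarity shows up
    # as disagreement between the split pieces.
    splits = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    k, l = splits.shape
    means = splits.mean(axis=1)
    B = l * means.var(ddof=1)                 # between-split variance
    W = splits.var(axis=1, ddof=1).mean()     # within-split variance
    var_plus = (l - 1) / l * W + B / l
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
good = rng.standard_normal((4, 1000))             # four well-mixed "chains"
bad = good + np.array([[0.], [0.], [0.], [5.]])   # one chain stuck in another mode
```

Running `split_rhat` on `good` gives a value near 1, while the stuck chain in `bad` pushes it far above 1, signaling that the chains are not exploring a common distribution.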
Finally there is no general guarantee that a posterior distribution will concentrate around the true model configuration, even in simulation settings where one simulates data from the same model used to construct the posterior. Intuitively the problem is that the likelihood function concentrates around model configurations consistent with the observed data, which are not always near the true model configuration. When the data fluctuate in conspiratorial ways the likelihood function can concentrate away from the true model configuration, which is exactly the source of over-fitting. For nice models, where these conspiratorial fluctuations are rare or the prior model is able to suppress their influence, the posterior distribution will typically concentrate around the true model configuration, but this is not a guarantee and has to be verified for each analysis.
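A toy conjugate-normal example (all numbers made up for illustration) shows how the exact posterior can sit far from the true model configuration without any computational failure:

```python
import numpy as np

# Conjugate model: y ~ normal(mu, 1) with prior mu ~ normal(0, 1), so the
# posterior is normal with mean n * ybar / (n + 1) and sd 1 / sqrt(n + 1).
def posterior(y):
    n = len(y)
    return np.sum(y) / (n + 1), 1.0 / np.sqrt(n + 1)

true_mu = 0.0
# A "conspiratorial" data set: every observation happens to fluctuate upwards.
y_unlucky = np.array([1.8, 2.1, 1.6, 2.3, 1.9])
post_mean, post_sd = posterior(y_unlucky)
# The exact posterior concentrates several posterior standard deviations
# away from the true mu = 0, even though the computation is exact.
```

Here comparing the (exact!) posterior to `true_mu` would look like a failure, which is exactly the ambiguity described above.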
The consequence for a simulation study like this is that comparing an approximate posterior quantification to the true model configuration results in ambiguity. It could be that the exact posterior concentrates around the true model configuration, in which case any deviations indicate computational problems, or it could be that the posterior approximation is accurate but the exact posterior doesn’t concentrate around the true model configuration!
Here Stan’s empirical diagnostics clearly indicate that each Markov chain is not recovering a consistent posterior approximation, which strongly suggests computational problems, in particular multimodality. Why the posterior from the simulated data is exhibiting such multimodality depends on the particular details of your model and the simulated data you are using.