This happens in your example, and probably in many others, but you can’t assume that it would happen for all models. If all chains are started close to the same point and all chains get stuck, the Rhat diagnostic may indicate convergence although the typical set has not yet been explored. The Rhat diagnostic is more reliable if chains are started overdispersed. It’s easy to come up with multi-modal examples where getting stuck is likely. There are also many latent Markov and GP models where the posterior might be unimodal but the exploration can still be very slow.
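To make the "stuck chains" failure concrete, here is a toy sketch, not anything Stan does internally. Everything in it is an assumption for illustration: the bimodal target, the plain random-walk Metropolis sampler, the starting points, and the classic (non-split, non-rank-normalized) Rhat formula.

```python
import math
import random
import statistics

def log_target(x):
    # Hypothetical bimodal target: equal mixture of N(-10, 1) and N(10, 1).
    a = -0.5 * (x + 10.0) ** 2
    b = -0.5 * (x - 10.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def metropolis(x0, n_draws, rng, step=2.4):
    # Plain random-walk Metropolis (a stand-in sampler, not Stan's HMC/NUTS).
    x, lp = x0, log_target(x0)
    draws = []
    for _ in range(n_draws):
        prop = x + rng.gauss(0.0, step)
        lp_prop = log_target(prop)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = prop, lp_prop
        draws.append(x)
    return draws

def rhat(chains):
    # Classic potential scale reduction factor for equal-length chains.
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    return math.sqrt(((n - 1) / n * within + between / n) / within)

rng = random.Random(2024)
n_draws = 2000

# All chains start close to the same point, inside one mode.
close_starts = [9.9, 10.0, 10.1, 10.2]
rhat_close = rhat([metropolis(x0, n_draws, rng) for x0 in close_starts])

# Overdispersed starts that cover both modes.
over_starts = [-12.0, -9.0, 9.0, 12.0]
rhat_over = rhat([metropolis(x0, n_draws, rng) for x0 in over_starts])

print(f"Rhat, close starts:        {rhat_close:.3f}")
print(f"Rhat, overdispersed starts: {rhat_over:.3f}")
```

With the close starts, every chain stays in the same mode, so Rhat looks fine even though half of the posterior mass was never visited. With the overdispersed starts, the chains disagree and Rhat flags the problem.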
For the fastest convergence of MCMC, it would be best to initialize with draws from the posterior, as the chain has then converged already. But if we were able to initialize from the posterior, we wouldn’t need MCMC.
The Rhat diagnostic is most reliable if the initial points have larger variance than the posterior. Overdispersed initial points can be in the typical set; they are overdispersed if the variance estimate from them is larger than the posterior variance. It’s enough that the variance of the initial points is just slightly larger, like 10%. But the problem is that we rarely know the posterior variance beforehand, so a reasonable choice often would be to draw initial values from the prior, as that is usually overdispersed compared to the posterior. Stan doesn’t know which part of the target is prior and which part is likelihood, so you would need to make these draws yourself. Stan’s default random initialization draws uniformly from an interval in the unconstrained space, by default (-2, 2) (remember, it doesn’t know which part is the prior, so it can’t automatically do something better). These draws would be, in expectation, overdispersed for distributions with mean close to zero and approximately unit variance, which of course doesn’t hold for all models, and this initialization can also be far from the prior or the posterior.
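As a minimal sketch of "make these draws yourself", the following draws one set of initial values per chain from the prior. The model is hypothetical (`mu ~ normal(0, 5)`, `sigma ~ half-normal(0, 2)`), and so are the function and parameter names; you would replace them with the priors of your own model. Most Stan interfaces accept user-supplied per-chain initial values, but check your interface's documentation for the exact argument.

```python
import random

def draw_prior_inits(n_chains, seed=1):
    # Hypothetical priors: mu ~ normal(0, 5), sigma ~ half-normal(0, 2).
    # Replace these with draws from the priors of your own model.
    rng = random.Random(seed)
    inits = []
    for _ in range(n_chains):
        inits.append({
            "mu": rng.gauss(0.0, 5.0),
            "sigma": abs(rng.gauss(0.0, 2.0)),  # half-normal via absolute value
        })
    return inits

# One dictionary of initial values per chain.
inits = draw_prior_inits(n_chains=4)
for chain_id, init in enumerate(inits, start=1):
    print(chain_id, init)

# For comparison: a uniform(a, b) draw has variance (b - a)^2 / 12, so Stan's
# default uniform(-2, 2) init on the unconstrained scale has variance 4/3,
# which is only overdispersed relative to roughly unit-variance posteriors.
default_init_var = (2.0 - (-2.0)) ** 2 / 12.0
print(default_init_var)
```

The values are drawn on the constrained scale (for example, `sigma` is positive), which is the scale on which user-supplied initial values are given to Stan.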
A further complication comes from thick-tailed distributions that have infinite variance, for which we need to define overdispersion in terms of quantiles etc.
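One possible way to read "in terms of quantiles" is to compare interquartile ranges instead of variances. The sketch below is just that, an assumption about one workable definition, not an established diagnostic. The Cauchy draws stand in for a posterior and for a candidate initial-value distribution, and the helper names are made up for illustration.

```python
import math
import random
import statistics

def cauchy_draws(scale, n, rng):
    # Inverse-CDF sampling from a Cauchy(0, scale), which has infinite variance.
    return [scale * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def iqr(xs):
    # Interquartile range: distance between the 25% and 75% quantiles.
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

rng = random.Random(7)
reference = cauchy_draws(1.0, 5000, rng)   # stand-in for posterior draws
candidates = cauchy_draws(2.0, 5000, rng)  # stand-in for initial-value draws

# A ratio above 1 means the candidates are wider than the reference in the
# quantile sense, even though neither sample has a meaningful variance.
ratio = iqr(candidates) / iqr(reference)
print(f"IQR ratio (candidates / reference): {ratio:.2f}")
```

Since the interquartile range of a Cauchy distribution is twice its scale, the ratio here should come out close to 2, whereas a ratio of sample variances would be dominated by a few extreme draws.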
If you choose your initial values well, you start in the typical set but still have overdispersed initial points, which gives better reliability of Rhat. But at least try to get the initial values in the typical set of the prior.
If you have a specific model and prior and you have tested that this works (see, e.g., https://arxiv.org/abs/1804.06788) and makes your sampling faster and you get repeatedly good results, then just use it (but in a publication be prepared to tell that you have tested this for your specific model and prior). As we don’t know which models and priors the next person is going to use Stan for, we have to keep recommending by default the use of overdispersed initial values, which make the convergence diagnostic more reliable.