I am working on a model of high dimensionality (a hierarchical model with ~50000 parameters at the lowest level). Initializing with Stan's default (uniform [-2, 2] on the unconstrained scale) causes numerical problems. Seeding with random draws from the priors kind of works, but it takes a very long time for chains to converge. Sometimes 2000 iterations aren't enough, and that takes around 48 h to compute. I want to give Stan a hand by starting the chains at reasonable values, and, for at least some parameters, I can calculate decent guesses based on the data. Is there a good practice for how to over-disperse initial values around my data-based guesses? Alternatively, is there a diagnostic tool to check whether the initialization was over-dispersed (as opposed to checking whether the chains have converged, assuming they were seeded with over-dispersion)?
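For concreteness, this is roughly what I have in mind: a minimal sketch (using cmdstanpy; the parameter name `theta`, the jitter scale, and the file names are placeholders, not my actual model) that jitters per-chain initial values around data-based guesses and writes one init file per chain:

```python
import numpy as np
from cmdstanpy import CmdStanModel, write_stan_json

rng = np.random.default_rng(1234)
n_chains = 4

# Data-based point guesses for (some of) the parameters; placeholder values here.
theta_guess = np.zeros(50_000)  # in practice, computed from the data

# One init file per chain: jitter the guesses so the chains start apart.
# The jitter scale (0.5 here) is itself just a guess.
init_files = []
for c in range(n_chains):
    inits = {"theta": theta_guess + rng.normal(0.0, 0.5, theta_guess.size)}
    path = f"inits_chain_{c}.json"
    write_stan_json(path, inits)
    init_files.append(path)

model = CmdStanModel(stan_file="model.stan")
fit = model.sample(data="data.json", chains=n_chains, inits=init_files)
```

But I don't know how large the jitter scale should be for the starting points to count as over-dispersed, hence the question.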
The recommendation to use over-dispersed initial values is there to make the \widehat{R} diagnostic more conservative. In high dimensions, it is very difficult to generate over-dispersed initial values that would at the same time be near the interesting posterior region (sometimes called the typical set). It is likely that over-dispersion is not a strict requirement even in theory. Section 5.1 of
has a recent theoretical result. The worry about under-dispersed initial points was bigger in the 1990s, when only a few short chains were run (computers were much slower than today), so it was possible just by chance to observe a deceptively small Rhat early on (the threshold was also much higher at that time). With long efficient chains (like 1000 iterations of HMC/NUTS in Stan), or with a large number of short but efficient chains (like 100+ chains of a few iterations of HMC/NUTS each), there is less worry of getting a small Rhat value just by chance. You could examine the behavior of Rhat or, even easier to interpret, the effective sample size with an increasing number of iterations, as in Rank-Normalization, Folding, and Localization: An Improved \widehat{R} for Assessing Convergence of MCMC (with Discussion).
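As a concrete illustration (a minimal sketch using ArviZ on a cmdstanpy fit; the parameter name `theta` is a placeholder for whichever quantities you want to monitor), you could compute the rank-normalized Rhat and bulk ESS on increasing prefixes of the draws and watch how they evolve:

```python
import arviz as az

idata = az.from_cmdstanpy(fit)  # or any InferenceData object

# Rank-normalized Rhat and bulk ESS on increasing prefixes of the chains.
# Rhat drifting toward 1 and ESS growing roughly linearly with the number
# of iterations is the behavior you would hope to see.
for n in (250, 500, 1000, 2000):
    sub = idata.posterior.isel(draw=slice(0, n))
    rhat = az.rhat(sub, method="rank")
    ess = az.ess(sub, method="bulk")
    print(n, float(rhat["theta"].max()), float(ess["theta"].min()))
```

Looking at the worst Rhat and the smallest ESS across parameters, as above, is the conservative summary for a model with many parameters.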