Do you think it is fair/legitimate to use least-squares or optimization-estimated parameter values to initialize the chain? I noticed that in some cases the computation speeds up a lot if I do so. Following this point, since we use domain knowledge to set weakly informative priors for each parameter, is it also fair to initialize the chain with, for instance, the mean value of the corresponding prior? If it is fair to do so and it speeds up the computation, why don’t we always initialize with this strategy?
I ask because I have read in the literature that people use domain knowledge to initialize Gibbs sampling chains, using either expected parameter values or least-squares solutions, and claim that this is fair since the warm-up samples are discarded anyway. However, I wonder whether this would increase the chance of reporting false positives?
Random inits help you find things like multiple modes or unexpected weirdness in your posterior, especially in early model development. If you’re happy with your model, you just want to run it faster, and you can reliably compute a mode, go for it.
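To make the multiple-modes point concrete, here is a toy sketch (not from anyone’s actual model in this thread): a simple random-walk Metropolis sampler on a made-up bimodal target, where chains started from different points each settle into their own mode. Identical inits would have hidden the second mode entirely.

```python
import numpy as np

def metropolis(logp, init, n_iter, step, rng):
    """Minimal random-walk Metropolis sampler, for illustration only."""
    x = init
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = x + rng.normal(scale=step)
        # Accept with probability min(1, p(prop)/p(x))
        if np.log(rng.uniform()) < logp(prop) - logp(x):
            x = prop
        draws[i] = x
    return draws

# A well-separated bimodal target: equal mixture of N(-5, 1) and N(5, 1)
def logp(x):
    return np.logaddexp(-0.5 * (x + 5) ** 2, -0.5 * (x - 5) ** 2)

rng = np.random.default_rng(0)
# With step size 1, a chain essentially never crosses the 10-unit gap,
# so each chain only explores the mode it was started near.
chain_a = metropolis(logp, -5.0, 2000, 1.0, rng)
chain_b = metropolis(logp, 5.0, 2000, 1.0, rng)
print(chain_a.mean(), chain_b.mean())  # roughly -5 and +5
```

Only by comparing chains with different starting points do you notice the disagreement; any single chain here looks perfectly well behaved on its own.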
Another thing you’ll run across with random inits is the idea that Rhat will be more reliable if you draw the chains’ initial conditions in an overdispersed way around the posterior. As the basic Rhat formulation goes, you assume the between-chain variance starts out greater than the within-chain variance, so Rhat only shrinks toward 1 once the chains have actually mixed. I think BDA3 has a description of this (pdf here: Home page for the book, "Bayesian Data Analysis").
Anyway, if you start all the chains at the same point, the fear is that the between-chain variance won’t be as high as the within-chain variance, and that will mess up Rhat. I think practically this has not been as important as it sounds like it could be – Rhat still mostly works alright in these situations.
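For reference, a minimal sketch of the basic (non-split) Rhat formula described in BDA3, comparing between-chain and within-chain variance; the chains here are synthetic normal draws, not output from any real model:

```python
import numpy as np

def basic_rhat(chains):
    """Basic (non-split) Rhat: chains has shape (m, n),
    m chains with n draws each."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    # Between-chain variance B and mean within-chain variance W
    B = n * chain_means.var(ddof=1)
    W = chains.var(axis=1, ddof=1).mean()
    # Pooled estimate of the posterior variance, then Rhat
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)

# Well-mixed chains exploring the same distribution: Rhat close to 1
mixed = rng.normal(size=(4, 1000))
print(basic_rhat(mixed))  # close to 1.0

# Chains stuck around two different locations: Rhat well above 1
stuck = mixed + np.array([[0.0], [0.0], [5.0], [5.0]])
print(basic_rhat(stuck))  # well above 1
```

The second case shows why overdispersed inits matter: if all chains had started (and stayed) in the same place, B would be small and Rhat could look fine even though the posterior was never fully explored.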
If your model is complicated enough to see important efficiency gains from judiciously chosen inits, and you’ve done enough diagnostic checking to feel comfortable that you no longer need random overdispersed inits, then you should be able to do even better than the mode by initializing chains at random draws from the approximate posterior (as determined by your previous diagnostic runs). Besides avoiding a run of the optimizer to find the mode, an additional advantage is that the (multivariate) mode is often completely outside the “typical set” where exploration happens.
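As a sketch of that suggestion (the parameter names `alpha` and `beta` and the saved draws from an earlier diagnostic run are hypothetical), one could build per-chain initial values by sampling rows from the previous run’s draws:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for draws saved from an earlier diagnostic run,
# shape (n_draws, n_params); in practice you would load real draws.
previous_draws = rng.normal(loc=[2.0, -1.0], scale=[0.5, 0.3],
                            size=(4000, 2))

# Start each new chain at a random draw from the approximate posterior,
# instead of at the (possibly atypical) mode.
n_chains = 4
idx = rng.choice(previous_draws.shape[0], size=n_chains, replace=False)
inits = [dict(zip(["alpha", "beta"], map(float, draw)))
         for draw in previous_draws[idx]]
```

Most interfaces accept per-chain initial values in roughly this form; for example, CmdStanPy’s `CmdStanModel.sample` takes a list of one dict per chain via its `inits` argument.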
My suspicion is that running an optimizer to find the mode and initializing there looks attractive primarily in cases where it turns out that your confidence in the computational aspects of the model is (not necessarily wrong but) optimistic.
That’s a good point. Indeed, if I use optimization results to initialize a model with a larger number of parameters, it might be a bad initialization. I guess to make it work the optimization really needs to be well tuned.
In theory, given infinite time, the starting point doesn’t matter.
In practice, when you want the computation and diagnostics to be robust and safe, it’s better to start from different, overdispersed locations.
In practice, an optimization result as a starting point can be a useful shortcut in early iterations of the workflow, but then you need to be more careful to check that the results are sensible. I give an example of this in this talk and the accompanying case study, and Charles discusses initial value problems in this talk and in a case study which is part of the Bayesian workflow paper.