I know the general advice is to initialize Stan (or any MCMC algorithm) from widely dispersed values relative to the posterior. My understanding is this helps catch issues like multimodality or chains not mixing well, which might not occur if started from the same point.
But if the idea is to find local minima, why not use the optimizer? Start from the same dispersed values and optimize and check to see they all find the same mode (assuming for now it exists). It seems much more computationally friendly, as you could probably try orders of magnitude more inits than the say 4 chains you run.
What would be bad about doing this: using optimization on more points, and then starting from the mode that is found, so that during warmup the chains quickly move to the typical set.
I ask because in a case where a model is very slow, it seems this method could be used to save some warmup time, maybe shaving off half of the warmup or more, which could be hours of runtime. Plus, it would seem you would be more likely to find local minima than just using 4 NUTS chains.
In high dimensional case, the typical set is usually far away from the mode (see, e.g. Bob’s case study Typical Sets and the Curse of Dimensionality). Plus it’s more difficult to diagnose bad mixing if they start from the same point.
In my experience trying this, the chains converge to the typical set much, much faster if started at the mode vs way out in the tails. I’m talking models in the 100-300 parameter range.
Plus it’s more difficult to diagnose bad mixing if they start from the same point.
Could you elaborate beyond multimodality why this would be? What other properties of the posterior would be problematic? That’s exactly what I’m trying to understand better here. Thanks!
Yeah I wouldn’t recommend starting way far off in the tails for high dimensional models. The default is to initialize between -2 and 2 on the unconstrained space. If the parameters are reasonably scaled this should tend to be a good range for random initialization, but certainly not in every case, especially with complicated high dimensional models. It’s not uncommon that one needs to narrow the range for sampling to have a chance.
I think it’s really just that additional starting points are additional opportunities to find pathological regions of the posterior.
It’s also that Rhat is ratio of overestimate and underestimate of variance, but that overestimate is overestimate only if the starting points are diffuse. If you start all the chains from the same point, the Rhat diagnostic is not able to diagnose mixing problems.
Sure, there are models with no mixing problems and there are plenty of easy models with 100-300 parameters, but also if you started all chains from the same point then the convergence diagnostic is more likely to fail and indicate convergence even if the chains have not yet fully explored the posterior distribution.
In addition to what Aki and Jonah have noted, trying to restrict a Markov chain to a single mode in a multimodal model is usually a bad idea. Firstly even if you wanted to do that you’d still want to overdisperse your initial conditions in the neighborhood of that one mode to maximize the utility of the diagnostics and adaptation. Secondly there’s no guarantee that the chains would stay in that neighborhood – yes it’s hard to jump between modes but the chains will do it eventually and the more chains you run the more likely you’ll see a transition within a finite amount of time. In other words, starting chains around one mode does not guarantee that they will stay there.
By far the best strategy is to isolate what properties of your model manifest in a multimodality and modify your model to be properly identified so that the computation has the best chance to do well.
It’s not uncommon that one needs to narrow the range for sampling to have a chance.
By have a chance I assume you mean behave properly long enough to converge to the typical set? That makes sense to me. So there’s inits that are too underdispersed (mode) and inits that are too overdispersed as well. If we were to put this in terms of the posterior, would we want inits to be something like 1-5 SDs away for each posterior mode? Does that change as model size increases? That would be a more interpretable thing for us end users and I think helpful in thinking about how to write functions to generate random inits.
@betanalpha fair point and I understand that. My original quesiton was if you are worried about multimodality, why not use the optimizer to try and test for it? I just feel that starting 10000 random inits and optimizing is more likely to find another mode than 4 NUTS chains, all being evenly dispersed.
It seems to me from @avehtari that there are other pathologies to be tested for. I guess for me I have yet to run across any of them in the Stan models I’ve fit so it’s hard to imagine what to be cautious of.
Your 10000 optimizations land at 10000 different parameter values. How do you map out which of those are in similar modes and which are not? How do you discriminate between separated modes and a long, non-identified ridge of degenerate optima?
In either of those cases you feel good that you at least caught the multimodality and know you have to rework the model. Isn’t that more likely to find it than 4 NUTS chains? Do you think it’s not useful at all? I mean I thought this was a pretty widely accepted practice for numerical optimization. If all 10000 inits get to the same mode then you have more confidence in your model.
Diagnosing the issue is a separate issue. We see both of these in fisheries models sometimes. In the cases I’ve seen it’s quite easy to tell multimodality by looking at the log-density (or log-likelihood) values and parameter values. In the ridge case, the objective function will be the same and have a non-positive definite Hessian.
But regardless, the idea is to catch these issues and I still am not seeing why 4 NUTS chains are going to do a better job than 10000 optimizations. I’m not saying we should be doing optimization for inference, but rather use it for diagnostics since it is usually orders of magnitude faster than integration.
Of course I’m talking about the subset of models which have a mode, or for hierarchical models where we can use marginal maximum likelihood if available.
Chains explore entire neighborhoods, filling out the connected components of the posteriors and making it easier to differentiate between a diffuse but connected posterior and a truly multimodal/mixture posterior. A single optimization cannot do this and there is no theoretical guarantee that an ensemble of optimizations will provide any useful exploration as what they return will depend strongly on the termination conditions of the optimization algorithm and the structure of the target/parameterization used.
Even when running multiple chains, looking at log density plots is also not particularly helpful as you’re limited to one or two dimensional projections or marginalization of the high-dimensional target which can either fake multimodal structure or mask it.
Ultimately optimization is trying to solve a different problem than sampling. There is some overlap in the objectives but the fundamental difference in the problem means that there is no clean translation from the behavior of one into the behavior of the other. All heuristics will fail, and in high-dimensional spaces those heuristics will typically fail silently with no indication that something is wrong.
It’s not just multimodality—we want to make sure everything’s mixing and not getting stuck in regions of the posterior. For finding modes in cases where they exist, then sure, optimization’s a way better bet. You can see this with something like K-means. But that kind of multimodality is typically easy to diagnose on first principles without having to run any chains at all.
A further benefit to running multiple chains is that we can use the results from all of them.
PyMC3 tries to start with a variational approximation, which deals with non-existent modes in hierarchical models. But we’ve found it even harder to get ADVI to converge than NUTS, so we don’t tend to recommend that (not sure if their algorithm is tweaked somehow differently than ours).
What I really think has some promise longer term is to start an optimization and somehow figure out when you hit the typical set on the way to the mode or to divergence. Somewhere along the path you’ll go through the right energy levels.
Thanks for that. I wasn’t suggesting to scrap multiple chains, but essentially do what you said below: start dispersed and then run the optimizer and start the chains after relative convergence. As for “getting to the right energy levels” – it usually takes just a few iterations to get from the mode to the typical set so why bother with fussing where exactly to stop? (Presuming of course there is a mode and it’s reliable).
I’m still hoping someone will answer my more general question of how far into the tails it should be. If we have an N dimension posterior that is approximately multivariate normal, how many SDs away from the medians should we be? 3? 20? Depends on N and correlation?
Basically, as a user when I run a Stan model and explore the log-posterior in ShinyStan what is the warmup supposed to look like? 5000 units away from the typical set? 50000? Or maybe a better rule of thumb is approximately 50 iterations to get from inits to typical set?
Have you investigated how much of warmup is being spent finding the typical set and how much mixing through it? We need to mix through it in order to estimate the mass matrix and consequently step size for that mass matrix.
Usually finding the typical set isn’t a huge fraction of this total cost.
The main problem we have with optimization for initialization is that the typical models we work with don’t have modes. PyMC3 uses variational inference, but the problem we have with that is that we can’t make it reliable.
The second problem is that in high dimensions, the mode can be very far away from the typical set in energy. Of course, so can random draws on the unconstrained parameter space. These situations introduce different problems for adaptation. The mode is too peaky and the tails too flat, so step sizes are too small around the mode and too large in the tails.
This happens in your example, and probably in many others, but you can’t assume that it would happen for all models. If all chains are started close to same point and all chains get stuck, Rhat diagnostic may indicate convergence although the typical set has not yet been explored. Rhat diagnostic is more reliable if chains are started overdispersed. It’s easy to come up with multi-modal examples where getting stuck is likely. There are many latent Markov and GP models, where the posterior might be unimodal but the exploration can still be very slow,
For the fastest convergence of MCMC, it would be best to initialize with draws from the posterior as it’s then converged already. But if we would be able to initialize from the posterior, we don’t need MCMC.
Rhat diagnostic is most reliable if the initial points have larger variance than the posterior. Overdispersed initial points can be in typical set, they are overdispersed if the variance estimate from them is larger than the posterior variance. It’s enough that the variance of initial points is just slightly larger like 10%. But the problem is that we rarely know the posterior variance beforehand, so reasonable choice often would to draw initial values from the prior as that is usually overdispersed compared to the posterior. Stan doesn’t know which part of target is prior and which part is likelihood, so you would need to make these draws yourself. Stan’s default random initialization draws randomly from a uniform interval in unconstrained space (remember it doesn’t know which part is prior to help automatically to do something better) and these would be on expectation overdispersed for distrbutions with mean close to zero and approaximately unit variance, which of course doesn’t hold for all models, and this initialization can also be far from prior or posterior.
Further complication comes from thick tailed distributions which have infinite variance, so we need to define overdispersion in terms of quantiles etc.
If you choose your initial values well, you start in the typical set, but still have overdispersed initial points to get better reliability of Rhat. But at least try to get the initial values in the typical set of prior.
If you have a specific model and prior and you have tested that this works (see, e.g., [1804.06788] Validating Bayesian Inference Algorithms with Simulation-Based Calibration) and makes your sampling faster and you get repeatedly good results, then just use it (but in publication be prepared to tell that you have tested this for your specific model and prior). As we don’t for which models and priors the next person is going to use Stan, we have to keep recommending by default to use overdispered initial values which make the convergence diagnostic more reliable.
@avehtari That’s very interesting about starting in the typical set. I always interpreted “diffused relative to posterior” as meaning outside of this set. So, as a thought experiment, if you knew the posterior a priori and could sample from it, then a random sample would make for an ideal initialization?
@Bob_Carpenter Well in my case these are fixed effects models that are pretty well behaved in the typical set and MVN near the mode but have craziness way out in the tails. Plus, there are big differences in parameter scales and strong correlations so choosing good inits is a real challenge. But the optimizer always finds the same mode, which is why my field does that. (and yes I realize the models should be improved …) So how much time it spends finding the typical set varies really widely, but once there it adapts and samples very well. The other issue is the models are changed quite frequently, as alternative structures are tried (parameters assumed constant, etc.) and data sources swapped in/out, so the need to come up with inits needs to be easy and flexible.
Without a doubt NUTS has allowed us to look at these models in a new light and quickly identify ways to improve their structure. It’s really valuable for our field, but there’s still a big gap between status quo and a principled Bayesian workflow. But, small steps…
Yes. If you know it and can sample from it, you don’t need MCMC. If you know it and can sample from it, you don’t need warmup. If you know it and can sample from it, you don’t need convergence diagnostics. If you know it and can sample from it you don’t need overdispersed initial values. But since we don’t know it…
Yes obviously it’s a hypothetical, but it actually totally clears up what I should be looking for after I run my model wrt initial values (i.e., not as dispersed as I thought). So that’s helpful. Thanks.
Yeah. If we already know the right answer we would just start with the right answer ;) But since we never do know it in advance, the most robust general purpose way to find it and to diagnose when we can’t find it is to disperse the initial values.
(Edit: wrote this before seeing Aki’s response. It’s basically the same.)