I don’t think there’s any “best practice” as the practice is inherently quite tricky and not widely recommended. If possible, one should always strive to avoid the multimodality completely.
If the modes are indeed very well separated, then setting inits should work quite well. If the separation is not very strong, some chains will try to switch between the modes, which will usually result in divergent transitions. In those cases, hard constraints that restrict each chain to an individual mode could work a bit better (or not).
To combine chains sampling different modes, you’d probably want to do Bayesian stacking instead of WAIC.
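To make the stacking idea a bit more concrete, here is a minimal sketch of how per-chain weights could be computed. It assumes you already have a matrix `lpd[i][k]` of pointwise leave-one-out log predictive densities for data point `i` under chain `k` (computing those is not shown). The function name `stacking_weights` and the simple EM-style fixed-point update are my own illustration, not what any particular package does internally.

```python
import math


def stacking_weights(lpd, iters=500):
    """Weights on the simplex that maximize
    sum_i log(sum_k w_k * exp(lpd[i][k])).

    lpd[i][k] is assumed to be the held-out log predictive density
    of data point i under chain k.
    """
    n, k = len(lpd), len(lpd[0])
    w = [1.0 / k] * k  # start from equal weights
    for _ in range(iters):
        new_w = [0.0] * k
        for row in lpd:
            m = max(row)  # subtract the row maximum for numerical stability
            terms = [w[j] * math.exp(row[j] - m) for j in range(k)]
            total = sum(terms)
            for j in range(k):
                # share of point i's predictive density attributed to chain j
                new_w[j] += terms[j] / total
        w = [x / n for x in new_w]
    return w


# Chain 0 predicts every held-out point better, so it should get
# almost all of the weight.
lpd_example = [[-1.0, -3.0], [-1.2, -2.5], [-0.8, -4.0]]
weights = stacking_weights(lpd_example)
```

The resulting weights can then be used to resample draws from each chain in proportion to its weight, instead of pooling the chains equally.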
In some models, there is also a third way to do this:
- Introduce a discrete variable that splits the parameter space such that each potential mode belongs to a different value of the discrete variable
- Marginalize the variable out
Now you can sample both modes in a single chain. There’s an example of this approach in the thread “Ideas for modelling a periodic timeseries” (post #25 by martinmodrak), where I index a set of possible modes over a frequency spectrum.
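The marginalization step can be sketched roughly like this. It is a toy illustration, not the model from the linked thread: I assume a discrete index `z` that places a location parameter near one of a few hypothetical `centers`, with a continuous offset `delta` that the sampler would explore. All names here (`centers`, `delta`, `mode_terms`, etc.) are made up for the example.

```python
import math


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def mode_terms(ys, centers, log_prior_z, delta, sigma):
    """log p(z = k) + log p(ys | z = k, delta, sigma) for each k."""
    return [
        lp + sum(normal_lpdf(y, c + delta, sigma) for y in ys)
        for c, lp in zip(centers, log_prior_z)
    ]


def marginal_log_lik(ys, centers, log_prior_z, delta, sigma):
    """Log likelihood with the discrete variable z summed out."""
    return log_sum_exp(mode_terms(ys, centers, log_prior_z, delta, sigma))


def mode_probs(ys, centers, log_prior_z, delta, sigma):
    """Posterior probability of each z given the continuous parameters."""
    terms = mode_terms(ys, centers, log_prior_z, delta, sigma)
    total = log_sum_exp(terms)
    return [math.exp(t - total) for t in terms]


ys = [2.9, 3.1, 3.0]
centers = [-3.0, 3.0]
log_prior_z = [math.log(0.5), math.log(0.5)]
ll = marginal_log_lik(ys, centers, log_prior_z, delta=0.0, sigma=1.0)
probs = mode_probs(ys, centers, log_prior_z, delta=0.0, sigma=1.0)
```

In Stan the same thing would be a `log_sum_exp` over the per-mode terms added to `target`, and the per-mode probabilities can be recovered in generated quantities if you want to know which mode each draw favours.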
Best of luck with your model!