Well, it depends on what you mean by “works”.
ADVI itself isn’t great, and we are genuinely working to make it better. But there’s a natural barrier (the VI approximation itself) that is likely to prevent full posterior exploration. These problems have been known for years (I am saying nothing new or insightful here, nothing that anyone who has ever touched VI shouldn’t already know).
For complex models, it’s unlikely that this barrier will be fixed any time soon (the key here is complex).
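To make the barrier concrete, here is a minimal sketch of the textbook case. It assumes a factorised (mean-field) Gaussian approximation fitted by minimising KL(q||p) to a bivariate Gaussian target with unit variances and correlation rho. The function name is just for illustration. In that setting the optimal factor variance is the reciprocal of the diagonal of the target's precision matrix, so the approximation gets narrower as the correlation grows, even though the true marginal variance stays at 1.

```python
def mean_field_variance(rho):
    """Variance of each factor in the KL(q||p)-optimal factorised Gaussian.

    Assumed target: zero-mean bivariate Gaussian with unit marginal
    variances and correlation rho (an illustrative toy, not ADVI itself).
    """
    det = 1.0 - rho ** 2          # determinant of [[1, rho], [rho, 1]]
    precision_diag = 1.0 / det    # diagonal entry of the inverse covariance
    return 1.0 / precision_diag   # optimal mean-field variance


for rho in (0.0, 0.5, 0.9, 0.99):
    # The true marginal variance is 1.0 for every rho.
    print(rho, mean_field_variance(rho))
```

A full-rank Gaussian would fix this particular toy, but not a posterior that isn't Gaussian to begin with.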
For simple models with a lot of data, you typically end up with a posterior that looks like a very concentrated Gaussian, so these sorts of methods can work. Even if they misstate the uncertainty, the point estimate may be enough. But I have serious doubts in most cases about simple models being fitted to enormous data sets. Basically, I doubt people are thinking about the problems with the data gathering mechanism in these cases. I usually expect models to get more complex as you get more data, rather than staying the same.
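A rough illustration of the "concentrated Gaussian" point, under assumptions of my own choosing: a Bernoulli model with a uniform prior, so the exact posterior is Beta(successes + 1, failures + 1), compared against a Gaussian approximation centred at the mode. The function names and the 30% success rate are made up for the example.

```python
import math


def beta_posterior_sd(successes, trials):
    """Exact posterior sd for a Bernoulli rate under a uniform prior."""
    a = successes + 1
    b = trials - successes + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


def laplace_sd(successes, trials):
    """Sd of the Gaussian approximation at the posterior mode."""
    p_hat = successes / trials
    return math.sqrt(p_hat * (1.0 - p_hat) / trials)


for trials in (10, 1_000, 100_000):
    successes = 3 * trials // 10  # hypothetical data: 30% successes
    exact = beta_posterior_sd(successes, trials)
    approx = laplace_sd(successes, trials)
    # The ratio approaches 1 and both sds shrink as trials grows.
    print(trials, exact, approx, approx / exact)
```

With ten observations the Gaussian sd is visibly off; with a hundred thousand it is essentially exact, which is the regime where a point estimate plus a crude uncertainty statement is least likely to hurt you.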
As for stochastic optimisation, that’s not that far from ADVI. It’s optimistic to expect VI to give much more than a point estimate anyway, so it’s really just a specific optimisation method.
If you want optimisation, Stan has BFGS, which works quite well on big data (there was an example somewhere at some point of some big company [maybe Facebook] using it to do something). The problem with optimisation methods is that for a hierarchical model you really do not want the joint mode. You want conditional modes. At present, Stan doesn’t let you do that, but work is underway to relax those restrictions.
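Here is a small sketch of why the joint mode is the wrong target, using a toy model I am assuming for illustration: y_j ~ N(theta_j, 1), theta_j ~ N(0, tau^2), flat prior on tau. The data are made up. The joint density in (theta, tau) is unbounded as tau goes to 0 with theta at 0, so an optimiser chasing the joint mode collapses the hierarchy. One reading of "conditional modes" (my interpretation) is to integrate theta out, maximise over tau, and then take the mode of theta given that tau.

```python
import math

y = [-2.1, -0.4, 0.3, 1.2, 2.8, -1.6, 0.9, 1.9]  # made-up group estimates


def joint_log_density(theta, tau, y):
    """Unnormalised joint log density of (theta, tau) under the toy model."""
    return sum(
        -0.5 * (yj - tj) ** 2 - math.log(tau) - 0.5 * (tj / tau) ** 2
        for yj, tj in zip(y, theta)
    )


# With theta fixed at zero, the joint density grows without bound as tau -> 0.
zeros = [0.0] * len(y)
for tau in (1.0, 1e-2, 1e-4):
    print(tau, joint_log_density(zeros, tau, y))


def marginal_tau_hat(y):
    """Maximiser of the marginal likelihood, where y_j ~ N(0, 1 + tau^2)."""
    mean_sq = sum(yj ** 2 for yj in y) / len(y)
    return math.sqrt(max(0.0, mean_sq - 1.0))


def conditional_modes(y, tau):
    """Mode of each theta_j given tau: y_j shrunk toward zero."""
    shrink = tau ** 2 / (1.0 + tau ** 2)
    return [shrink * yj for yj in y]


tau_hat = marginal_tau_hat(y)
print(tau_hat, conditional_modes(y, tau_hat))
```

The second route gives a sensible scale and partially pooled group effects, whereas the first just runs off to tau = 0.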
(Incidentally, I strongly agree that a well-deployed point estimator is better than nothing, and in a lot of these cases as good as you can get with a reasonable computational budget. In my field [spatial statistics], we call it “pragmatic Bayes”, where you build a Bayesian model and compute something from it to the best of your ability under your particular constraints.)