Strictly speaking, the optimizer computes the maximum of the objective function, not the maximum of the likelihood. The objective function includes the prior too. That said, when putting together a Stan model, the distinction between prior and likelihood is not clear, as each is just more data, as far as the objective function is concerned.
Is there an integration with Stan that works that we can turn into a pull request? Lots of things work in theory that are hard to integrate. My understanding is that this requires a way to partition parameters and that nobody wanted to do this the inefficient way, so therefore it needs changes to underlying Stan language that have yet to be clearly specified so that we can build them.
@andrewgelman has been working on the GMO algorithm, which has never been fast enough for him to want to include in Stan. I think part of that may be related to partitioning and another part may be the need to tune the nested MCMC to be robust but not too inefficient.
Most of our models simply don’t work with optimization because they’re hierarchical. I’m not so concerned about comparisons to other methods, but that’s just me.
The way it is set up now gets the usual MLE and MAP estimates on the natural scale rather than giving you the MLE and MAP of the uncontrained model with Jacobians.
This already is a flag in the C++. It’s just a matter, like everything else, of someone taking the time to plumb the C++ through to the interfaces.
I’m suggesting that we might want to focus on full Bayes and we might want to focus on implementing tried-and true algorithms in a stable package. I’m all for adding a new full Bayes algorithm if we can demonstrate it’s better than NUTS in a useful range of cases without hand tuning. Gibbs isn’t a good match for the language, Metropolis doesn’t really mix well in high dimensions, and neither do ensemble methods. SMC perhaps for some structured problems? Is there a full Bayes algorithm out there you think should be in Stan and isn’t? A lot of people are working on algorithms research, and even evaluating those algorithms is a huge undertaking. If we did anything, I’d suggest we focus on that, but it’s not so fundable.
There are other projects, like INLA and lme4, that build inference algorithms for hierarchical models that aren’t full Bayes. I think we should consider ceding the ground to those projects rather than trying to build all the styles of inference we like into a single package.
I seriously want to avoid becoming the scikit-learn of stats where the user is faced with a grab bag of techniques and encouraged to explore a cross-product of models and inference algorithms for problems.
If we want to try to maximize audience, then we should be going after the data science crowd learning Python and learning Bayes and learning regression and concentrate on relatively simple pre-canned models. If we want to maximize the intermediate audience, that’s brms and rstanarm and building the same tools in Python and Julia.
Does that mean you think it shouldn’t have been included in Stan? If not, where do you think the robustness (how many problems can it just fail on) and the accuracy (how good is it as an approximation of full Bayes) need to be? And how black-box does it need to be? Can we build a generic INLA that works for any Stan model by partitioning the parameters into two sets?
I guess I should ask up front if you think of algorithms like INLA and lme4 as a kind of approximate Bayes the way @andrewgelman does? In which case, how do we evaluate w.r.t. the full posterior? Is ADVI mean-field just a non-starter because there’s no covariance, for example?
Let me substitute “latest” for “current”. And yes, there’s a version that @seantalts implemented that didn’t pass muster for going into Stan (I wasn’t involved, but I’m guessing @andrewgelman didn’t think it was fast or robust enough).
My point’s just that we don’t have a candidate algorithm for going into Stan yet! Or even a design for one that I know of.
So let’s talk about specifics when there’s a candidate on the table. Until then, we’re doing algorithms research.
Absolutely. I think our strength now is the language, robust algorithms, and community, as well as a lot of auxiliary packages like brms and rstanarm.
I wouldn’t be. How did YouTube beat Google at being YouTube? How did Facebook beat Google at being Facebook? How’d Instagram grow up in the shadow of Google and Facebook? Or What’s App? The problem is that scale makes you diffuse, so other people can jump in and build specialized things better than juggernauts can, because they can focus.
My worry is that if we spread ourselves too thin, we won’t be the best at anything.
Sure, if Google or Facebook decided they wanted to assign 20 engineers to a project for a couple years and own the full Bayes market for applied statisticians, they could. There’s nothing we’re doing that’s a secret sauce. But it’s not a particularly profitable market and owning all of SAS wouldn’t even dent their bottom line.
Who’s better funded than us to work on full Bayes? I don’t mean who’s better funded than us, but who’s better fundd to work on full Bayesian inference? I don’t know of anyone. NONMEM’s better funded for ODEs than us, but they stick to a limited set of models and mainly focus on point estimation. TensorFlow Probablity and Pyro/PyTorch are way better staffed than us (despite their employees costing way more than anyone who works on Stan), but are focusing on variational inference algorithms that run on clusters and fit neural networks. Google could own SAS if they wanted that market but they don’t (SAS takes in $3G/year and Google $100G/year).
Better, more stable interfaces and I/O formats, for a start.
You may not, but Stan’s a perfectly reasonable platform that lets you write down any old penalized likelihood you want and perform MLE. There’s the issue of us working on the unconstrained scale, but not applying the Jacobian—in that way, I think we’re like lme4.
I didn’t mean to imply otherwise. I’m pretty sure my writeup of the algorithms is correct—GMO is a form of MCMC-EM.
“lme4” is the name of the package and “glmer” is the name of a function in that package.
I agree, too!
You are less patient than I am. I haven’t felt the need to use approximate algorithms, but then I’ll wait a day for NUTS to finish a run. What’s your time cutoff at which point you say you need approximate algorithms? And then how do you test the inferences from the approximate algorithm to know what you’re losing from the approximation?
How are you getting implementations of EP, VB, and GMO that you can use? What I’m confused about is that everyone (OK, just Krzysztof, Andrew, and Dan and Aki) is talking as if these other algorithms exist and it’s only stubbornness on the part of the Stan developers that’s keeping them out of Stan. What I’m missing is the algorithms that will work for Stan. We tried GMO and it’s a bust as far as I can tell. We tried ADVI and I’m still not sure where that stands. What next?
I get that faster is better. But I don’t get when the approximations are OK. If they’re always OK and nobody cares about full Bayes, why have I wasted all this time on Stan? If they’re not always OK, how do we convey when they are OK to users?
How hard what is to program? I feel like I’m in a house of mirrors here. Is there an algorithm on the table that does efficient approximate marginal optimization that we haven’t evaluated that you think has potential? This makes it sound like there’s an algorithm somewhere that’s just waiting for a programmer. Believe me, if it’s only programming, we can make it happen. The hard part is evaluation, and that requires serious statisticians, which are much harder to find.
Let me ask this another way. Do you care about the max marginal posterior modes in and of themselves, or is it just that you’ll take them if you can get them fast. That is, let’s say we build GMO that runs about as fast as NUTS. Does that have any use? If not, then I’m less keen on building a fast, but inferior inference algorithm. Not to say others can’t do it or I’ll try to block it getting into Stan, just that I’m not personally that keen on lossy approximations.
This is already possible at the C++ level and just needs the flag to be plumbed through the C++ to the services, and from there to all the interfaces. Anything that has to hit multiple repos like this takes a long time to propagate through.
Correct, because that’s what everyone insisted on when we first built it. As I recall, it was a fight with @andrewgelman to get optimization in because he didn’t like it. Matt and I just thought it’d be easy given that we had an objective function and gradients. I had to do a ton of extra work to turn those Jacobians off for the MLE calculations.
The sampling log density includes the log Jacobian adjustments for the inverse transforms for the constrained parameters. The optimization objective function is different in that it does not include the Jacobian adjustments for constrained parameters. This is so that it computes the MLE on the natural parameter scale with priors taken as penalty functions, not the posterior mode on the transformed scale with Jacobians included. Everyone insisted this was the right thing to do when I built it, even though I couldn’t really follow what was going on at the time. I’ve since written a case study to try to explain how this all works in theory and in Stan.
Just a data point re the optimization / other algorithms thing – I’m using optimization and would be unhappy to see it go, and it wouldn’t exactly be pleasant from a backwards compatibility side of things – I coded up a framework for handling this (optional importance sampling etc), for instance.
Some thoughts after reading Bob’s post
- A specific version of GMO has been not-fast, but the loose definition of GMO includes INLA, which causes confusion almost every time GMO is discussed (it seems likely to be the case in this thread, too).
- Full Bayes vs approximative Bayes: I assume full Bayes here means that we can get arbitrary accuracy in integration over the full posterior. In MCMC this may mean infinite time, but it’s still usually called full Bayes. For a very large set of models, INLA is more accurate than the default 4000 draws by Stan. Can we then say that INLA is full Bayes? We don’t get arbitrary accuracy, but we can get something which is for practical purposes better than full MCMC. Using IS with Laplace we can also get an arbitrary accuracy.
- INLA includes the nested LA, and for Stan we could just have first ILA. The nested part can be used to compute non-Gaussian marginals.
- The models where Laplace approximation works great are Gaussian latent variable models. Gaussian process is one example. When a Gaussian prior is combined with Gaussian likelihood we can integrate over the latent values exactly. When a Gaussian prior is combined with non-Gaussian log-concave likelihood, we can use Laplace approximation to integrate over the latent values usually more accurately than MCMC. It will be these special models, but the work the user makes is not different to writing models using ODEs, GPs, 1D integrator, etc.
- INLA (even without the nested part) is very accurate for a very large set of models. Adding it requires some work and there is some work slowly going on. It can be made to work without changes to the language. The changes in the language later may be useful to make it easier to use. We don’t need to make partition of parameters (oh, I prefer to talk details of this live)
- So why add Laplace approximation if there is INLA software implementing that well already? INLA can currently handle a very large number of latent parameters, but only max 15-20 higher level parameters, while Stan could handle thousands of higher level parameters and there are plenty of that kind of models, too.
- Based on Matt Hoffman’s recent work (an abstract and part of the poster online) it might be possible to automatically detect parts which have Gaussian latent variable model structure and could be integrated away using Laplace approximation, but that would be much after the specific Laplace approximation.
Yes, but did it work? http://proceedings.mlr.press/v80/yao18a.html
There is pull request by @yuling, but there is some reason why it’s not yet merged.
We could add full covariance using linear response theory as shown in
Covariances, Robustness, and Variational Bayes
with examples using Stan (but no pull request as far as I know)
- Laplace for marginalizing latent variables in Gaussian latent variable models with log-concave likelihoods usually has the accuracy of full Bayes. We have a diagnostic to check.
- Laplace for the joint posterior in general is not full Bayes, but it can have the accuracy of full Bayes. We have a diagnostic to check.
- Meanfield ADVI for the joint posterior in general is not full Bayes, but it can have the accuracy of full Bayes. We have a diagnostic to check.
- There are people who are happily using meanfield ADVI in Stan, but they seem to bit shy in reporting the use in public.
- There are some ways ADVI could be improved, but there is much uncertainty in which cases these would make a big difference, so this is clearly in the algorithmic research part.
Thanks, Aki—that was really informative.
I’ll stop using the term “GMO”, since I clearly don’t understand what it means. I should just take it out of that writeup—I was just going on what Andrew wrote in the paper and hadn’t realized it was a more general concept.
I didn’t realize INLA was using gradients (the “G” in GMO) and I had assumed that the nested MCMC expectation calculation was central to GMO, since that’s what led Andrew to the idea in the first place from ADVI.
Cool. I’m even more confused about what INLA is than I am about what GMO is. Dan tried to explain it to me, but it left me with more questions than answers. It sounded like it needed a lot of custom work per model and that it used quadrature for the nested integral. That’s about what I got out of it.
I’m going to try to stop using “full Bayes” as it’s also confusing. One sense is whether the predictive inference is Bayesian (averages over posterior uncertainty), whether it be an approximate posterior or not. The second is whether it’ll get arbitrary accuracy or not.
This second point is confounded by finite accuracy. I’m used to thinking about things like variational inference where you wind up not getting very close to the true posterior (in a variational sense), especially in mean field versions.
How’s the efficiency? That would be super simple to implement.
I have no idea what “ILA” refers to here (or if it was supposed to be “INLA”).
Why is INLA limited in this way?
Also, is INLA an R package, a general technique, or what? I’m really confused on all this nomenclature.
Thanks for reminding me. I honestly didn’t realize that Yuling had produced a practicaly useful method. It just looked like a lot of dense math at StanCon! I looked at the paper you linked, which says:
If kˆ < 0.5, we can invoke the central limit theorem to suggest PSIS has a fast convergence rate. We conclude the variational approximation q is close enough to the true density.
I’m not sure what “close enough” means here. Is it some kind of variational measure like KL-divergence? Or is it going to say something about how accurate expectations are?
I guess I need to go read your other paper. I’ll put it on the queue after learning OCaml.
It wasn’t passing tests until I restarted Travis yesterday. Also, Ben Bales has change requests that haven’t been addressed. The PR is here:
That’d be great. Just to be clear, when you say “accuracy of full Bayes”, does that mean there’s similar expected error in expectations given similar compute time budgets?
You mean on the marginal posteriors? Obviously there’s no covariance (unless we start implementing Tamara’s methods [I’m assuming that’s where your link led earlier]).
I assume they confided in you privately. Are you allowed to say what they are using it for? Otherwise, I have no idea what to do with this information. Lots of people are happily running BUGS in a single chain with no convergence diagnostics at all, but I hardly want to take that as a sign that we should be doing it!
It started as more narrow concept, but then extended and turned to be not well defined.
Andrew’s original GMO idea uses the gradient of the joint from autodiff, and then tries to do the marginalization without explicit form for the integral. INLA and LA can use the gradient of the marginal density directly as it has analytic form.
I think you got the most important parts. Below are combined answer to many later questions
- INLA is both and algorithm and software
- INLA is acronym for integrated nested Laplace approximation
- LA part uses Laplace approximation to compute marginal density, and we can also compute gradients of the marginal with respect to not marginalized parameters
- LA part requires custom work per model family, but that custom work covers very wide range of common models used e.g. in rstanarm and brms and which could be used also as building blocks in plain Stan. We could think these also as re-usable compound lpdf/lpmf functions
- I part uses integration to integrate over the not-marginalized parameters. INLA software uses central composite design (CCD) for this integration. CCD requires much less marginal density evaluations than MCMC, but doesn’t work well beyond 15 dimensions. It doesn’t work well beyond 15 dimensions, because it gets really difficult to find the typical set in higher dimensions. Stan would use MCMC instead (slower but scales to higher dimensions).
- N part uses nested Laplace approximation, also known as Tierney-Kadane-Laplace approximation to compute non-Gaussian marginals of the marginalized-parameters. Direct marginal from Laplace approximation would be Gaussian, but using another level of Laplace approximation it is possible to evaluate marginal density (usually in 1D) in a grid and then extrapolate rest. In many cases N part is not needed, but it is there for more difficult cases. We donät yet have plans for adding N part in Stan. Instead of nested Laplace, we could use importance sampling (and I have used this approach in GPstuff)
- ILA would be then integrated Laplace approximation
Even the mean field VI can sometimes get very close (in the sense of able to get arbitrary accuracy with importance sampling correction)
The diagnostic will tell the approximate convergence rate with increasing number of draws if the approximation is used as an importance sampling proposal distribution. We can estimate Monte Carlo error and how many more draws we need for the desired accuracy. Monte Carlo error estimates are described in “Pareto smoothed importance sampling” arXiv preprint arXiv:1507.02646. and they have been used in loo package since version 2.0
The cases where we would like to use Laplace approximation, we would expect to get similar error in less time. Of course this will hold only for the certain set of models, which just happens to be very popular.
In those cases where mean field it is accurate enough, we can get the covariance using importance sampling. Tamara’s method would be great help in cases where importance sampling with mean field proposal fails. And after using Tamara’s method we would check the improved approximation again with importance sampling diagnostic.
I know they are using it for that kind of models where they know that the posterior is simple and there is a lot of data. I can ask more, but I expect the answer to be vague as they are people working in companies.
Thanks—that clarifies a lot.
OK. That’s the part I still don’t understand properly.
That would presumably be when there’s not a lot of posterior correlation. Otherwise, in high dimensions, the draws to importance weight won’t be close enough.
That’s fantastic. And also thanks for explaining it in terms I can understand.
That’s neat, too. Could we also use dense ADVI (I think it’s called full rank or something in the interface)?
What happens with things like varying curvature in the posterior?
And thanks so much again for taking the time to lay down simple explanations of all this. It’s really helpful.
The nested Laplace part is the most complex in INLA, and it includes several shortcuts and approximations to get a significant speed-up but maintaining a good accuracy compared to basic Tierney-Kadane-Laplace approximation. One thing is that it is practical only for improving 1D and 2D marginals. Other thing is that in many cases it is not needed.
It seems dense / full rank ADVI is much more difficult to get to converge (also noted by Giordano, Broderick, Jordan), and no-one seems to know why. Understanding of VI algorithms is still far from understanding of Laplace and MCMC.
Any Gaussian approximation will fail at least when the number of dimensions increase.
No problem. I’m happy if my explanations are helpful.
Yet, variational inference is the most popular thing out there because it can be thrown at big problems with neural network models. This suggests we should think about methods for approximately validating and fixing VI.