Strictly speaking, the optimizer computes the maximum of the objective function, not the maximum of the likelihood. The objective function includes the prior too. That said, when putting together a Stan model, the distinction between prior and likelihood is not clear, as each is just more data, as far as the objective function is concerned.
Is there an integration with Stan that works that we can turn into a pull request? Lots of things work in theory that are hard to integrate. My understanding is that this requires a way to partition parameters and that nobody wanted to do this the inefficient way, so therefore it needs changes to underlying Stan language that have yet to be clearly specified so that we can build them.
@andrewgelman has been working on the GMO algorithm, which has never been fast enough for him to want to include in Stan. I think part of that may be related to partitioning and another part may be the need to tune the nested MCMC to be robust but not too inefficient.
Most of our models simply donât work with optimization because theyâre hierarchical. Iâm not so concerned about comparisons to other methods, but thatâs just me.
The way it is set up now gets the usual MLE and MAP estimates on the natural scale rather than giving you the MLE and MAP of the uncontrained model with Jacobians.
This already is a flag in the C++. Itâs just a matter, like everything else, of someone taking the time to plumb the C++ through to the interfaces.
Iâm suggesting that we might want to focus on full Bayes and we might want to focus on implementing tried-and true algorithms in a stable package. Iâm all for adding a new full Bayes algorithm if we can demonstrate itâs better than NUTS in a useful range of cases without hand tuning. Gibbs isnât a good match for the language, Metropolis doesnât really mix well in high dimensions, and neither do ensemble methods. SMC perhaps for some structured problems? Is there a full Bayes algorithm out there you think should be in Stan and isnât? A lot of people are working on algorithms research, and even evaluating those algorithms is a huge undertaking. If we did anything, Iâd suggest we focus on that, but itâs not so fundable.
There are other projects, like INLA and lme4, that build inference algorithms for hierarchical models that arenât full Bayes. I think we should consider ceding the ground to those projects rather than trying to build all the styles of inference we like into a single package.
I seriously want to avoid becoming the scikit-learn of stats where the user is faced with a grab bag of techniques and encouraged to explore a cross-product of models and inference algorithms for problems.
If we want to try to maximize audience, then we should be going after the data science crowd learning Python and learning Bayes and learning regression and concentrate on relatively simple pre-canned models. If we want to maximize the intermediate audience, thatâs brms and rstanarm and building the same tools in Python and Julia.
Does that mean you think it shouldnât have been included in Stan? If not, where do you think the robustness (how many problems can it just fail on) and the accuracy (how good is it as an approximation of full Bayes) need to be? And how black-box does it need to be? Can we build a generic INLA that works for any Stan model by partitioning the parameters into two sets?
I guess I should ask up front if you think of algorithms like INLA and lme4 as a kind of approximate Bayes the way @andrewgelman does? In which case, how do we evaluate w.r.t. the full posterior? Is ADVI mean-field just a non-starter because thereâs no covariance, for example?
Let me substitute âlatestâ for âcurrentâ. And yes, thereâs a version that @seantalts implemented that didnât pass muster for going into Stan (I wasnât involved, but Iâm guessing @andrewgelman didnât think it was fast or robust enough).
My pointâs just that we donât have a candidate algorithm for going into Stan yet! Or even a design for one that I know of.
So letâs talk about specifics when thereâs a candidate on the table. Until then, weâre doing algorithms research.
Absolutely. I think our strength now is the language, robust algorithms, and community, as well as a lot of auxiliary packages like brms and rstanarm.
I wouldnât be. How did YouTube beat Google at being YouTube? How did Facebook beat Google at being Facebook? Howâd Instagram grow up in the shadow of Google and Facebook? Or Whatâs App? The problem is that scale makes you diffuse, so other people can jump in and build specialized things better than juggernauts can, because they can focus.
My worry is that if we spread ourselves too thin, we wonât be the best at anything.
Sure, if Google or Facebook decided they wanted to assign 20 engineers to a project for a couple years and own the full Bayes market for applied statisticians, they could. Thereâs nothing weâre doing thatâs a secret sauce. But itâs not a particularly profitable market and owning all of SAS wouldnât even dent their bottom line.
Whoâs better funded than us to work on full Bayes? I donât mean whoâs better funded than us, but whoâs better fundd to work on full Bayesian inference? I donât know of anyone. NONMEMâs better funded for ODEs than us, but they stick to a limited set of models and mainly focus on point estimation. TensorFlow Probablity and Pyro/PyTorch are way better staffed than us (despite their employees costing way more than anyone who works on Stan), but are focusing on variational inference algorithms that run on clusters and fit neural networks. Google could own SAS if they wanted that market but they donât (SAS takes in $3G/year and Google $100G/year).
Better, more stable interfaces and I/O formats, for a start.
You may not, but Stanâs a perfectly reasonable platform that lets you write down any old penalized likelihood you want and perform MLE. Thereâs the issue of us working on the unconstrained scale, but not applying the Jacobianâin that way, I think weâre like lme4.
I didnât mean to imply otherwise. Iâm pretty sure my writeup of the algorithms is correctâGMO is a form of MCMC-EM.
âlme4â is the name of the package and âglmerâ is the name of a function in that package.
I agree, too!
You are less patient than I am. I havenât felt the need to use approximate algorithms, but then Iâll wait a day for NUTS to finish a run. Whatâs your time cutoff at which point you say you need approximate algorithms? And then how do you test the inferences from the approximate algorithm to know what youâre losing from the approximation?
How are you getting implementations of EP, VB, and GMO that you can use? What Iâm confused about is that everyone (OK, just Krzysztof, Andrew, and Dan and Aki) is talking as if these other algorithms exist and itâs only stubbornness on the part of the Stan developers thatâs keeping them out of Stan. What Iâm missing is the algorithms that will work for Stan. We tried GMO and itâs a bust as far as I can tell. We tried ADVI and Iâm still not sure where that stands. What next?
I get that faster is better. But I donât get when the approximations are OK. If theyâre always OK and nobody cares about full Bayes, why have I wasted all this time on Stan? If theyâre not always OK, how do we convey when they are OK to users?
How hard what is to program? I feel like Iâm in a house of mirrors here. Is there an algorithm on the table that does efficient approximate marginal optimization that we havenât evaluated that you think has potential? This makes it sound like thereâs an algorithm somewhere thatâs just waiting for a programmer. Believe me, if itâs only programming, we can make it happen. The hard part is evaluation, and that requires serious statisticians, which are much harder to find.
Let me ask this another way. Do you care about the max marginal posterior modes in and of themselves, or is it just that youâll take them if you can get them fast. That is, letâs say we build GMO that runs about as fast as NUTS. Does that have any use? If not, then Iâm less keen on building a fast, but inferior inference algorithm. Not to say others canât do it or Iâll try to block it getting into Stan, just that Iâm not personally that keen on lossy approximations.
This is already possible at the C++ level and just needs the flag to be plumbed through the C++ to the services, and from there to all the interfaces. Anything that has to hit multiple repos like this takes a long time to propagate through.
Correct, because thatâs what everyone insisted on when we first built it. As I recall, it was a fight with @andrewgelman to get optimization in because he didnât like it. Matt and I just thought itâd be easy given that we had an objective function and gradients. I had to do a ton of extra work to turn those Jacobians off for the MLE calculations.
The sampling log density includes the log Jacobian adjustments for the inverse transforms for the constrained parameters. The optimization objective function is different in that it does not include the Jacobian adjustments for constrained parameters. This is so that it computes the MLE on the natural parameter scale with priors taken as penalty functions, not the posterior mode on the transformed scale with Jacobians included. Everyone insisted this was the right thing to do when I built it, even though I couldnât really follow what was going on at the time. Iâve since written a case study to try to explain how this all works in theory and in Stan.
Just a data point re the optimization / other algorithms thing â Iâm using optimization and would be unhappy to see it go, and it wouldnât exactly be pleasant from a backwards compatibility side of things â I coded up a framework for handling this (optional importance sampling etc), for instance.
Some thoughts after reading Bobâs post
- A specific version of GMO has been not-fast, but the loose definition of GMO includes INLA, which causes confusion almost every time GMO is discussed (it seems likely to be the case in this thread, too).
- Full Bayes vs approximative Bayes: I assume full Bayes here means that we can get arbitrary accuracy in integration over the full posterior. In MCMC this may mean infinite time, but itâs still usually called full Bayes. For a very large set of models, INLA is more accurate than the default 4000 draws by Stan. Can we then say that INLA is full Bayes? We donât get arbitrary accuracy, but we can get something which is for practical purposes better than full MCMC. Using IS with Laplace we can also get an arbitrary accuracy.
- INLA includes the nested LA, and for Stan we could just have first ILA. The nested part can be used to compute non-Gaussian marginals.
- The models where Laplace approximation works great are Gaussian latent variable models. Gaussian process is one example. When a Gaussian prior is combined with Gaussian likelihood we can integrate over the latent values exactly. When a Gaussian prior is combined with non-Gaussian log-concave likelihood, we can use Laplace approximation to integrate over the latent values usually more accurately than MCMC. It will be these special models, but the work the user makes is not different to writing models using ODEs, GPs, 1D integrator, etc.
- INLA (even without the nested part) is very accurate for a very large set of models. Adding it requires some work and there is some work slowly going on. It can be made to work without changes to the language. The changes in the language later may be useful to make it easier to use. We donât need to make partition of parameters (oh, I prefer to talk details of this live)
- So why add Laplace approximation if there is INLA software implementing that well already? INLA can currently handle a very large number of latent parameters, but only max 15-20 higher level parameters, while Stan could handle thousands of higher level parameters and there are plenty of that kind of models, too.
- Based on Matt Hoffmanâs recent work (an abstract and part of the poster online) it might be possible to automatically detect parts which have Gaussian latent variable model structure and could be integrated away using Laplace approximation, but that would be much after the specific Laplace approximation.
Yes, but did it work? Yes, but Did It Work?: Evaluating Variational Inference
There is pull request by @yuling, but there is some reason why itâs not yet merged.
We could add full covariance using linear response theory as shown in
Covariances, Robustness, and Variational Bayes
http://www.jmlr.org/papers/v19/17-670.html
with examples using Stan (but no pull request as far as I know)
Summary
- Laplace for marginalizing latent variables in Gaussian latent variable models with log-concave likelihoods usually has the accuracy of full Bayes. We have a diagnostic to check.
- Laplace for the joint posterior in general is not full Bayes, but it can have the accuracy of full Bayes. We have a diagnostic to check.
- Meanfield ADVI for the joint posterior in general is not full Bayes, but it can have the accuracy of full Bayes. We have a diagnostic to check.
- There are people who are happily using meanfield ADVI in Stan, but they seem to bit shy in reporting the use in public.
- There are some ways ADVI could be improved, but there is much uncertainty in which cases these would make a big difference, so this is clearly in the algorithmic research part.
Thanks, Akiâthat was really informative.
Iâll stop using the term âGMOâ, since I clearly donât understand what it means. I should just take it out of that writeupâI was just going on what Andrew wrote in the paper and hadnât realized it was a more general concept.
I didnât realize INLA was using gradients (the âGâ in GMO) and I had assumed that the nested MCMC expectation calculation was central to GMO, since thatâs what led Andrew to the idea in the first place from ADVI.
Cool. Iâm even more confused about what INLA is than I am about what GMO is. Dan tried to explain it to me, but it left me with more questions than answers. It sounded like it needed a lot of custom work per model and that it used quadrature for the nested integral. Thatâs about what I got out of it.
Iâm going to try to stop using âfull Bayesâ as itâs also confusing. One sense is whether the predictive inference is Bayesian (averages over posterior uncertainty), whether it be an approximate posterior or not. The second is whether itâll get arbitrary accuracy or not.
This second point is confounded by finite accuracy. Iâm used to thinking about things like variational inference where you wind up not getting very close to the true posterior (in a variational sense), especially in mean field versions.
Howâs the efficiency? That would be super simple to implement.
I have no idea what âILAâ refers to here (or if it was supposed to be âINLAâ).
Why is INLA limited in this way?
Also, is INLA an R package, a general technique, or what? Iâm really confused on all this nomenclature.
Thanks for reminding me. I honestly didnât realize that Yuling had produced a practicaly useful method. It just looked like a lot of dense math at StanCon! I looked at the paper you linked, which says:
If kË < 0.5, we can invoke the central limit theorem to suggest PSIS has a fast convergence rate. We conclude the variational approximation q is close enough to the true density.
Iâm not sure what âclose enoughâ means here. Is it some kind of variational measure like KL-divergence? Or is it going to say something about how accurate expectations are?
I guess I need to go read your other paper. Iâll put it on the queue after learning OCaml.
It wasnât passing tests until I restarted Travis yesterday. Also, Ben Bales has change requests that havenât been addressed. The PR is here:
Thatâd be great. Just to be clear, when you say âaccuracy of full Bayesâ, does that mean thereâs similar expected error in expectations given similar compute time budgets?
You mean on the marginal posteriors? Obviously thereâs no covariance (unless we start implementing Tamaraâs methods [Iâm assuming thatâs where your link led earlier]).
I assume they confided in you privately. Are you allowed to say what they are using it for? Otherwise, I have no idea what to do with this information. Lots of people are happily running BUGS in a single chain with no convergence diagnostics at all, but I hardly want to take that as a sign that we should be doing it!
It started as more narrow concept, but then extended and turned to be not well defined.
Andrewâs original GMO idea uses the gradient of the joint from autodiff, and then tries to do the marginalization without explicit form for the integral. INLA and LA can use the gradient of the marginal density directly as it has analytic form.
I think you got the most important parts. Below are combined answer to many later questions
- INLA is both and algorithm and software
- INLA is acronym for integrated nested Laplace approximation
- LA part uses Laplace approximation to compute marginal density, and we can also compute gradients of the marginal with respect to not marginalized parameters
- LA part requires custom work per model family, but that custom work covers very wide range of common models used e.g. in rstanarm and brms and which could be used also as building blocks in plain Stan. We could think these also as re-usable compound lpdf/lpmf functions
- I part uses integration to integrate over the not-marginalized parameters. INLA software uses central composite design (CCD) for this integration. CCD requires much less marginal density evaluations than MCMC, but doesnât work well beyond 15 dimensions. It doesnât work well beyond 15 dimensions, because it gets really difficult to find the typical set in higher dimensions. Stan would use MCMC instead (slower but scales to higher dimensions).
- N part uses nested Laplace approximation, also known as Tierney-Kadane-Laplace approximation to compute non-Gaussian marginals of the marginalized-parameters. Direct marginal from Laplace approximation would be Gaussian, but using another level of Laplace approximation it is possible to evaluate marginal density (usually in 1D) in a grid and then extrapolate rest. In many cases N part is not needed, but it is there for more difficult cases. We donät yet have plans for adding N part in Stan. Instead of nested Laplace, we could use importance sampling (and I have used this approach in GPstuff)
- ILA would be then integrated Laplace approximation
Even the mean field VI can sometimes get very close (in the sense of able to get arbitrary accuracy with importance sampling correction)
The diagnostic will tell the approximate convergence rate with increasing number of draws if the approximation is used as an importance sampling proposal distribution. We can estimate Monte Carlo error and how many more draws we need for the desired accuracy. Monte Carlo error estimates are described in âPareto smoothed importance samplingâ arXiv preprint arXiv:1507.02646. and they have been used in loo package since version 2.0
The cases where we would like to use Laplace approximation, we would expect to get similar error in less time. Of course this will hold only for the certain set of models, which just happens to be very popular.
In those cases where mean field it is accurate enough, we can get the covariance using importance sampling. Tamaraâs method would be great help in cases where importance sampling with mean field proposal fails. And after using Tamaraâs method we would check the improved approximation again with importance sampling diagnostic.
I know they are using it for that kind of models where they know that the posterior is simple and there is a lot of data. I can ask more, but I expect the answer to be vague as they are people working in companies.
Thanksâthat clarifies a lot.
OK. Thatâs the part I still donât understand properly.
That would presumably be when thereâs not a lot of posterior correlation. Otherwise, in high dimensions, the draws to importance weight wonât be close enough.
Thatâs fantastic. And also thanks for explaining it in terms I can understand.
Thatâs neat, too. Could we also use dense ADVI (I think itâs called full rank or something in the interface)?
What happens with things like varying curvature in the posterior?
And thanks so much again for taking the time to lay down simple explanations of all this. Itâs really helpful.
The nested Laplace part is the most complex in INLA, and it includes several shortcuts and approximations to get a significant speed-up but maintaining a good accuracy compared to basic Tierney-Kadane-Laplace approximation. One thing is that it is practical only for improving 1D and 2D marginals. Other thing is that in many cases it is not needed.
It seems dense / full rank ADVI is much more difficult to get to converge (also noted by Giordano, Broderick, Jordan), and no-one seems to know why. Understanding of VI algorithms is still far from understanding of Laplace and MCMC.
Any Gaussian approximation will fail at least when the number of dimensions increase.
No problem. Iâm happy if my explanations are helpful.
Yet, variational inference is the most popular thing out there because it can be thrown at big problems with neural network models. This suggests we should think about methods for approximately validating and fixing VI.