Algebraic solver problems (Laplace approximation in Stan)


The main efficiency gain is that theta|y is low-dimensional regardless of how high-dimensional x is, and it doesn’t have a funnel.

Right. So in the model with the Laplace approximation, we have:

parameters {
  real<lower = 0> sigma;
}

whereas in the original model, we have:

parameters {
  vector[M] group_mean;
  real<lower = 0> sigma;
}

So we’ve reduced the effective number of parameters in our model. This almost seems too good to be true.


No free lunches. The evaluation of the log posterior is more expensive (check out those determinants, which you don’t need to compute before collapsing the random effect out).
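To make the determinant cost concrete, here is a minimal sketch (a generic one-way normal hierarchy, not necessarily the exact model discussed here): after collapsing the group means out, y ~ N(0, sigma^2 I + tau^2 Z Z'), and every evaluation of that marginal log density involves a log-determinant.

```python
import numpy as np

# Minimal sketch (hypothetical one-way normal hierarchy): after collapsing
# the group means, y ~ N(0, sigma^2 I + tau^2 Z Z'). Each evaluation of this
# marginal log density needs a log-determinant, which is the extra per-step
# cost the marginalization route pays.
rng = np.random.default_rng(0)
N, M = 12, 3
groups = rng.integers(0, M, N)
Z = np.eye(M)[groups]                      # N x M group-membership matrix
sigma, tau = 1.0, 2.0
y = rng.normal(size=N)

cov = sigma**2 * np.eye(N) + tau**2 * Z @ Z.T
sign, logdet = np.linalg.slogdet(cov)
lp = -0.5 * (logdet + y @ np.linalg.solve(cov, y) + N * np.log(2 * np.pi))

# Matrix determinant lemma cross-check: the same log-determinant from an
# M x M computation, which is how the cost is kept manageable in practice.
counts = np.bincount(groups, minlength=M)
logdet_small = 2 * N * np.log(sigma) + np.sum(np.log1p(tau**2 * counts / sigma**2))
print(np.isclose(logdet, logdet_small))  # True
```

The M x M version of the determinant is the kind of structure-exploiting trick that decides whether the collapsed model is merely more expensive per evaluation or prohibitively so.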

But some lunches are over-priced.


Ok, so to check if I understand this correctly:
(1) the efficiency gain comes from reducing the dimension of the model parameter, which is crucial when x is high-dimensional. In addition, we’re also more likely to evade certain geometric pathologies such as tight funnel shapes.
(2) the trade-off is that evaluating the log posterior is more expensive.

Overall, we still expect a speed-up (provided the Laplace approximation is done efficiently). This is not too dissimilar to the mixed solver used for solving ODEs…


There’s also the issue that the approximation induces a bias in the corresponding posterior expectation estimates. For the circumstances mentioned above this bias is small and manageable, but it can be hard to quantify in general, hence the initial focus on specific models where the approximation is known to work well.


@bgoodri Rereading this, I’m unclear on what motivates your comment. Why would we expect the Newton iteration to succeed if the algebraic solver is working poorly?

I’m debating whether the problem at hand requires we debug the algebraic solver or implement a new feature.


In the situations in which one might want to use the Laplace Approximation, you may be able to achieve faster convergence with Newton.


In the cases we are interested in, the funnel is also flattened and then twisted, and in high dimensions this is nasty and means that HMC needs to take small steps. But the conditional part is almost Gaussian, and after integrating that out the remaining marginal is also much closer to Gaussian.


I’m not sure which conditional Aki is talking about. p(\theta \mid y) is not assumed to be almost Gaussian - only p(x \mid \theta,y).

The funnel isn’t evaded or flattened. The problem is changed into one that doesn’t have a funnel (or other pathologies).


Sorry for being so unclear that you completely misunderstood what I tried to say. I know that we completely agree on this; it’s just a problem with words now. The joint parameter space (without any marginalization) has the shape of a twisted, flattened funnel. Most of the funnels I have seen in the real world have a circular cross-section, but this one is flattened because of the correlations between latent values, and twisted because the correlations change; and instead of being a 3D object it’s in higher dimensions, and thinking of it twists your mind.

I completely agree, p(\theta \mid y) is not assumed to be Gaussian, but it is closer to Gaussian than the joint distribution (that’s what the marginalization does). I didn’t say that a Gaussian would be a good approximation for that part, too, but since it’s closer to Gaussian, HMC/NUTS or CCD work much more easily.
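A toy illustration of that geometry (Neal’s funnel, not the exact model in this thread): the joint p(v, x) has conditional scales that vary over orders of magnitude, which is exactly what forces HMC to take tiny steps, while the marginal of v is Gaussian by construction.

```python
import numpy as np

# Toy sketch (Neal's funnel, not this thread's model): v ~ N(0, 9) and
# x_i | v ~ N(0, exp(v)). The joint has a funnel; the marginal of v is
# exactly N(0, 9), i.e. what integrating the conditional part out buys you.
def joint_logp(v, x):
    # log p(v) + log p(x | v), dropping additive constants
    lp_v = -0.5 * v**2 / 9.0
    lp_x = -0.5 * (np.sum(x**2) / np.exp(v) + v * len(x))
    return lp_v + lp_x

# The conditional sd of each x_i changes by a factor of ~400 across typical
# values of v, so no single step size suits the whole joint space:
for v in (-6.0, 0.0, 6.0):
    print(v, np.sqrt(np.exp(v)))
```

In this toy case the marginal needs no approximation at all; in the latent Gaussian models under discussion it is merely much closer to Gaussian, which is already enough to help the sampler.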

And since we agree on this, if you find the above still confusing you can ignore it or just describe in your own words about the shape of the joint distribution and why it’s more difficult than in 8 schools example.


Thanks for the post! This is great!
I got a question about the Stan code: the laplace_precisions computed above in laplace_precisions = conditional_neg_hessian(conditional_mode, sigma, number_of_samples); is that H(\theta) or Q(\theta) + H(\theta)?


It’s Q + H. The Q bit is 1/sigma^2.
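As a concrete sketch of what Q + H looks like (under an assumed Poisson likelihood, not necessarily the model in the code above): if y_i ~ Poisson(exp(x_i)) with iid prior x_i ~ N(mu, sigma^2), the prior precision is Q = 1/sigma^2 and the negative Hessian of the log likelihood at the mode is H = exp(x_hat).

```python
import numpy as np

# Hedged sketch: assumes y_i ~ Poisson(exp(x_i)), x_i ~ N(mu, sigma^2).
# The Laplace precision at the conditional mode is Q + H.
def laplace_precision(x_hat, sigma):
    Q = 1.0 / sigma**2          # prior precision (the "1/sigma^2 bit")
    H = np.exp(x_hat)           # -d^2/dx^2 of the Poisson log lik y*x - exp(x)
    return Q + H

print(laplace_precision(np.array([0.0]), 2.0))   # [1.25]
```

The split matters because H alone is only the curvature of the likelihood; dropping Q would give the wrong Gaussian approximation whenever the prior is informative.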


@Daniel_Simpson Can you use the Laplace approximation to integrate the latent \nu = t_{\alpha,\beta}\left(y\right) out of the bivariate density derived in the appendix of the linked paper?


No idea sorry. I doubt it. I can’t see anything that it’s asymptotic in. (Something needs to go to infinity for the LA to work)


It’s probably worth tagging @avehtari in here. He may see something I didn’t.


It looks as if numerical integration of the bivariate density might be better. I will start a new thread about that.


That would be my guess, too.


I ran some tests, varying the dimension of the latent Gaussian variable. For M = 10, specifying the full model performs much better than using a Laplace approximation (with a crude Newton solver doing better than the Powell/dogleg method). When scaling up to M = 100, the regular model still does best, Powell is slow, and Newton is unstable. For M = 500, the regular model continues to do well, while both Powell and Newton are unstable. I can share more details if anyone is interested.

My guess is that for the Laplace approximation to yield a gain in efficiency, the latent Gaussian must be high-dimensional. But this demands a high-dimensional algebraic solver, which we don’t seem to have. I’m looking into gradient descent and quasi-Newton methods, but I wanted to see whether anyone had suggestions. INLA uses the former, right?
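For what it’s worth, a dedicated mode-finder can often exploit structure that a generic algebraic solver can’t. A hedged sketch (assuming y_i ~ Poisson(exp(x_i)) with x ~ N(0, sigma^2 I), which is not the exact model benchmarked above): the Hessian of the conditional log posterior is diagonal, so each Newton step is O(M) even for large M.

```python
import numpy as np

# Hedged sketch of a structured Newton mode-finder. Assumes
# y_i ~ Poisson(exp(x_i)), x ~ N(0, sigma^2 I), so the Hessian of the
# conditional log posterior is diagonal and each step is O(M).
def conditional_mode(y, sigma, iters=50, tol=1e-10):
    x = np.log1p(y.astype(float))            # data-driven initial guess
    Q = 1.0 / sigma**2
    for _ in range(iters):
        grad = y - np.exp(x) - Q * x         # d/dx of Poisson log lik + prior
        hess = -np.exp(x) - Q                # diagonal (negative) curvature
        step = -grad / hess
        x = x + step
        if np.max(np.abs(step)) < tol:
            break
    return x

rng = np.random.default_rng(1)
y = rng.poisson(3.0, size=500)
x_hat = conditional_mode(y, sigma=1.5)
# At the mode the gradient of the conditional log posterior vanishes:
print(np.max(np.abs(y - np.exp(x_hat) - x_hat / 1.5**2)) < 1e-8)  # True
```

The point is not this particular model, but that a general-purpose Powell/dogleg solver forms and factors dense Jacobians, which is exactly what blows up at M = 500, whereas the Hessians in latent Gaussian models are typically sparse or diagonal-plus-low-rank.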


Aki and I are both there in 2 weeks. Let’s meet early in the piece. (I’m there from the 25th)


I’ll be there also from the 25th. Meanwhile, can you share your LGV model?


I’d guess this is initial conditions stuff?

If you’re not using results from the last HMC step as your guess for the new step, I’d be curious if things would work if you did. @bgoodri pointed out you can use static variables and C++ hackery to do that: Is it possible to access the iteration/step number inside a Stan program? . If you are doing this, I’m out of ideas :p.
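A generic illustration of the warm-start point (a sketch, not Stan’s C++ internals): Newton’s iteration count depends heavily on the initial guess, so reusing the previous HMC iteration’s solution as the starting point can cut the cost of each solve dramatically.

```python
import math

# Generic sketch: Newton's method on f(x) = exp(x) - 2 (root at log 2),
# comparing a cold start far from the root with a warm start near it.
def newton(f, fprime, x0, tol=1e-12, max_iter=200):
    x = x0
    for i in range(1, max_iter + 1):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            return x, i
    return x, max_iter

f = lambda x: math.exp(x) - 2.0
fp = lambda x: math.exp(x)

root_cold, n_cold = newton(f, fp, x0=20.0)   # cold start, far from the root
root_warm, n_warm = newton(f, fp, x0=0.7)    # warm start, near the root
print(n_warm < n_cold)  # True: the warm start converges in a handful of steps
```

In a sampler the previous draw is usually an excellent guess for the new conditional mode, which is why warm-starting is often the difference between a Newton solver looking "unstable" and looking fine.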

It’s my understanding that the problems of Newton iteration are all about how close you are to your solution (check out #2 and #4 here: