Integrated Laplace approximation roadmap

Tagging @avehtari, @anon75146577, @seantalts, @betanalpha.

Latent Gaussian models

Suppose we have a latent Gaussian model
\phi \sim \pi(\phi) \\ \theta \sim \mathrm{Normal}(\mu(\phi), \Sigma(\phi)) \\ y_j \sim \pi(y_j | \theta_i, \phi), \quad j \in g(i)

When doing a nested Laplace approximation to marginalize out \theta, we can show that \log p(\phi | y) \approx \log p(\phi) + \alpha (up to an additive constant), where:

\alpha = \log p(y | \theta^*, \phi) - \frac{1}{2} \left( \log \det \Sigma_\phi + \log \det H + [\theta^* - \mu(\phi)]^T \Sigma_\phi^{-1} [\theta^* - \mu(\phi)] \right)

where \theta^* is the mode of p(\theta | y, \phi), and H = -\nabla^2_\theta \log p(\theta | y, \phi) is the negative Hessian with respect to \theta, evaluated at \theta^* (the negative sign makes H positive definite, so the log determinant is well defined).
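
For reference, here is a sketch of where \alpha comes from. The marginal posterior is

p(\phi | y) \propto p(\phi) \int p(y | \theta, \phi) \, p(\theta | \phi) \, d\theta

and Laplace's method approximates the integrand by a Gaussian centered at \theta^* with precision H, so that, with d the dimension of \theta,

\int p(y | \theta, \phi) \, p(\theta | \phi) \, d\theta \approx p(y | \theta^*, \phi) \, p(\theta^* | \phi) \, (2\pi)^{d/2} (\det H)^{-1/2}

Taking the log and expanding the normal density p(\theta^* | \phi) recovers \alpha; the (2\pi)^{d/2} factors cancel.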

Function specification

The mode is found using an algebraic solver, and the Hessian is analytically known for certain observational models, notably Poisson, normal, binomial, and negative binomial. For these cases, we can write a Stan function that automates the calculation of \alpha. A Stan model might then look like this:

functions {
  // functors that return individual elements of the mean vector / covariance matrix
  real mu_phi(real phi, real y, int i) { ... }
  real sigma_phi(real phi, real y, int i, int j) { ... }
}

data {
  int M;         // number of groups
  int N;         // number of observations
  int y[N];      // or real y[N], depending on the observational model
  int index[N];  // group for each observation
}

parameters {
  real phi;
}

transformed parameters {
  // vector mu_phi = ...;
  // matrix sigma_phi = ...;
  static vector[M] theta_0 = rep_vector(0, M);  // static is a proposed extension, not current Stan
}

model {
  phi ~ p(.);  // placeholder prior on phi

  target += laplace_lgp_poisson(theta_0, mu_phi, sigma_phi, phi, y, index);
  // or
  target += laplace_lgp_binomial(theta_0, mu_phi, sigma_phi, phi, y, index);
}

Remark 1: the laplace functions take functors as arguments, so they're a bit of a pain to implement. It might be worth waiting for Stan 3 and using lambda functions.

Remark 2: sigma_phi and mu_phi are functors that compute individual elements of the covariance matrix and mean vector. This allows for more localization, improves memory usage, and generates smaller autodiff trees. See the function spec for Gaussian processes.

Remark 3: in many cases, \Sigma_\phi is sparse.

Finding the curvature

We would like to generalize to a broader class of likelihoods. The issue is computing H = \Sigma_\phi^{-1} - \nabla^2_\theta \log p(y | \theta, \phi). Automatically computing the Hessian of an arbitrary function can be costly, though it is feasible. In practice (according to Dan), modelers work out the Hessian and provide it. We could allow users to specify the Hessian and pass it to laplace_lgp_general, as sketched below.
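
To make this concrete, here is a hypothetical sketch of what a user-supplied Hessian could look like in the Stan language, using a Poisson likelihood with a log link purely as an illustration (the function name log_lik_hessian, its signature, and the idea of passing it as a functor to laplace_lgp_general are illustrative, not a settled design):

functions {
  // illustrative: Hessian of log p(y | theta, phi) with respect to theta.
  // For a Poisson likelihood with a log link, the Hessian is diagonal, with
  // entry i equal to minus the number of observations in group i times exp(theta[i]).
  matrix log_lik_hessian(vector theta, real phi, int[] y, int[] index) {
    int M = rows(theta);
    int N = num_elements(y);
    vector[M] h_diag = rep_vector(0, M);
    for (j in 1:N)
      h_diag[index[j]] += -exp(theta[index[j]]);
    return diag_matrix(h_diag);
  }
}

// in the model block:
// target += laplace_lgp_general(theta_0, mu_phi, sigma_phi, log_lik_hessian, phi, y, index);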

Finding the mode

Finding the mode is a high-dimensional root-finding problem. Since the problem is (approximately) convex, a candidate approach is to use a Newton solver, in particular the KINSOL implementation (see the earlier post). We also want to consider parallelizing the solver, which can be done separately from the autodiff calculation.
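
For reference, the Newton iteration for this problem, written in the notation of the first post (a sketch; H_k is the negative Hessian defined above, evaluated at the current iterate):

\theta_{k+1} = \theta_k + H_k^{-1} \nabla_\theta \log p(\theta_k | y, \phi), \quad H_k = \Sigma_\phi^{-1} - \nabla^2_\theta \log p(y | \theta_k, \phi)

The iteration stops once the chosen convergence criterion is met.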

We also want to use an initial guess \theta_0 (theta_0 in the code), based on solutions from previous iterations, to gain a speedup. Sean and I discussed this a little bit. Note that this is applicable to any problem that uses the solver. Basically, theta_0 is declared as a static variable and persists across iterations. This could be a reasonable scheme if the leapfrog steps are small enough when simulating Hamiltonian trajectories.

static vector[M] theta_0 = rep_vector(0, M);
theta_0 = algebra_solver(system, theta_0, parm, x_r, x_i);

See link. To safeguard against misuse, we could accept only static variables as arguments for certain functions (e.g. initial guesses).


What if there's more than one system being solved in the same model? For instance, let's say we need to do a solve per patient in a pharma model. How are you going to associate this variable with the proper problem to be solved?

I don't know of a way to enforce that in C++.

What if there's more than one system being solved in the same model?

Presumably, you use a different initial guess for each system. So I might declare a static matrix with each row corresponding to a patient.
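
In the proposed syntax, that could look something like the sketch below (hypothetical: static is not a Stan keyword today, and n_patients, K, system, parm, x_r, and x_i are illustrative names):

// persistent per-patient initial guesses, one row per patient
static matrix[n_patients, K] theta_0 = rep_matrix(0, n_patients, K);
for (i in 1:n_patients)
  theta_0[i] = algebra_solver(system, theta_0[i]', parm[i], x_r[i], x_i[i])';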

I don't know of a way to enforce that in C++.

Not sure either, I need to think about this.

Is the idea that this static thing is part of the Stan language? Otherwise, I don't see how to roll this into the C++ API without something like an open-ended map.

Some updates: over the past two weeks (beginning of the glorious Helsinki post-qual era), I've developed several testing frameworks and improvements for the prototype nested Laplace approximation.

The "good" news: for a latent Gaussian variable with dimension 500, the old prototype takes ~143 seconds to evaluate and differentiate the log density. The new prototype takes ~0.5 seconds. With this improvement, we should have a clear case where nested Laplace + NUTS is much more efficient than plain NUTS.

The "bad" news: it's still slow. @avehtari tells me it can be an order of magnitude faster.

I'm looking for some pointers on how to progress from here. The current approach is described in this updated math appendix. The updates from the old to the current prototype are mainly due to:

  • Lemma 5, which improves the calculation of the log density.
  • Using nested forward-mode autodiff inside the Newton solver.
  • Precomputing the precision matrix when using the Newton solver.

The relevant files are:

  • the specialized Newton solver (solver and system file).
  • the test file, which includes the computation of the log density.

A few things I have not tapped into yet (either because I'm not sure how best to do it or because it would take a long time to implement):

  • kernel tricks for Gaussian processes
  • Cholesky decompositions to manipulate the covariance matrix
  • persistent initial guesses
  • sparse matrices

Tagging @anon75146577 and @betanalpha.


Can you try coding up persistent initial guesses and seeing how much that helps for a couple of models? I think dangling performance improvements in front of us would help motivate figuring out how to do that sooner :) [edit] To do this, you might just try hand-editing the generated .hpp file to add a static variable to keep track of the initial guesses.


Can you try coding up persistent initial guesses and seeing how much that helps for a couple of models?

A more immediate test would be to measure the impact of a "good" initial guess on the solver. So, start with a value of the global parameter \phi_1 and solve the optimization problem to get \theta^*_1. Then sample a new value \phi_2 and use \theta^*_1 as an initial guess. I did some preliminary tests a while ago and got the number of iterations in the solver to go from 15 to 3, but the values of \phi were all within the typical set.

I do wonder if one good initial guess is sufficient. Say you get \theta^* using the initial value of \phi. If that's "good enough", then you wouldn't need the persistent initial guess. But apparently, the latter plays a big role in INLA. I'll try it out and share the results.

All of these attempts scare me if I think about how this should work with threaded code... We probably need to consider that once we decide to apply this approach.

Which attempts? Are you talking about the static variable? Or are there more components you find concerning?

The math appendix does not open (at least for me).

The static variable will kill you if this thing runs threaded... that scares me.

"Kill" meaning that making it a thread-local static would kill performance? I think we'd need to benchmark that, and if it's not worth it in a threaded context, not use that variable in that context, or something, right?

You need to be careful when threading is on. A static thread-local variable also won't do if we go with the TBB at some point. The function must be reentrant-safe... that's the right term, I think.

Okay, no one is putting anything definite in, but testing performance improvements with a static local variable gives us an upper bound on the performance gain, which we can use to figure out the ROI of an investigation into making this work.

My bad. The address I gave links to a private repo.

Here's a link to a public repo: https://github.com/charlesm93/presentations/blob/master/Laplace_approximation/MathAppendix.pdf


Another update.

1. Further speedup

I found the missing order of magnitude. There were several steps towards it.

  1. There is a specialized method, described in Rasmussen and Williams' Gaussian Processes for Machine Learning, to compute the marginal density and differentiate it: algorithms 3.1 and 5.1. This is the method implemented in GPstuff. It is very clever and allows us to pool computation from the Newton iterations and the differentiation. But it is a huge pain to code and debug.
  2. The convergence criterion for the Newton solver: in my first implementation I used the norm of the gradient of the objective function. This criterion is strict and costly, since the norm needs to be computed at each step. Alternatively, one can look at the difference in the objective function, which already gets computed in the Newton iteration (so you get the check for free). Problem: it's not as precise. In my experiments we still get the answer, but we need to work out how well this generalizes. Further diagnostics can be implemented. (Both criteria are written out after this list.)
  3. Last but not least: clever manipulation of matrices, e.g. lazy evaluation, casting matrices as lower triangular, etc. I find that using Eigen's code directly instead of Stan's wrappers is sometimes more efficient.
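
For concreteness, writing f(\theta) = \log p(\theta | y, \phi) for the objective, the two stopping rules discussed in item 2 are (the thresholds \epsilon are tuning parameters):

\| \nabla_\theta f(\theta_k) \| < \epsilon_{\mathrm{grad}} \quad \text{(gradient-based)} \\
| f(\theta_{k+1}) - f(\theta_k) | < \epsilon_{\mathrm{obj}} \quad \text{(objective-based)}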

For a system with 500 latent Gaussian parameters, evaluation and differentiation cost ~0.13 seconds. The exact time depends on the specifics of the data and the covariance function. I started experimenting with a Bernoulli logistic likelihood and a squared exponential kernel.

2. API

The implementation works for all covariance functions and likelihoods. The C++ function takes in structures that specify the covariance function and the likelihood. The former is differentiated via forward-mode autodiff. For the log likelihood, I hand-coded the first-, second-, and third-order derivatives required by the algorithm (another massive pain). The current prototype for the Stan language is:

target += laplace_marginal_bernoulli(theta_0, phi, x, n, y, K, tolerance, max_num_steps);

where

  • theta_0 is the initial guess for the mode
  • phi the global parameters
  • x the data input for the covariance function
  • n, y the sufficient statistics for the likelihood
  • K the covariance functor
  • tolerance, max_num_steps the tuning parameters for the Newton solver.
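
To make the proposed interface concrete, here is a sketch of how it could look in a Stan program, with a squared exponential kernel and the Bernoulli logistic likelihood mentioned above (the ability to pass a user-defined covariance functor is part of the proposal, not current Stan, and the priors, data names, and tuning values are purely illustrative):

functions {
  // covariance functor: builds K from the global parameters phi = (alpha, rho)
  matrix K_functor(real[] phi, real[] x) {
    return cov_exp_quad(x, phi[1], phi[2]);
  }
}

data {
  int N;
  real x[N];           // inputs for the covariance function
  int n[N];            // number of trials per observation
  int y[N];            // number of successes per observation
  vector[N] theta_0;   // initial guess for the mode
}

parameters {
  real<lower=0> alpha;  // marginal standard deviation
  real<lower=0> rho;    // length scale
}

model {
  real phi[2];
  phi[1] = alpha;
  phi[2] = rho;
  alpha ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  target += laplace_marginal_bernoulli(theta_0, phi, x, n, y, K_functor,
                                       1e-6, 100);  // tolerance, max_num_steps
}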

Next on the agenda

Testing, testing, testing. The hardest part is testing the gradient. In the examples we care about, it's hard to get analytical derivatives. Right now, I benchmark against finite differences and my first implementation. They all agree, but there are some discrepancies (1-10%). So, is the approximation good enough?

The Newton step can be improved, for instance with backtracking, checking that the mode is well approximated, etc. The KINSOL solver has a ton of fantastic features, some of which might carry over (a byproduct of this work is that we'll have a great new algebraic solver).

So far, none of what was done has required substantial changes to Stan, and we still get a decent algorithm. I'll test it for model fitting.

A more complicated feature is the persistent initial guess.

Per @avehtari's suggestion, it would be good to have taping when differentiating the covariance matrix (in particular, by re-expressing it in terms of the covariance function). Right now, the generated expression tree is very redundant.


I think this should have cov mentioned in the name, as we might also want a function that takes a precision function argument.


Would it make sense to use the cheap one until convergence is flagged and then swap to the expensive one until it is also satisfied?

Edit: can you point to the git branch? I'm excited to look at the code!


I think this should have cov mentioned in the name, as we might also want a function that takes a precision function argument.

Ideally we would overload the function, though I'm not exactly sure how the compiler could distinguish between a covariance and a precision argument.

Would it make sense to use the cheap one until convergence is flagged and then swap to the expensive one until it is also satisfied?

Yes, I think this is the way to go. However, it is unclear how much this affects the gradient. In the example I ran, using either \theta^*, the "exact" mode found by KINSOL, or \theta^l, the approximate mode, did not change the gradient.

Also, getting that additional precision can be costly. In the logistic example, the Newton iteration fails to converge to the mode when the convergence criterion is |\nabla f| < \epsilon; rather, it dances around the mode. You need an adaptive step size to solve this. The problem is wasting computational power for a marginal gain. What we need is a convergence criterion based on the final gradient (w.r.t. the global parameters). I'll try to work that out.

Edit: can you point to the git branch? I'm excited to look at the code!

Here's the relevant file. This is still in prototype mode, so not quite what the final PR will look like.

Hi all,
Some more updates.

Prototype in Stan

I'm working with a prototype and getting nice improvements over Stan's HMC (an order of magnitude for a GP problem with dimension 1000, more or less depending on which parameters you look at) and consistent posteriors. I'd like to expand the experiment, and I'd love a case where a non-centered parameterization doesn't alleviate the hard geometry.

It'd be good to get a series of interesting examples and have GPstuff (plus finite differences, maybe) as a benchmark for the unit tests.

API

We need to further hash out the API. First of all:

  1. The signature of the covariance function: should it be matrix f(real[] phi, real[] data, int[] data_int), where phi is used to pass the parameters? This is very general (see the algebraic and ODE solvers), albeit difficult to manipulate. (A sketch of a kernel written against this signature follows the list.)
  2. Which tuning parameters does the user control: the threshold on the change in the log density (tol_dens), the maximum number of iterations (max_steps), and the threshold on the gradient of the log density (tol_grad). The latter constitutes the more robust convergence criterion. That said, I'm inclined to have an option where the latter is not used.
  3. Name of the function. I like laplace_marginal_logistic, but I could also roll with laplace_marginal_cov_logistic.
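
As a concrete illustration of the signature in item 1, a squared exponential kernel written against that interface might look like the sketch below (using x_r and x_i for the packed arguments, following the algebraic and ODE solver convention; the packing choices here are illustrative, not a settled design):

functions {
  // squared exponential kernel written against the general packed-argument signature;
  // phi = (alpha, rho), x_r holds the input locations, x_i is unused here
  matrix cov_fn(real[] phi, real[] x_r, int[] x_i) {
    return cov_exp_quad(x_r, phi[1], phi[2]);
  }
}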

Edit: let's not rush to conclusions about the convergence criterion. I'm trying to work out something elegant that controls changes in \nabla_\phi \log p(y | \phi).
