Algebraic sovler problems (Laplace approximation in Stan)


#21

Wait… if that Hessian isn’t the hessian of the gradient then the code is
wrong!


#22

Right. I finally got back to my computer. Thanks Ben!

It definitely helps to have the correct Hessian!

I think your Newton-Rhapson shows that this should work, but it still never moves when the algebraic solver is used…

The INLA output, which is probably correct is

r$summary.random$index
ID mean sd 0.025quant 0.5quant 0.975quant mode kld
1 1 -3.0566021648 1.38773127 -6.3399283 -2.85544814 -0.8997090 -2.4983098 8.125335e-05
2 2 -2.8902537155 1.41232884 -6.2281794 -2.68680093 -0.6901427 -2.3249838 7.616223e-05
3 3 1.9670245034 0.11260444 1.7389835 1.96948205 2.1812593 1.9743584 2.714074e-07
4 4 2.0020279732 0.12229802 1.7537683 2.00491725 2.2340800 2.0106482 3.824247e-07
5 5 -0.0005581981 0.34919848 -0.7425281 0.01950996 0.6304028 0.0602952 2.831799e-06
6 6 -3.0566021648 1.38773127 -6.3399283 -2.85544814 -0.8997090 -2.4983098 8.125335e-05
7 7 -2.0366284261 0.63407569 -3.4280091 -1.98238936 -0.9408685 -1.8702534 1.202437e-05
8 8 2.3776197671 0.09176133 2.1927715 2.37926020 2.5532142 2.3825157 1.460870e-07
9 9 3.1615705796 0.10281994 2.9538517 3.16362535 3.3577235 3.1677038 3.680286e-07
10 10 -0.3121380170 0.34805685 -1.0516231 -0.29216134 0.3168289 -0.2515794 3.317243e-06


The 5/50/95% quantiles of sigma are (1.6,2.5,4.5).  (i'm only a little bit sure i'm using teh same priors...)

So ben's model seems fairly correct.

#23

Meant to say that the Newton iteration wasn’t behaving super well either. This seems like it’d be a really easy optimization problem, but in a couple cases I had to knock up the number of Newton iterations to 20 or higher, which I think is really large for such a small problem which such a good guess? Might be worth plotting the functions to be minimized w/ the actual data in there, but it still seems like this should be easy.

I could just be wrong about how many Newton iters is reasonable.

How many iterations I needed was a function of the initial conditions too (I tried either all zeros or all ones).


#24

I’m not surprised that this is super sensitive to the initial conditions!

What may be happening is that at every step in the symplectic integrator you’re computing the gradient of some new target density due to the changes in the Laplace approximation after every update. Consequently you don’t have a consistent Hamiltonian and you don’t get the nice scaling of Hamiltonian Monte Carlo. The problems should manifest in similar ways to stochastic gradient HMC methods.

One way to get at more information is to run CmdStan with

./model sample save_warmup=1 output diagnostic_file=diag.csv

This will save all of the warmup iterations (including the variation in adapted step size!) and diag.csv will include the unconstrained parameter values along with their momenta and gradients which can be extremely informative of problems.

One thing to look out for is the average accept rate not achieving the desired target, or smaller step sizes during adaptation not leading to higher acceptance probabilities. That’s a key indicator of the approximate nature of the gradients being the root of these issues.


#25

I’m surprised that it would be sensitive. Similar model in GPstuff, with maximum of 40 Newton steps, but usually less than 5 because the tolerance is 1e-4, works just fine with HMC and NUTS, and the results are not sensitive to initial values of the Newton iterations. Laplace+HMC for GP+Poisson/Negbin are reported in http://becs.aalto.fi/bml/pdf/vanhatalo_et_al_2010.pdf which includes also the stable analytic gradient equations in the appendix. NUTS results are not reported, but we haven’t noticed any problems when we switched from plain HMC to NUTS. We have several other papers with experiments using Laplace+HMC for GP + Bernoulli, Cox-PH, logistic density estimate, and no problems wit sensitivity unless we constrain the number of Newton iterations to be really small.


#26

Risking showing my terrible calculus skills, if
f(x,s)=const -n*exp(x)-x*exp(s)
(Which is our case in an unconstrained parameterisation) then

fx = -n*exp(x) - exp(s)

fs = -x*exp(s)

So if f(h(x),s)=0 then

dh/ds = x*exp(x)/(n*exp(x)+exp(s))

Which isn’t so ugly. With normal priors on x and s this should be bounded.

I re-ran this model (with a normal(0,2) prior on s). It failed badly.

The only value of the parameter is s=0 (I set the init). The gradients are g_s = =+/- 2195. This seems large.

I’ll see if I can calculate the actual gradient later. It’s doable by hand - the solution to the root finding can be expressed as a LambertW function.


#27

I’m not going to lie. This works much better when you remember that a Normal density looks like.

target += -0.5*laplace_precisions[index[i]]*(y[i]- conditional_mode[index[i]])^2 + 0.5*log(laplace_precisions[index[i]];

It still doesn’t work though. But the gradients are smaller.


#28

It probably also works if you do the approximation correctly. I’ll see if i can fix it in the morning.

long story short: the target should be something like

poisson_log_lpmf(y | conditional_mode) + normal_lpdf(conditional_mode| 0, sigma) - normal_lpdf(conditional_mode | conditional_mode, 1/sqrt(precision_laplace)

so you could say that I missed most of the approximation.


#29

Something does not look right here and it isn’t just that we have an inv_sqrt function to turn a precision into a standard deviation.


#30

Ok. We fixed it. There were two things:

  • We had the wrong target (which is an issue!!)
  • The algebraic solver is quite flakey when it comes to actually solving things.

To do these in order:
1: We had the wrong target

Yes, but it is. That term only contributes a log-determinant. I wrote out the density explicitly for the code below so it looks less weird.

I guess the story is that even though I should know this stuff backwards, I apparently shouldn’t code it quickly!

2: The algebraic solver is flaky.

Instead of initialising to zero, I initialised to a point near the maximum likelihood point. In particular, I took xzero = log( (sums + 0.1) ./ number_of_samples);

It seems that this leads to a stable algebraic solver, at which point everything works. With a bad luck initialisation, you sometimes see one failure, but otherwise it’s fine.

Some things we saw in experiments:

  • The performance depends on sigma. This is worse when you multiply the gradient through by sigma^2. Why?

    • Case 1 (original gradient): The Hessian grows large when sigma gets near zero. This is an area well supported by the prior. The algebraic solver should (and usually does) rein this in and no problems occur. The solver has failed at most once in our tests with the above initialisation.
    • Case 2 (scaled gradient): In this case, the Hessian gets small when sigma is small, which doesn’t allow the optimizer to take large steps. This means it tends to run into the maximum number of steps boundary. We see this happen 5-10 times with the above initialisation.
  • We are sensitive to initialisation for the solve! This is a little surprising because both the function being optimized AND it’s derivative are monotone. This should, in the presence of step-size control, make this thing work. It doesn’t.

    • It almost always fails when initialised to zero. This is a bit weird - it’s not a strange point in the space.
    • It almost always succeeds when initialised as above. With this data those points are -4.61 -4.38 1.97 2.01 0.01 -4.61 -2.15 2.38 3.17 -0.31
      which aren’t really that far from zero.
    • It almost always succeeds (maybe a 1 or 2 failures) if initialised at a +/- 1.0 where the sign is chosen to be consistent with the above.
    • It fails maybe 5-10 times if initialised at +/- 0.1 with the sign chosen the same way.
    • EDIT: If we initialise to the log observed mean, then it also works (modulo occasionally seeing 1 failure).
    • EDIT2: When calculated appropriately, the log observed mean still seems to be a good starting point even when there are unobserved categories.

The output looks to be about correct (roughly consistent with INLA)

Inference for Stan model: output.
1 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=1000.

       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
sigma  2.73    0.04 0.69  1.62  2.23  2.66  3.10  4.27   351    1
x[1]  -3.16    0.04 1.37 -6.05 -4.04 -3.13 -2.17 -0.73  1000    1
x[2]  -2.96    0.05 1.42 -5.90 -3.89 -2.86 -1.99 -0.29   921    1
x[3]   1.97    0.00 0.11  1.74  1.89  1.97  2.05  2.20   865    1
x[4]   2.00    0.00 0.12  1.76  1.91  2.00  2.08  2.24   891    1
x[5]   0.02    0.01 0.34 -0.65 -0.22  0.01  0.25  0.69   825    1
x[6]  -3.12    0.05 1.35 -6.03 -3.95 -3.04 -2.19 -0.64   827    1
x[7]  -2.08    0.02 0.66 -3.39 -2.52 -2.07 -1.62 -0.89  1000    1
x[8]   2.37    0.00 0.09  2.19  2.31  2.37  2.43  2.54  1000    1
x[9]   3.16    0.00 0.11  2.96  3.09  3.16  3.23  3.38   965    1
x[10] -0.30    0.01 0.36 -0.95 -0.55 -0.32 -0.04  0.40  1000    1

Samples were drawn using NUTS(diag_e) at Fri Oct 13 14:28:03 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Now, here’s the rub. This is a fairly easy problem so it was easy to get the initial values close to the conditional mode. How easy this is in general is yet to be seen. Edit: Actually, if the log-sample mean is good enough, then this might not be too bad…

The working code is below. Thanks everyone for your help. Any further comments would be definitely appreciated!

Edit: Simpler initialisation to the log sample mean.

functions {
  vector conditional_grad(vector x, vector sigma, real[] number_of_samples, int[] sums) {
    vector[dims(x)[1]] result;
    result = (to_vector(sums)-to_vector(number_of_samples).*exp(x)) - x/sigma[1]^2;
    return result;
  }
  vector conditional_neg_hessian(vector x,  real sigma, real[] number_of_samples) {
    vector[dims(x)[1]] result;
    result = to_vector(number_of_samples).*exp(x) + 1/sigma^2;
    return result;
  }
}
data {
  int N;
  int M;
  int y[N];
  int<lower=1, upper=M> index[N];
}
transformed data {
  vector[M] xzero = rep_vector(0.0, M);
  real number_of_samples[M];
  int sums[M];
  for (j in 1:M) {
    sums[j] = 0;
    number_of_samples[j]=0.0;
  }
  for (i in 1:N) {
    sums[index[i]] += y[i];
    number_of_samples[index[i]] +=1.0;

  }

 // xzero = log((to_vector(sums) + 0.1) ./ to_vector(number_of_samples));
{ // Beware of empty categories!!!!!!
    int tmp = M;
    real summm=0.0;
    for (i in 1:M) {
      if(number_of_samples[i]==0){
        tmp = tmp-1;
      } else {
        summm = summm + sums[i]/number_of_samples[i];
      }
    }
    xzero = rep_vector(summm/tmp,M);
  }


}
parameters {
  //vector[M] group_mean;
  real<lower=0> sigma;
}
transformed parameters {
  vector[1] sigma_tmp;
  vector[M] conditional_mode;
  sigma_tmp[1] = sigma;
  conditional_mode = algebra_solver(conditional_grad, xzero, sigma_tmp, number_of_samples, sums );
}
model {
  vector[M] laplace_precisions;
  sigma ~ normal(0,2);
  laplace_precisions = conditional_neg_hessian(conditional_mode, sigma,number_of_samples);
  // p(y | x^*) p(x^* |sigma )/p(x^* | sigma, y)
  for (i in 1:N) {
    target += poisson_log_lpmf(y[i] | conditional_mode[index[i]]);
  }
  target += -0.5*dot_self(conditional_mode)/sigma^2 -M*log(sigma) - 0.5*sum(log(laplace_precisions));
}
generated quantities {
  vector[M] x;
  {
    vector[M] laplace_precisions = conditional_neg_hessian(conditional_mode, sigma,number_of_samples);
    for (i in 1:M) {
      x[i] = normal_rng(conditional_mode[i],inv_sqrt(laplace_precisions[i]));
    }
  }
}

#31

We should try to make the algebraic equations solver more stable, but in this situation where the gradient and Hessian are known analytically, would it be better to do Newton iteration even if the algebraic equations solver were perfect?


#32

Definitely. I think this is enough of a “proof of concept” that this type of thing can work within Stan. It was also fun to do a weird stress test on a new feature :p


#33

I think the solver is supposed to be an improvement on a Newton iteration (Algebra solver details?). It’s computing the Jacobian internally using autodiff I think.

Something that I think I’d like would be able to be able to program custom Jacobian’s so that the ODE solver n’ such didn’t have to compute them itself. edit: the Hessian of the likelihood term is the Jacobian of the optimization problem here (just to be clear on the words).

There are plenty of downsides to this though in terms of development/testing. Even if this could be exposed through a C++ hack though…


#34

While Dan was struggling with Stan and and algebraic solver in my office, I tested with similar model in GPstuff, Even Dan’s simplified example is apparently simple, it seems to be nasty. I’ve have never before seen similar random error as it’s also now sometimes giving in GPstuff. Most of the time it works and uses just 7 Newton iterations and a half-step and HMC/NUTS works just fine. With GP models I haven’t seen this error before (or I would have fixed it). Dan was experimenting also with Stan+GP+Poisson+Laplace, but it got stuck. I guess we’ll keep experimenting.

And as Dan also wrote, we tested this because solver is using an algorithm which is really close to Newton, and we hoped to illustrate that it would be relatively easy to have Laplace approximation (without need to autodiff through the Newton iterations)


#35

I think a lot of this is that you pre-multiplying the gradient by the covariance matrix in GPStuff. This makes the performance of the solver more sensitive to the value of the random effect standard deviation than it should be. (It makes sense in a GP context because it saves a linear solve for every gradient calculation, but in general it’s not a good idea.)

We can still use hte formula for the derivative at the mode to avoid autodiffing through the Newton.


#36

Hi,

I’ve been digging into this conversation and @Daniel_Simpson’s code, after realizing I was still pretty confused about this whole Laplace business.

The biggest source of confusion to me is the target we end up implementing:

Would you mind sharing the long story?

Also what exactly is the advantage of using the Laplace approximator in Stan? My guess is we’re gaining speed by only computing the posterior distribution for some parameters (here sigma) and using the posterior mode for other parameters (here group_mean). Ok…

But what about using a normal density instead of a log poisson? Is it faster to simulate samples when the likelihood has a normal density? Maybe this is simply a toy example, and the real problem is to approximate a more complex density with a normal density – thereby simplifying the posterior distribution.


#37

Ok. It goes something like this.

Consider a model like this:

\begin{align*} y &\sim p(y | x,\theta) \\ x \mid \theta &\sim N(0,Q(\theta)^{-1})\\ \theta &\sim \pi(\theta) \end{align*}

The most important thing to note is that, because p(x,y,\theta) = p(x \mid y,\theta)p(\theta \mid y ) p(y) it follows that

p(\theta \mid y ) = \frac{p(x,y,\theta)}{p(x \mid y, \theta) p(y)} \propto \frac{p(y\mid x,\theta)p(x \mid \theta) p(\theta)}{p(x \mid y, \theta)}.

This identity hold for every x. The only problem is that except in special cases (like a model with a Gaussianl likelihood), we don’t know p(x \mid y, \theta).

The trick that we use is to approximate the conditional p(x \mid \theta,y) by a Gaussian that matches the location and curvature at the mode. To do this we need to find

x^*(\theta) = \arg \max_x p(x \mid y,\theta)

and compute the Hessian of p(x \mid y,\theta) at x^*(\theta). A quick calculation shows that this Hessian is

Q(\theta) + H(\theta),

where H_{ij} = \frac{\partial^2}{\partial x_i \partial x_j} \log( p(x \mid y,\theta)). The Gaussian approximation is then

p(x \mid y,\theta) \approx N(x^*(\theta), (Q(\theta) + H(\theta))^{-1}).

We then use the above expression for p(\theta \mid y) evaluating the RHS at x=x^*(\theta) and get

p(\theta \mid y) \propto \left(\frac{|Q(\theta)|}{|Q(\theta) + H(\theta)|}\right)^{1/2} \exp\left(-\frac{1}{2}x^*(\theta)^TQ(\theta)x^*(\theta) +\log p(y \mid x^*(\theta),\theta) \right)\pi(\theta).

(This conversation is way easier to have with MathJax enabled!)
(Edit: Fixed the missing square root in the final equation)


#38

Additional comment to what Dan wrote: This works especially well when likelihood p_\theta(y|x, \theta) is smooth and strongly log-concave. This includes likelihoods of location parameter of most exponential family distributions, which means a bunch of very commonly used models. It can be useful also for other likelihoods, but that’s a longer story.


#39

@Daniel_Simpson Thanks for working out the details. I went through the calculations. It seems the first term should have a square root:

\left(\frac{|Q(\theta)|}{|Q(\theta) + H(\theta)|} \right)^{1/2}

which makes you pick up a 0.5 term when you write the log density in Stan. In your code you have it for the denominator but not for the nominator, which is simply coded as -M*log(sigma). Unless I’m missing something of course.

This works especially well when likelihood p_\theta(y | x, \theta) is smooth and strongly log-concave.

Is there a good reference that works these things out in detail? It seems doing this kind of approximation is pretty pitfall prone, especially if the tail of p(x | y, \theta) doesn’t decay like the tail of a Gaussian.

I’m still trying to understand where the efficiency gain comes from. My guess is the expression for the posterior p(\theta | y) with the approximation is much simpler than what we would otherwise have, so doing one iteration is relatively fast.


#40

Yes. Missed the sqrt. Will edit when I get a chance.

The canonical reference is Tierney and Kadane (1986). Also the INLA paper has a bit of a chat about it.

Approximation is quite stable. It relies on the conditional being almost Gaussian which happens in low and high data regimes.

Some likelihoods / sampling schemes that are less informative than a Gaussian (eg a Bernoulli ) has problems. But otherwise it’s pretty good.

The main efficiency gain is that theta|y is low dimensional regardless of how high dimensional x is and doesn’t have a funnel.