L-BFGS-B comment

optimization

#1

I’m not sure how many of you follow R-devel, but the conversation about optimrx included this link about “updating” L-BFGS. Has this, or something similar been done to the version ensconsed in Stan? Thanks.

http://users.eecs.northwestern.edu/~morales/PSfiles/acm-remark.pdf


#2

I don’t even know what R-devl is, but Stan doesn’t depend
on any R code so changes to R won’t affect our code base.

  • bob

#3

Understood. The point was that there was a correction made by Nocedal et al to the original BFGS code to prevent issues with machine precision, and I just wanted to let y’all know in case you were not aware.


#4

I wasn’t aware.

I’m cc-ing Marcus, who wrote our L-BFGS implementation,
in case he’s not on this new mailing list.

  • Bob

#5

Just to understand more of the Stan lib design and to make it easier to know where one can dig for improvement.

These questions reflect where I stand concerning my comprehension of the approach, maybe I am totally wrong.

1/ The only place where deterministic optimization is used is for log likelihood optimization ? (the “optimize” of CmdStan)
2/ In Stan all optimization problems are unconstrained because the same changes of variables that transforms the ELBO domain of definition to R^n, are used everywhere (“optimize”, “variational” and “sampling”) ?
3/ All in all, deterministic unconstrained optimization is not a critical part of Stan, because variational and sampling methods are much more important?
4/ ELBO maximization is performed by stochastic gradient and this part is more critical and difficult (having a general approach for auto tuning…) but yet there is still no need for constrained methods?

In other words, to contribute to Stan, you would be more happy with another stochastic gradient implementation than another deterministic one?

Vincent


#6

vincent-picaud Developer
October 10
Just to understand more of the Stan lib design and to make it easier to know where one can dig for improvement.

These questions reflect where I stand concerning my comprehension of the approach, maybe I am totally wrong.

1/ The only place where deterministic optimization is used is for log likelihood optimization ? (the “optimize” of CmdStan)

That’s where L-BFGS is applied. It’s deterministic
other than initialization.

2/ In Stan all optimization problems are unconstrained because the same changes of variables that transforms the ELBO domain of definition to R^n, are used everywhere (“optimize”, “variational” and “sampling”) ?

Not quite. You can define a constrained problem and you can
still optimize it, but it will require initialization within
support. This can sometimes work with sampling, but is more
stable with optimization.

3/ All in all, deterministic unconstrained optimization is not a critical part of Stan, because variational and sampling methods are much more important?

I think that depends on who you ask. I think most uses are
for sampling, second most uses for optimization and third most
for variational inference. I think this is largely due to our
users (mostly statisticians) and due to our lack of understanding
of variational inference.

The non-determinstic part of variational inference is in the
calculation of the gradient, not in mini-batching. That is, it’s
not a stochastic gradient.

4/ ELBO maximization is performed by stochastic gradient and this part is more critical and difficult (having a general approach for auto tuning…) but yet there is still no need for constrained methods?

Not in the release version of Stan. There was an experimental
version used for the paper, but I’m pretty sure the stochastic
version isn’t built into Stan or acccessible from the interfaces.

In other words, to contribute to Stan, you would be more happy with another stochastic gradient implementation than another deterministic one?

We’d be happy with anything that makes any of our systems faster
or more robust. Or adds new systems we haven’t thought about.

Andrew and Dustin are working on max marginal likelihood (which may
itself involve some stochastic components much like variational inference),
and Andrew and Aki and a whole crew of others are working on expectation
propagation, which is like variational inference, but optimizes the
reversed form of KL divergence. Neither of these have any Stan code
yet as far as I know—certainly nothing merged into the develop branch
of the repos.

  • Bob

#7

Many thanks for your detailed and quick answer Bob.

1/ OK

2/ I did not realize that, thank you.

3/
a/ I am really surprised of that. I mean I thought the user preference was Sampling or Variational and that point estimates like MAP were only here for convenience. I really like the Variational approach and the fact that it scales well with problem dimension (maybe because I have a deterministic background). Maybe it would be interesting to know which approach Stan users use the most.

b/ I understand that the randomness in the variational approach essentially comes from the Monte Carlo approximation of the involved integrals. But still this makes the objective function a random variable that needs “stochastic” optimization methods, is it right or I miss something?

Concerning Stan development, thank you for your detailed answer that tells us what is going on.

Concerning optimization, the Barzilai-Borwein methods (aka spectral gradient) are really competitive and easy to implement. I had some good experiences with them:

For an introduction:
http://www.math.ucla.edu/~wotaoyin/math164/slides/wotao_yin_optimization_lec09_Baizilai_Borwein_method.pdf

Some “stochastic” extensions are now emerging but I have not tested them:
https://arxiv.org/abs/1605.04131v2

I have not too much time to dive into Stan code, but I really like the project, I hope that my comments are not “noise” in this forum.

Vincent


#8

On the users list and in the interface documentation (at least in RStan’s) we’ve tried hard to emphasize that ADVI is experimental, so that probably contributes to why it’s not used as much as penalized MLE. I think there’s a lot of interest in ADVI, especially given how fast it is, but we’ve been warning people against using it for final inferences in general (which I think is warranted given what we’ve seen when testing it).


#9

If someone could forward me the details, I’d be happy to quickly take a
look at it. I’m not aware of any specific precision related updates at
this point.

Cheers,
Marcus


#10

The link to the PDF in the first post in the thread may be all you need, I think.

Avi


#11

Ah, sorry, I haven’t used the new forums yet and the link to the PDF didn’t
make it through to the email.

I took a quick look and about half of the changes there don’t apply to us
since we don’t use the bound constraints with L-BFGS-B . The other half
might be relevant, I need to figure out what the difference in machine
precision definitions are between C++ and Fortran. I’ll try to find time
to look at it in the next week or two.

Cheers,
Marcus


#12

If it would help to see revised Fortran code, it can be found as version 3 here (link warning :) ): http://users.iems.northwestern.edu/~nocedal/lbfgsb.html


#13

The linked update was to L-BFGS-B (the bounded constraint form)
and was from February 2011.

  • Bob

#14

Technically, our optimizer finds penalized maximum likelihood estimates.
They’re not MAP estimates because we throw away the Jacobian in the
implicit priors on the transformed parameters. We’ll probably be
adding a proper MAP estimator soon.

Andrew likes to think of all these methods as approximate Bayes.
The penalized max likelihood gives you a Laplace approximation from
which you can gauge uncertainty in the same way as variational
inference, with the difference being centering on the mode vs.
an approximate mean.

There are lots of cases where optimization won’t work in theory
(because there’s no MLE as in a typical hierarchical regression
model) but variational inference would work (because there
is a posterior mean).

I tend to think of “stochastic” methods as meaning ones that
stream over data, like stochastic gradient descent (which can
be deterministic if you don’t randomize mini-batches).

I’m not sure what you mean by needing stochastic optimization
methods. The ADVI paper also talks about using stochastic
variational inference (in the streaming data sense), hence my
confusion about what we’re talking about.

I didn’t go into details, but that BB method you cite sounds
like it has the same motivation as the L-BFGS method we currently
use, which also approximates inverse Hessian-vector products using
gradients without computing full Hessians.

If comments are truly noise, we just ignore them.

  • Bob

#15

Technically, our optimizer finds penalized maximum likelihood estimates.
They’re not MAP estimates because we throw away the Jacobian in the
implicit priors on the transformed parameters. We’ll probably be
adding a proper MAP estimator soon.

Thank you for the clarification. Things are clear now.

Andrew likes to think of all these methods as approximate Bayes.
The penalized max likelihood gives you a Laplace approximation from
which you can gauge uncertainty in the same way as variational
inference, with the difference being centering on the mode vs.
an approximate mean.

I like this too. I guess it is another place where 2nd order AutoDiff would be useful for Hessian computation.

I tend to think of “stochastic” methods as meaning ones that
stream over data, like stochastic gradient descent (which can
be deterministic if you don’t randomize mini-batches).

I do agree and explain the confusion. On my side I use “stochastic” as soon as the objective or a constraint is a random variable.

I didn’t go into details, but that BB method you cite sounds
like it has the same motivation as the L-BFGS method we currently
use, which also approximates inverse Hessian-vector products using
gradients without computing full Hessians.

I do agree about L-BFGS. The basic idea behind BB is a very crude approximation of the Hessian by \alpha.I_d where I_d is the identity matrix. This seems very brutal, but the method exhibits very fast convergence, comparable to the conjugate gradients method. Yet, it is not well understood theoritically, convergence is not monotone for instance. I always have been happy with this method. It is very cheap concerning memory and computation, easy to implement (at least for the convex case, for the nonconvex one a quite unusual non-monotone line search must be used to have some guarantees conserning convergence). For these reasons, personnaly it is the first method I test when facing a new optimization problem. I have no idea concerning the behavior/qualities of the cited “stochastic” version.

Vincent


#16

vincent-picaud Developer
October 11
Technically, our optimizer finds penalized maximum likelihood estimates.
They’re not MAP estimates because we throw away the Jacobian in the
implicit priors on the transformed parameters. We’ll probably be
adding a proper MAP estimator soon.

Thank you for the clarification. Things are clear now.

Andrew likes to think of all these methods as approximate Bayes.
The penalized max likelihood gives you a Laplace approximation from
which you can gauge uncertainty in the same way as variational
inference, with the difference being centering on the mode vs.
an approximate mean.

I like this too. I guess it is another place where 2nd order AutoDiff would be useful for Hessian computation.

Exactly. We’re using finite differences over the gradients until
we get our 2nd order autodiff more thoroughly tested.

And the reason we want to get a proper MAP (or “posterior mode”) estimator.
Should be trivial to plumb through. I only realized the difference
when working through MLE vs. MAP vs. Bayesian estimators in one of our
case studies. Michael’s great at teaching us all these things! I’m
just the note-taker and writer :-)

  • Bob

#17

I’ve checked and I’m confident that the issue raised in Nocedal’s note does
not affect the implementation in Stan. Specifically, we use
std::numeric_limits::epsilon whose definition matches the epsilon()
function in Fortran 90.

Also, the optimization routine seeks to find the value x* which maximizes
p(x) where p(x) is the posterior defined by the Stan model and x are the
parameters on the constrained scale. So in the sense that the optimizer
is maximizing the posterior, it is indeed trying to find a Maximum
A-Posteriori estimate of the parameters. However, MAP is a poorly defined
concept for a lot of reasons and the dropping of the Jacobian terms for
optimization is because non-linear reparameterizations of a posterior will
change where the MAP solution is. If y are the unconstrained parameters
and x = f(y), keeping the Jacobian term would be equivalent to finding the
MAP estimate of p(y). But x* != f(y*) so a decision has to be made. I
think optimizing on the constrained scale is the option of "least surprise"
for users, particularly because Stan was designed to keep the constrained
<-> unconstrained transforms hidden from the users and, theoretically, if
we found that a different transform worked better at some later date we
could change it.

Anyway, that’s just a long way of saying that optimization in Stan does
compute a MAP estimate (modulo its ability to actually find the global
maximum). It’s just that a MAP estimate is a somewhat ambiguous concept.

Cheers,
Marcus


#18

@Marcus_Brubaker
Just to be sure that I have understood you well, you say that it is more natural and safe to optimize parameters in their initial domain (the constrained one): for instance max log p(x|mu,sigma), with sigma>0 ?


#19

Given a posterior density p(theta | y), doesn’t everyone take
the MAP to be:

theta* = ARGMAX_theta p(theta | y)

To compute that, wouldn’t we need the Jacobian term? And yes,
the mode is going to be sensitive to parameterization and that’s
the problem Michael keeps pointing out with MAP.

For example, if we declare

real<lower=0, upper=1> phi;

then the Jacobian is what gives us the uniform(0, 1) distribution
on phi. It works out to be:

log_phi ~ logistic(0, 1);
phi = inv_logit(log_phi);

which entails

phi ~ uniform(0, 1)

If we don’t include the Jacobian, what we get instead is
the improper distribution of log_phi

log_phi ~ uniform(-infinity, infinity);
phi = inv_logit(log_phi);

which gives phi a “distribution” with delta functions at
0 and 1.

At least I think that’s what’s going on. I always get lost
and have to start counting on my fingers with this stuff.

  • Bob

#20

Yes, that is the definition of MAP but what is missing the parameterization of theta. Lets say that theta \in P where P is the set of (constrained) parameter values and omega \in R^N are the unconstrained parameters with theta = f(omega). Then:

theta* = ARGMAX_{theta \in P} p(theta | y) = f( ARGMAX_omega p( f(omega) | y) )

Where p( - | y ) in that last expression is the same function as in the middle expression and doesn’t include the Jacobian. Reparameterizing a function to respect the constraints doesn’t change the location of the maximum.

Now, if you wanted to define a new distribution over omega (which is what you need to do if you’re going to draw samples from omega and be able to transform them through f( - )) then you get:

q(omega | y) = p( f(omega) | y ) J(omega)

where J(omega) is the Jacobian term. You can thus also ask for the MAP estimate of omega wrt q( - | y ) , i.e.,

omega* = ARGMAX_omega q(omega | y) = ARGMAX_omega p( f(omega) | y ) J(omega)

but the point is that theta* != f(omega*) in general. Thus, we have to make a choice between whether we’re maximizing p( - | y ) or q( - | y ). I would argue that since theta are the parameters that the user themselves actually defined in the model, that’s the most natural domain in which to find the optima. In Stan, omega are largely parameters that are defined for convenience under the hood to handle constraints. Hence, in Stan we optimized p( - | y) rather than q( - | y ).

And finally, to Vincent’s question, it’s not that it’s safer in the initial (theta) domain, it’s just that it is more aligned with what a user would expect, either implicitly or explicitly.