Vague Algorithm Roadmap

As seen in the discussion within https://discourse.mc-stan.org/t/draft-stan-roadmap/6179 and Some issues that may arise regarding transformations there is plenty of differing opinions (although I believe that they are mostly due to misunderstandings rather than any substantial conflicts) about the current state, immediate future, and far future of algorithms in Stan. As the algorithm lead (leader, tzar, cihuacoatl, or whatever the we end up calling it) I wanted to lay out my current vision of these circumstances.

Current State

Currently Stan exposes MCMC with dynamic Hamiltonian Monte Carlo, MCMC with static Hamiltonian Monte Carlo, penalized maximum likelihood, and ADVI.

The MCMC code is relatively mature with the biggest need collecting and exposing common diagnostic methods as discussed in Analysis API, with some initial work by @roualdes.

The algorithms implemented by the optimization code are mature but the code itself needs significant refactoring and improved testing.

Similarly the ADVI code is a sloppy and needs significant refactoring and testing. Additionally the algorithm itself is fragile and prone to failure. Refactoring would certainly help to isolate the various heuristic subalgorithms within ADVI and facilitate a more principled study of the various choices that might lead to a more robust algorithm, as would inclusion of diagnostics as recently worked on by @yuling and @avehtari. Regardless there is a ton of work to be done here and given the fragility of the underlying algorithm I would certainly not oppose removing it from the interfaces until it can be significantly improved.

Immediate Future

Given our general user base I believe that we have need to focus on algorithms and user experiences that are both robust (i.e. they work on a wide class of problems) and diagnostic (i.e. they inform when they’re failing). For example, we can’t just add importance sampling corrections to an inaccurate algorithm if the algorithm can’t reliably get close enough for the corrections to be viable. If every fit ends with khat warnings them people will ignore the warnings or give up on Stan! Note that the current state of ADVI would fall strictly outside of this focus.

In my opinion the only algorithms with the potential for inclusion within this focus are variants of INLA. That said, exactly how these methods might be exposed is uncertain. Because INLA works well on a particular set of models it might not be well-suited to use as a general algorithm but rather a default algorithm in a higher-level interface like rstanarm or brms where the class of models can be limited. Another possibility is defining marginal probability density functions that use INLA algorithms internally to approximate a high-dimensional marginalization.

In any case, while there is a promising future for incorporating methods developed for INLA into Stan but there is still plenty of uncertainty into how exactly that might proceed and hence plenty of research to be done. The work being done by @charlesm93 and @anon75146577 is particularly important.

I’m not including Riemannian Hamiltonian Monte Carlo here because of its dependency on higher-order autodiff, but even once we figure that out there are a few adaptation/diagnostic issues that linger. Similarly Adiabatic Monte Carlo has one missing algorithmic piece that obstructs its general implementation.

Far Future

Basically, everything else. One of the things that has made some of the recent discussions confusing is that some people are talking about preliminary, general research while others are talking about more mature, targeted research.

In my opinion if an algorithm does not have thorough pseudocode then it cannot be considered for inclusion into Stan and hence belongs in the far future discussion. This includes GMO, EP, and other heuristics that have been proposed. This does not mean that these algorithms will never be a part of Stan or aren’t worth researching (I’m reserving my persona judgement here) – it just means that they are not mature enough to warrant the technical overhead of discussing how they might be included in Stan. Maintaining this separation (i.e. “I’m working on this algorithm” or “I’m experimenting with this new algorithm I read about” verses “My colleague came up with a new algorithm – when will it be in Stan?”) will make it much easier for everyone to understand and participate in these discussions.

Stan is first and foremost a tool for users. It also happens to provide a powerful toolkit for researching and developing algorithms, but that use case becomes problematic when it compromises the first priority. For example, exposing algorithms that haven’t adequately been tested is problematic, as is unilaterally promising ill-defined or underdeveloped algorithms to users.

Again I think most of us are on the same page but confused because of the haphazard language of which we are all guilty. In the future please try to frame algorithm discussions into the basic themes “I want to fix an existing algorithm (here are experiments)”, “I want to propose an explicit algorithm (here are experiments)”, or “I am researching a new algorithm if anyone wants to check it out (here are experiments)”. Okay, technically the parenthetical aren’t necessary but they’re always highly encouraged.

8 Likes

Weird question: Are there parts of the algorithms code that could be exposed to improve this type of development? For example, if I had a Gradient Descent Nonlinear Solver (like the simplest type) for finding an optimum are there any tools that would make it easier to generate reasonable tests within the Stan framework? Would it be useful to build a vignette on how to implement a super simple algorithm like that as a potential Stan algorithm?

3 Likes

This is related to the needed refactor of the optimization code, which right now is hard to read and duplicate with additional optimization algorithms. A good, well-tested refactor would make it clear where to drop in a new solver (or maybe just a new line search, etc) and have tests that could be mirrored. In general vignette’s would be great, but they’ll be limited without a clear code base to facilitate the process. Right now we could do a very general “this is how you use the model instance in C++” vignette.

Can you give an indication of what needs to be refactored (or is it everything)? Because it would be a good thing to split up and try to do (especially for code that works and that should be fairly simple, like the optimization code)

Ideally the optimization code could be factored into separate modules roughly organized as

  • High level solver template that takes in a model instance and a state and returns an updated state.
  • Specific implementations for L-BFGS, in particular storage for the persistent memory needed.
  • Common code for line search methods, trust region methods, etc that could be used by any implementation.
  • Common code for convergence monitoring to determine when to an updated state is sufficient at the solver terminates.

Ideally this would make specifying a new algorithm a bit like Legos – you choose a solver skeleton, the specific line search that you want, the convergence check that you want, etc and then snap them all together with templates or the like. Moreover each of the (reusable) component pieces could be then tested independently.

3 Likes

I think you and I are on the same page here. But you’re right that not everybody is given the constant barrage of requests to add new algorithms and lack of interest in what you bring up next:

I completely agree here.

That’s also my understanding from talking to @anon75146577. I believe he’s evaluating both of these options.

I agree on the general point here. I’m now totally confused about what “GMO” refers to. Aki tells me it’s just about anything. But there’s a paper with crisp-ish pseudocode and the more precise version I wrote so I could understand what’s going on.

I think we could refactor our algorithm code to make it more generally useful. That would require decoupling things like L-BFGS from the Stan model class and pulling out and generalizing the gradient descent solver in ADVI (though I don’t know when you’d ever want to use such an algorithm if the L-BFGS algorithm was available unless you had a really huge data set, in which case you’d need stochastic gradient descent, anyway).

There are two things worth teasing apart here:

  1. An algorithm that we will expose to users such as NUTS or L-BFGS.

  2. An algorithm that we will make more useful within the code base.

I’m not sure which you’re talking about, nor am I sure how to write up either. It’s just general software engineering applied to implementing new services calls. You try to give your algorithms useful and general APIs, which is a lot of judgement and guesswork and isn’t free compared to just implementing what you need.

Is there something special about Stan that you think we need to call out?

1 Like

These are priorities one and two for me. This was a huge struggle when HMC, L-BFGS, and ADVI were being implemented. New algorithm authors want a short path to seeing their work go live.

@syclik did a ton of work for L-BFGS and ADVI trying to make it modularizable and testable; @betanalpha eventually did the same for HMC (which has the nicest design of all these components, in my opinion).

Yup! Lots of good research going on there.

The version that uses third-order derivatives has been around since the early TMB days – there’s a nice paper on it, long before “GMO” was introduced into the lexicon. The exact definition of “GMO” has oscillated between that algorithm and others, so it’s been a bit of moving target hence a lot of the confusion!

There’s a version of GMO that uses third-order derivatives?

What is “TMB”?

There’s a version of maximum marginal likelihood that uses third-order derivatives. This was one of the first suggestions for “GMO” and one that has been around for a while. See, for example, https://arxiv.org/pdf/1509.00660.pdf.

Template Model Builder, the successor to Automatic Differentiation Model Builder.

I don’t know why I didn’t have that context queued up given that they do Laplace approximations. Especially since I went to a conference and hung out with its original and current developers!

All these acronyms make me feel like I’m having to simultaneously solve crossword puzzles while figuring out what someone’s saying. “TMB” I should’ve gotten, though.

When I framed what I understood as GMO, it looked like MCMC-EM, gradient-based versions of which was discussed in the original Wei and Tanner (1990) JASA paper, “A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation.”

[edit: Yikes, I’m just as bad as everyone else in terms of acronyms!]

Yeah, there are three big components that are often confused here. One is approximately marginalizing out the latent terms, which is what the Laplace approximation does. The second is computing the gradient of the marginalized density for the hyperparameters. The last is what inferences to make for the hyperparameters – lm tries to find the maximum marginal likelihood without gradients while others, such as the TMB approach, uses the gradients to find the maximum marginal likelihood more efficiently. For additional context, INLA approximates marginalizing over the hyperparameters with numerical integration.

Acronyms and the general branding of algorithms is a plague that confuses users. That’s why I’ve avoided renaming each iteration of the dynamic HMC implementation in Stan with some new name, and in general try to stress the differences between general concepts (what the algorithm does and the different steps needed to do that) and a particular implementation (how those steps are implemented both mathematically and algorithmically).

I really appreciate this and understand that it’s a reputation hit in the wide world, which wants new algorithms all the time.

Like you, I’m pushing back against new algorithm inclusion without extensive validation. After ADVI, it’s sort of a fool me once, shame on you, fool me twice, shame on me situation.

2 Likes