Scalable Bayesian multilevel modeling

One bottleneck with the nested Laplace approximation is doing a high-dimensional convex optimization. We don’t quite have the tools for this in Stan yet. I’m using a 500-dimensional system, which arises in the latent gaussian poisson process (see again this conversation) as a testing case. I’ve gone back to only focusing on C++ code and all I’m doing is solving an algebraic equation and computing sensitivities. So far:

  • Powell dogleg solver requires ~ 2000 s
  • general Newton solver ~ 350 s
  • general Newton using precision matrices ~127 s
  • specialized Newton solver with analytical gradients ~43 s

Here’s the code for the performance tests.

The precision matrix trick means doing everything in terms of the precision matrix instead of using the covariance matrix. If you look at the code, it makes a lot of sense. I tried rewriting the specialized Newton solver with precision matrices, but here, something slows down the whole process, and I’m not sure why. Also, computing sensitivities with precision matrices turns out to be quite expensive, but in theory, it shouldn’t be.

My guess is these unexpected costs are related to memory management.

I guess these are total time over many repetitions?

I see… you want to move to known ground to have more control on the model and also better posteriors, but most importantly you want the flexibility of stan.

Makes sense.

Is a brute force approach with a parallel reduce an option? On aws you get 96 core machines if thats an option to use (doesn‘t mean 96x speedup, but we can go far with threading).

Yes, I think brute force is definitely part of the solution here. We’ll always be getting to problems where brute force is not enough, but it makes sense to use brute force where we can.

I guess these are total time over many repetitions?

Unfortunately not. These are the times for solving the equation once and computing the corresponding sensitivities. I’m not sure how many times the system is solved per MCMC iteration, but the above implementations are prohibitively slow. They scale badly with the number of latent parameters. With dim(\theta) = 100, your runtime is less than 0.2s.

The silver line is that we can shrink these costs. Without going into details, I now have it down to 0.4 second for dim(\theta) = 500 and 0.005 s for dim(\theta) = 100. But before pushing harder on this front, I want to work on other components of the log density.

I have said it before… but you should by all means explore IDAS from sundials. That software has been developed since 2001 by the Lawrence Livermore National Laboratory labs… and they know how to solve this stuff fast. So this is around for 18y now - so it is certainly mature (but maybe clunky to work with in a modern C++ code, ok).

The adjoint method is probably the way to go if I read their intro, see below. In particular it sounds as if you can get the needed gradients per output in independence of the others => you can parallelise that using threads!

Quote from IDAS homepage:

SUNDIALS: SUite of Nonlinear and DIfferential/ALgebraic Equation Solvers
IDAS

IDAS is a package for the solution of differential-algebraic equation (DAE) systems in the form F(t,y,y’,p)=0 with sensitivity analysis capabilities (both forward and adjoint modes).

IDAS is a superset of IDA and hence all options available to IDA (with the exception of the FIDA interface module) are also available for IDAS.

Depending on the number of model parameters and the number of functional outputs, one of two sensitivity methods is more appropriate. The forward sensitivity analysis (FSA) method is mostly suitable when the gradients of many outputs (for example the entire solution vector) with respect to relatively few parameters are needed. In this approach, the model is differentiated with respect to each parameter in turn to yield an additional system of the same size as the original one, the result of which is the solution sensitivity. The gradient of any output function depending on the solution can then be directly obtained from these sensitivities by applying the chain rule of differentiation.

The adjoint sensitivity analysis (ASA) method is more practical than the forward approach when the number of parameters is large and the gradients of only few output functionals are needed. In this approach, the solution sensitivities need not be computed explicitly. Instead, for each output functional of interest, an additional system, adjoint to the original one, is formed and solved. The solution of the adjoint system can then be used to evaluate the gradient of the output functional with respect to any set of model parameters.

1 Like

@wds15: agreed, we need a well-written and tested solver. What about KINSOL, as discussed in previous posts? Are there reasons to prefer IDAS to KINSOL?

You’re right that which autodiff mode we use plays a crucial part. Here, differentiation is done via the implicit function theorem, combined with a few analytical results. Only the Jacobian of the system, f, with respect to the parameters gets computed with autodiff: with rev-mode this takes > 40 s, with fwd-mode < 1 s. But I think we can do better with rev mode and the right initial cotangent, per @betanalpha’s suggestions. This takes one sweep, see details here.

Parallelization in general would be great – and one of the benefits of KINSOL. But even “before”, there are still steps to improve the optimization. In any case, down the road we want to use the solver from KINSOL or IDAS.

IDAS states on its homepage that it exactly solves your problem (right?) which is why i would use that. My guess is that idas is even more optimized for this type of problem than kinsol, but I am not familiar with either, just my intuiution.

My problem, as I currently have it framed, involves solving a regular algebraic equation, whereas IDAS is designed for differential algebraic equation. Ok, the former is a special case of the latter, but I don’t see any other connection.

Exactly. We’ll always be getting to problems where no existing computing is enough. So we’ll never have “scalable Stan” or “fast Stan”, we’ll always be dealing with comments like “Stan is slow” and “Stan can’t solve my problem” from our potential users like Andrew. It’s the way of computer science.

I’d like to point out that we’ve already improved BUGS/JAGS by something like three orders of magnitude in speed and scalability for models that BUGS/JAGS could fit (two from basic engineering and HMC, one from parallelization).

I’m pretty nothing will ever lower the priority “make hierarchical models faster” from Andrew’s perspective. It’s been his number one priority since Matt and I started working for him in 2011 and has never left that spot. Three distractions raise their heads on a recurring basis:

  • the Monster model (“Monster” is someone’s name), an ODE that he and Frederic Bois fit in the 90s on software that’s dominated by Stan. I’m pretty sure Stan could fit it if anyone could recreate the model from the paper or old code, but Andrew insists “Stan can’t fit the Monster model” every time it comes up because nobody has. I keep thinking he’ll find an intern to do it. None of us have the inclination because nobody understands the model.

  • the birthday problem, a GP that Aki fit in GPstuff and Andrew’s been bugging us about fitting in Stan ever since. This one’s going to require serious development, but that’s all been put on the back burner for now in that nobody’s working on it as far as I know.

  • some Stan cloud thing that only Andrew understands his requirements for, so we’ve never been able to make any progress on. I think RStudio Cloud may have satisfied him, but we’ve run out of free cycles and don’t exactly have funds to pay for free Stan cycles.

2 Likes

Hey, Bob. There are lots and lots of things I want to do with Stan beyond the list above! But I understand we can’t do them all at once.
Regarding the items above:

  • Yes, you’re right that there are many orders of magnitude to go before Stan is fast enough that I would be satisfied. Right now we have users who just don’t even try to fit models in Stan because they would take forever (weeks, maybe?). So they’re fitting various approximations and simpler models. Let’s say a model of interest would take a week and a half to fit. One order of magnitude takes that down to a day, which is do-able if (a) you only want to fit the model once or twice, not in some sort of big loop, and (b) you know ahead of time what model you want to fit, so no exploration needed. Another order of magnitude takes it to 2.4 hours, which is better but still makes it difficult to do exploration. Another order of magnitude takes it to 2 minutes, which is just about fast enough that the user can act in a nearly interactive way by playing with models and data.

It’ll take awhile to get 3 more orders of magnitude, but in the meantime I’ll take what I can get, which is why all avenues are important, including brute force, approximate models, and approximate algorithms.

  • The reason I’d like to fit the Monster model (from my 1996 paper with Bois and Jiang) in Stan is that I used it as a teaching example, and it’s worked well in that capacity for many different audiences. So it would be wonderful if, after hearing the Monster lecture, students could go home and fit the model themselves in Stan, play around with the model and data, simulate fake data, etc.

  • The birthday model is another popular example, and it’s on the cover of BDA. Again, it will be great if the model can be fit directly in Stan, for a few reasons: First, the model has problems and I’d like to improve it. Second, the model has other applications than birthdays. Third, I know that we’re working on improved computation for Gaussian process models, and this is a great demonstration example.

  • For the Stan cloud thing, I’m very happy with what Rstudio implemented: https://statmodeling.stat.columbia.edu/2018/10/12/stan-on-the-web-for-free-thanks-to-rstudio/
    If paying for cycles is an issue, I’d be happy with a demonstration version. This is not a high priority for me but I wouldn’t be surprised if at some point it will happen, given how much software is runnable on the web right now.

1 Like

Hi Charles!

I just looked once more and you are right - KINSOL is what you want. That one uses Newton or Picard iterations. Before investing too much work you could poke around with their examples? Maybe one of these can be shaped to sizes you are interesting in to see if KINSOL is worthwhile to follow up?

Sebastian

Sounds good. Last time I gave this a try, I struggled a bit with KINSOL’s API, but I’ll take another dive and work through their examples.

Yeah… not surprising that you struggle with that old C library beast… but that’s why I say to modify their examples first. That is hopefully a lot easier and you should get out of that already an answer if it is worthwhile to do the work.

So what do you want to prioritize?

How is this not saying do them all at once?

We’re blocked on that because nobody other than you understands the model or the data. If you can write the model down as a sequence of sampling statements (LaTeX or chalkboard is fine) and write the diff eqs down, and you can get the data into an R list, then I’d be happy to code it and fit it.

I’m not sure how to measure popularity, but setting that aside, we just need programming effort in order to be able to scale GPs. The design’s been done since last summer in Helsinki.

Then there was the approximation that Aki’s team was working on. Did you not like that one? It could fit the birthday problem in Stan, just not the exact same GP.

Anyway, my basic point is that we can’t work on every avenue—we need to prioritize. Personally, I’d rather prioritize the engineering we know how to do, like scaling GPs. I’m not sure what Andrew wants to prioritze.

1 Like

Hi, see responses below.
Andrew

andrewgelman:
but in the meantime I’ll take what I can get, which is why all avenues are important, including brute force, approximate models, and approximate algorithms.

How is this not saying do them all at once?

I do want to do all these at once, but as part of solving this problem, not as part of core Stan. Considering the three items I listed above:

(a) Brute force: That’s existing Stan features such as MPI and GPU, and to-be-implemented features such as improved adaptation and hard-coding of certain hierarchical models.

(b) Approximate models: That uses existing Stan also. “Approximate models” is what people do all the time, when they want to fit big model X, but that would be too slow so they fit smaller model Y.

(c) Approximate algorithms: The plan is to implement these outside of Stan (with the exception of ADVI which is already in Stan), for example in R calling Stan, and then if they work well on these real problems, we might want to put them into Stan or build more formal R and Python wrappers for them.

I think it’s ok to do (a), (b), and (c) at once:

(a) is already being done by Stan developers

(b) is already being done by Stan users

(c) we will do as needed

andrewgelman:
The reason I’d like to fit the Monster model

We’re blocked on that because nobody other than you understands the model or the data. If you can write the model down as a sequence of sampling statements (LaTeX or chalkboard is fine) and write the diff eqs down, and you can get the data into an R list, then I’d be happy to code it and fit it.

It might make sense for me to do this with Charles when he returns in the fall, as he has experience with these differential equation models.

andrewgelman:
The birthday model is another popular example,

I’m not sure how to measure popularity, but setting that aside, we just need programming effort in order to be able to scale GPs. The design’s been done since last summer in Helsinki.

Then there was the approximation that Aki’s team was working on. Did you not like that one? It could fit the birthday problem in Stan, just not the exact same GP.

Aki and I discussed this. It’s not urgent, we can just talk about it during our next visit to NY.

Anyway, my basic point is that we can’t work on every avenue—we need to prioritize. Personally, I’d rather prioritize the engineering we know how to do, like scaling GPs. I’m not sure what Andrew wants to prioritze.

I support your thoughts on priorities for Stan engineering. I also like application-driven projects. There are people whose focus it is to get these hierarchical models running faster, they can do what is necessary. If at some point there are internal Stan improvements that could help a lot, then we could discuss priorities. But at this point I think that the scheduled improvements in Stan computing and Stan language will help in this and other problems.

It might make sense for me to do this with Charles when he returns in the fall, as he has experience with these differential equation models.

Happy to do it!

I get confused when I see things like the above followed by things ike the below:

My main concern is about opportunity costs of staffing. Any time spent on approximation algorithms is time we don’t spend on MCMC.

By “core Stan” do you mean the C++ in the math, stan, and cmdstan repos? The line that’s expensive to cross is releasing things to users you want to support. If we do that as part of Stan (including in RStan or another package we support like loo), it’s a commitment to our users.

Is anything urgent?

That’s usually where I need to start before thinking about generalizations.

2 Likes