Addressing Stan speed claims in general

Chris, thanks for following up about the FFI topic. For my present project, Stan is what my collaborators require, and it is the way to go because of the easy log-likelihood calculation, so when time is available, I’d definitely like to contribute to the state of the differential equation solvers in Stan (which would also help me finish my projects and, alas, finally allow me to graduate). There is additional ongoing discussion about figuring out differential equation development priorities in Stan here: Stan differential equation functionality development priorities.

Because the FFI situation was unknown, I had a growing interest in contributing an auto-switching ODE solver, which is probably the algorithm that would be most beneficial to my own projects. However, we did weigh some pros and cons of building solvers versus putting resources toward an FFI instead. If the Stan devs do decide to put resources toward an FFI after further discussion, then building native solvers in Stan Math would be redundant and a less efficient use of effort.

3 Likes

Sorry, I did not mean to imply that there is no principled modeling outside of Stan. Speed of course matters, but my experience matches those above: parameterization almost always gains more speed than the computational facilities do. Of course, the two work together, and coordinating the guts of these things is key.

It’s more about the direction of the Stan community, as @betanalpha has pointed out. At some point, there will be something better in terms of ease of coding, raw speed, down-to-the-silicon coding, etc. But I’ve seen in my field that speed-focused approaches/methods/languages typically address correctness later. Again, this doesn’t have to be universal, but it’s a trend, and in the academic world it is a bad habit for early-stage students to learn.

The ODE stuff is a relevant point for sure. Even in astro we do not see such a need for that, but this is highly field dependent and that type of thing is definitely worth investing in. I wonder, as I do not know, whether the approaches in Stan are chosen based on numerical stability, etc. One can certainly autodiff through all of Julia, but does that translate to stability for free? I remember being frustrated that Stan couldn’t do high-dimensional interpolation out of the box… but you begin to understand that this is a very bad thing to do with HMC in the first place. In PyMC3 you can sample integers… not a good thing…

Off topic: how many of the things that come for free in JIT-compiled languages do not require some of the carefulness that goes into delicate low-level numerical code? I’m asking out of ignorance.

3 Likes

Just wanted to say I appreciate everyone taking the time to chime in here. Thanks to the Stan users and devs as well as the Julia and Turing folks!

11 Likes

I will take a crack at responding here but I will try to be brief not to derail the original discussion too much. I can speak for Julia and Turing. First of all, nothing in Julia is completely for “free” but the asymptotic cost is kind of free because you simply have to pay the cost of developing a library once. Then this library can be used along with an exponential number of combinations of packages with no additional development cost. This talk describes why this super-synergy happens in Julia but not in a language like C++ for example. What this means is that I can run an ODE solver using a high (or low) precision fancy number type on the GPU and embed it in a Turing model to run MCMC on the ODE parameters or perform MLE or MAP using an optimization package without the developers of the ODE solver, the GPU package, the fancy number type, the optimization package or Turing ever trying to make their packages compatible with each other, in the ideal case. In reality, there are a few things to keep in mind when developing a package to make it sufficiently “generic” but they are not many and can be taught to beginners.

What the above synergy does is create a system where a clear separation of concerns tends to happen. The ODE folks can develop and thoroughly test ODE solvers, the number-type folks can develop and test number types, the optimization folks can develop and test optimization algorithms, etc. Everyone needs to make sure their implementations are correct, numerically stable, fast, etc. So the low-level care is there when developing each package, but the package is only developed once and then re-used in more ways than the author ever imagined. I hope that answers your questions.
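To make that composability concrete, here is a minimal sketch (my own illustration, not from the post above): one generic Julia function, written with no particular number type in mind, reused with ordinary floats, high-precision floats, and dual numbers for automatic differentiation, without modifying the function at all.

```julia
# Minimal illustration of generic-code composability in Julia (sketch only).
using ForwardDiff

# A plain function written with no particular number type in mind.
logistic(r, t) = 1 / (1 + exp(-r * t))

logistic(0.5, 3.0)                                  # ordinary Float64
logistic(big"0.5", big"3.0")                        # high-precision BigFloat, for free
ForwardDiff.derivative(r -> logistic(r, 3.0), 0.5)  # derivative via dual numbers, also for free
```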

This reply turned out longer than I had liked. To the admins, feel free to move this comment to a separate discussion if you think it’s off-topic.

5 Likes

We’ve been using Stan for DDE and SDE problems for years now, and the difficulty is always parametrizing the model for good posterior geometry.

This is essentially the same point you make in your own SciML and UDE papers, namely that better model structure is the way forward for machine learning.

Having Stan make use of Julia solvers is a matter of plumbing in the software, either at the Stan math library level or in the compiler (e.g., targeting Turing.jl).

1 Like

I’ll second that. Sorry I missed the last month of conversation somehow!

I appreciate long-form responses, so thanks!

The result should still give you the right stationary distribution asymptotically. As far as assurances, I think the best you can do is simulation-based calibration for the algorithm and then posterior predictive checks for the fit.

For example, PyMC lets you alternate discrete sampling via Gibbs with continuous sampling via HMC. That’s something we can’t do in Stan.

Our original design was to do just that, but we finally realized that discrete sampling is very problematic computationally (it doesn’t mix very well).

What would be nice to have is some applications of simulation-based calibration to see if it works. The problem we foresaw is that HMC is very sensitive to tuning step size and mass matrix to the geometry being sampled, but that changes every iteration with alternating discrete sampling.

We don’t know how to get around that. Writing an efficient Stan program requires writing efficient and numerically stable arithmetic and also thinking about autodiff execution, which is unusual in its two passes, the second of which is interpreted.

I’ve become convinced that the benchmark we want is wall time to a minimum effective sample size of 200 or so. That’s what most inference demands. For those who need larger ESS, the important thing is ESS per unit time after warmup.
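To make those metrics concrete, here is a small sketch of how they might be computed from a finished run; the ESS values and timings below are made up purely for illustration.

```julia
# Hypothetical numbers, just to make the proposed metrics concrete.
ess_per_param = [450.0, 380.0, 900.0, 210.0]  # ESS for each parameter (made up)
warmup_time   = 12.0                          # seconds spent in warmup (made up)
sampling_time = 30.0                          # seconds spent sampling (made up)

min_ess = minimum(ess_per_param)

# Metric 1: wall time (including warmup) to reach a minimum ESS of 200,
# extrapolating linearly from the observed run.
time_to_ess_200 = (warmup_time + sampling_time) * 200 / min_ess

# Metric 2: minimum ESS per second of post-warmup sampling.
ess_per_second = min_ess / sampling_time
```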

What we find a lot of is people saying their iterations are faster than Stan’s without accounting for ESS or even whether they get the right answer.

Is there something specific here you think we should be investigating (other than recoding Stan in a different language)?

I’m not sure what you mean here, but some of the PPLs, like Edward, were explicitly motivated as algorithm test beds.

You mean that we require all the functions to be declared in the parser and documented in our documentation?

That’s 10,000 Julia developers in all? How many of them write the kind of differentiable code you can use in Turing.jl? I really like the basic idea and understand the argument for it.

My understanding is that it’s implemented something like the way “object-oriented” code is implemented in R: by overloading functions of the correct signature rather than defining methods on objects directly, as in other OO systems like C++ or Java. It’d be super cool if you had a lot of candidates that had robust and efficient autodiff.

How does someone know what’s available as a function in Turing.jl?

There’s the same issue with systems like PyMC3. They’re not really just importing any old Python code any more than Stan can import any old C++ code. It has to be tailored to their Theano back end.

How many people manage to do that? That’s what I was asking when I was asking how many of those 10K developers add robust gradients to their packages? We’ve definitely taken the more Apple-like approach on purpose to avoid shifting this burden to our users and to be able to control the overall experience. We know that becomes limiting in terms of functionality.

In case this wasn’t clear, we’re doing exactly the same thing in using tried and true ODE integrators that others have developed. We’ve evaluated a bunch of systems that people have told us were better than what we’re using, but we haven’t found ones that were better than what we have. I’m not saying Julia doesn’t have better solvers, just that the ones we’ve evaluated in C++ haven’t been great. It’s clear Julia devs think Julia has the best ODE solvers, so one option would be to translate those to C++ for inclusion in Stan. I would think that would result in faster code in the end.

Speaking of which, how many of Julia’s ODE solvers are written in Julia and how many are FFI from C/C++ or Fortran or something else? I have trouble with the docs for packages like Julia’s that list 200 choices of ODE solver—I’m not enough of an expert to wade through all the options.

I guess what I’m saying here is that having too many choices can also overwhelm new users, though many of them just cut-and-paste what their labmates did last year (which means no discovering new functions, but also hopefully not making poor choices).

We have had to think about adding sensitivities, though, since the popular off-the-shelf solvers don’t supply accurate versions of those.

@yizhang has been working on that for PDE solvers, but I’m not sure if he’s looking at a general FFI or just something that’ll allow C++ on the back end.

Right now, we only allow C++ extensions by users.

What’s a CFD simulation?

Right. The stats people don’t do a lot of diff eq solving and the diff eq people don’t do a lot of stats. I work with a bunch of diff eq solving specialists now in my new job.

The Stan developers understand quite well that there’s a vast array of useful models that aren’t so easy to code in Stan. SDEs are one of the things we have a grant at the moment to address. Neural networks are the 800-pound gorilla in this room, and we also don’t have effective ways of fitting those. Systems like TensorFlow Probability really lean heavily on their integration with good NN libraries. (I have no idea where Turing.jl is on that.)

It’s not that, it’s the point you made before—we don’t have 10,000 developers! We’re more than happy to take PRs around all kinds of new models we don’t currently fit efficiently or easily.

It’s come a long way since @syclik and I hacked together the first version that just autodiffed through the RK45 algorithm in Boost!

We’re certainly not trying to do that. We’re not even getting entangled in most of these evaluations because we usually can’t get people to do the evaluations we care about.

My advice is to do what we did for our autodiff paper on arXiv—write to all the developers of those other systems with your benchmark code and ask them if there’s a way to improve it before publishing misleading benchmarks. It’s a lot of work, which is why we usually don’t do it!

I agree completely. The TensorFlow Probability folks did a fairly thorough benchmark recently that didn’t make me cringe (which is saying something, given the number of bad evals I’ve seen in papers and email).

@mans_magnusson has a repo with our current thinking about developing benchmark examples that can be used by other systems. It has the model definitions and 10K draws with an effective sample size of roughly 10K. It’s not yet an official Stan project—it’s still in Mans’s personal repo. It looks like Julia has MCMCBenchmarks.jl, which appears to be somehow tied to McElreath’s book Statistical Rethinking.

Thanks for letting us know. The paper was very dismissive.

That’s the kind of thing that’d be great to add to Stan. All we need to communicate are parameter values/data in and the solutions and sensitivities out, so an interface wouldn’t be too hard.

We’ve tried to be very careful about this and in some cases trade speed for robustness. For example, @Adam_Haber just did a nice comparison of the correlation matrix Cholesky factor transform and derivatives. It turns out TensorFlow’s bijector used a different approach that was faster and worked fine with a lot of data, but was less robust with small data situations. So we decided not to change how we coded that.

Plumbing is hard! Especially if you want things to be easy to install and use and robust in operation across operating systems and platforms.

I think it helps to work on both fronts. Better algorithms aren’t enough by themselves, but they’re an orthogonal concern to better model parameterization (where “better” for HMC means a posterior that’s closer to standard normal).

6 Likes

Computational Fluid Dynamics. What I’ve done for PDEs is a proof of concept; more realistic work has been started by @IvanYashchuk in his PETSc project, Using PETSc library together with Stan. I don’t think it’s meaningful to jump into a large-scale 3rd-party simulator before we can clean up Stan’s MPI framework, as @IvanYashchuk has realized.

That’s a PyMC3/Theano issue, mostly because the “standard library” in that case conflicts with numpy usage. Theano, PyTorch, JAX, and TensorFlow have this issue because they have a sublanguage that is separate from how the vast majority of packages are implemented, and only packages implemented with semantics defined in that sublanguage (which is not even close to Python) are differentiable. That’s not a Turing.jl issue. In general, generic algorithms will work in Turing.jl because differentiability happens on the language, its AST, and its original functions as defined in Base and the standard libraries, not on a sublanguage. DifferentialEquations.jl was found to be compatible because someone tried it, not because someone actively made it work (their success is actually how I got into the field), and it’s not the only example of that occurring. We actually have a paper on this topic where we slap together rootfinders and ODE solvers to get a trebuchet, use a quantum circuit library (Yao.jl), and use the SDE solvers, which are all directly differentiated. A lot of this universal differential equation work is doing its next demonstrations on things like the CliMA climate model, which is differentiable simply by existing as Julia code. A lot of examples include libraries like BandedMatrices.jl, which were written before the AD libraries and not changed for the AD libraries, but are compatible with AD libraries because of how it all works on generic dispatches.

So yes, you can’t take “any” old code since there will be limits (mutation is the biggest limitation), but “most” libraries we use are autodiff-compatible without the authors even knowing it, so yes, there are thousands of people working on libraries that you can use in Turing.jl. Try it!
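To give a hedged sense of what that composition looks like in practice (the toy model, priors, and variable names below are my own invention; the official Bayesian differential equations tutorial is the canonical reference), a DifferentialEquations.jl solve can be embedded directly in a Turing model:

```julia
using Turing, DifferentialEquations

# Exponential decay ODE with unknown rate k (toy example, invented for illustration).
decay!(du, u, p, t) = (du[1] = -p[1] * u[1])

@model function fit_decay(data, ts)
    k ~ truncated(Normal(1.0, 0.5), 0, Inf)   # decay rate
    σ ~ truncated(Normal(0.0, 0.5), 0, Inf)   # observation noise
    prob = ODEProblem(decay!, [10.0], (0.0, last(ts)), [k])
    sol = solve(prob, Tsit5(); saveat = ts)   # the ODE solve is differentiated by the AD backend
    for i in eachindex(ts)
        data[i] ~ Normal(sol[1, i], σ)
    end
end

# ts and data would be observation times and noisy measurements (not defined here).
# chain = sample(fit_decay(data, ts), NUTS(), 1_000)
```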

That’s a separate story… when features come for free, discoverability is a bit harder. DiffEqFlux.jl, for example, is basically a library that is just there to define a few helpers and document a ton of things that are compatible with differentiation, like neural graph differential equations, because without a library to search for, no one can find out they can do it (though the library itself did not have to be modified for the example to work). Because of this, probabilistic inference of neural graph differential equations will work in Turing.jl, though nobody has thought to do it, so hey, free paper for whoever wants it. How to document features that are automatically generated by composability is a very interesting question, and not one we really have a great solution to, other than writing a billion tutorials.

The vast majority are pure Julia. Sundials.jl, ODEInterfaceDiffEq.jl, and LSODA.jl (and the benchmarking libraries MATLABDiffEq.jl, SciPyDiffEq.jl, and deSolveDiffEq.jl) are wrappers around external methods. The rest of the libraries are Julia libraries, and most of the methods are in OrdinaryDiffEq.jl and StochasticDiffEq.jl (with DelayDiffEq.jl then generated from those libraries).

That’s why we have a default algorithm chooser: solve(ODEProblem(f,u0,tspan,p)) will work with forward- and reverse-mode autodiff, and it’ll choose the method for you, along with stiffness detection and all of that. The rest is details for people who want to optimize.
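For readers who haven’t seen it, a short sketch of what that looks like (the Lotka–Volterra system here is just a stand-in example, not from the post above):

```julia
using DifferentialEquations

# Lotka–Volterra predator–prey equations as a stand-in right-hand side.
function lotka!(du, u, p, t)
    du[1] =  p[1] * u[1] - p[2] * u[1] * u[2]
    du[2] = -p[3] * u[2] + p[4] * u[1] * u[2]
end

prob = ODEProblem(lotka!, [1.0, 1.0], (0.0, 10.0), [1.5, 1.0, 3.0, 1.0])

# No algorithm supplied: the default chooser picks one, including
# stiffness detection, exactly as described above.
sol = solve(prob)
```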

Computational fluid dynamics, like CliMA/Oceananigans.jl on GitHub (Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs), for uncertainty quantification of large-eddy simulations and climate models.

Turing.jl has high-order adaptive implicit methods because it supports the Julia language, so the SDE solvers at https://diffeq.sciml.ai/stable/solvers/sde_solve/ can be used in Turing. It’s part of the Bayesian Differential Equations tutorial, which will update in a few days to a new version that includes delay differential equations. Stochastic delay differential equations then follow as well, and again that’s back to the discoverability problem when features work by default, but that’s a little niche for a tutorial in a PPL library.

Yeah, the hardest part of the plumbing in that case is that the sensitivity equations for differential equations are themselves differential equations defined by AD calls, meaning that we cannot treat the ODE/SDE/DDE/DAE/SDDE/… RHS function as a black box (if we want to do it efficiently). We need some way to swap back to a native AD for that RHS. We’re trying to work that out for integrating with PyTorch right now, but it’ll take some work. And we’ll want to do the same with Stan.

3 Likes

+1 to this. Another good example is the data.table benchmarks, where they continuously updated their benchmarks and contacted all the other package developers to make sure the code they were testing is correct and performant. I’d really love to use posteriorDB for something like this.

I really like this. I’ve also harped at folks before about the “hurdle rate” as a good measure (given N warmup and M sampling iterations, what is your ESS?). Warmup is the real mind-killer!

For testing autodiff I think you really just need to rip away all the other modeling stuff and just look at it more like microbenchmarks on the forward and reverse pass.

3 Likes

Also, I won’t speak for @mans_magnusson, but I really like the idea of reaching out to other PPL developers and pitching posteriorDB as a good solution for testing across languages. I don’t know how ready it is for that, but if we want something everyone uses, then I think it would be good to get feedback from other PPLs.

2 Likes

Didn’t expect to see CliMA here. I learned CUDA coding for ocean modeling from Frank Giraldo, who’s also on the CliMA core team. IIRC Oceananigans is just an incompressible CFD solver. It doesn’t come with UQ capability, and atmospheric modeling is handled by other CliMA components. So it’s a stretch to say it does UQ of climate modeling, and it is about as relevant to Stan’s capability as saying C++ can do CFD.

Hi!

It is one of the issues I am trying to solve right now. The main problem is that we want to be sure that all models included from different PPLs are exactly the same (i.e., give the same, or proportional, log density for the same dataset), and this needs to be done using continuous integration with Travis.

We started with TFP and PyTorch but there are now some discussions on adding Julia PPLs as well. So, yes, I fully agree with you!

2 Likes

+1
I will open a PR at posteriorDB once I have the time for it. But I really like the idea. Regarding sending code to devs of other PPLs, that is exactly what we recently did for the benchmarks of dynamic models.

Yes, I will open a PR soon. Sorry for the delay @mans_magnusson!

Oh! No worries, at all!

1 Like

Good point. Just to remove some confusion about the benchmarks we did for the workshop paper on DynamicPPL: those are solely timing the infrastructure. This is intentional, as DynamicPPL does not handle the inference but only provides an infrastructure for probabilistic inference in static and dynamic models. Meaning, DynamicPPL does the bookkeeping and implements a rudimentary DSL, but does not know about any inference algorithm. The purpose of our paper was to analyse only the performance of this infrastructure in order to find possible bottlenecks. The inference algorithms themselves are completely independent of it and have been benchmarked and evaluated separately. Thus, we did not want to mix goals here.

This was not a Stan-specific comment; I apologise if it seemed that way. It was more of a general remark about something I personally find important.

I agree that Edward and others are PPLs designed for research, and I don’t have an issue with that. I don’t consider Edward a wacky PPL; it is well suited to research on VI.

Turing.jl doesn’t have a particular focus on a modelling family. As pointed out, the flexibility of Turing.jl simply also comes from the fact that Julia packages nicely compose, and you can therefore run the inference on a GPU, use packages that implement SDEs, neural networks, or any other interesting class of models such as probabilistic circuits or normalising flows. Initially, Turing was motivated by Bayesian nonparametric models, and many of us are still very interested in providing a platform for BNP.

I apologise if it seemed that way. After having a second look at the paper, I don’t quite know what you refer to, but I am very sorry if it seemed that way to the Stan devs.

4 Likes

Thanks for the reply and references. I didn’t see the automatic ODE solver when browsing through the library; a default solver is also how R deals with the problem, and Stan too, though ours isn’t an auto-switching stiff/non-stiff method. We just default to RK45.

Dealing with the autodiff at the language AST level is great. In C++, we can’t quite do that with Stan because most scientific code isn’t written generically enough.

That’s what we did in the autodiff paper.

Given that the warmup and sampling effort is configurable, it might be better to measure log density and gradient evaluations. Or to just reframe everything in terms of time, but that then leans on implementation details.

Sorry—I don’t want to sound super defensive. I thought I’d cut that out of the response. Given that I didn’t, let me try to explain. First, I was specifically referring to “Turing: a language for flexible probabilistic inference”, by Ge, Xu, and Ghahramani.

And don’t get me wrong—it’s not like the paper was claiming we were idiots. It’s just doing the usual paper job of selling itself by painting everything that came before as old fashioned in both design and implementation. The pressure to frame things this way is the main reason I dislike writing conference papers and grant proposals. I also wasn’t keen on the framing of the code size comparison.

In terms of specifics, I don’t like that it perpetuates the widely held misconception that Stan can’t do what the paper calls “stochastic branching” (such as conditionals, loops, and recursions). Maybe that impression came about because we don’t allow varying numbers of parameters, so can’t build fully “non-parametric” models like Dirichlet processes other than through approximation. Or maybe it’s because stochastic branching is actually a suboptimal way to code a bunch of models that we recommend coding by marginalizing out the discrete parameters.

The very last paragraph of the paper touches on something I built into the very first design of Stan’s model class:

Furthermore, the current Gibbs engine implementation in Turing is suboptimal and requires a full sweep of the model function in order to update one variable. This inefficiency can be addressed by leveraging dynamic computational graphs to explicitly represent dependency relationships between model variables. We leave this development and application as an avenue for future work.

But we never released it because it would be so slow that we thought everyone would get the impression that Stan was a bad discrete sampler rather than a good continuous sampler. We also realized that marginalization is way better where it’s possible. Discrete sampling is usually combinatorially intractable (typically NP-complete) when the discrete parameters can’t be marginalized efficiently. There are some cases in the middle, like missing count data, that we’d like to be able to handle with discrete sampling but can’t explicitly marginalize.

We’ve been trying to figure out how to do what they’re suggesting in this last paragraph at our AST level rather than the dynamic computational graph level (at least if that latter thing refers to the expression graph built up by reverse mode). At least for Stan’s architecture, that’ll be much more efficient than working on the expression graph for reverse-mode autodiff (we can’t tape and re-run; we tried and it was slower than just redoing the forward pass from scratch). I’m afraid I can’t find the publications around automatic marginalization of discrete parameters, but they would’ve been coming from one of Matthijs Vákár, Maria Gorinova, or Ryan Bernstein, who have all been working on similar problems, including automatic reparameterization, semantic lint modes, variable movement, etc.

6 Likes

No worries. I cannot say much about that paper as I joined the team afterwards, but Hong Ge is quite a Stan fanboy himself. So I suppose the wording was unfortunate and was trying to “sell” Turing as a new PPL. But that is speculation.

Yes, I think this is the case. From what I understood through discussions with Hong, he started Turing out of a need to perform inference in rather complex nonparametric models containing recursions. We are not there yet with Turing.jl, as we have focused mostly on robustness and correctness for now, but we aim for more advanced models such as nested models and complex nonparametric models.

Cool, I wasn’t aware of that.

True, in Turing.jl we are currently putting quite a bit of work into automatic computation of full conditionals for Gibbs sampling, but we also recommend marginalising out parameters if possible.

In our case this is done by manipulating the intermediate representation of a Turing.jl model, which is a lowered form of the AST used for code generation that contains information about the data and the control flow. This is then used to trace the execution and dynamically construct a dependency graph that can be used for automatic computation of Gibbs conditionals, AD, or caching of execution paths for particle Gibbs. At least that is the hope. :) In addition to that, we plan to have some simple transformation rules on the AST level that can help to do automatic marginalisation or reparameterization for static models in a more efficient way.

1 Like

7 posts were split to a new topic: Stan backend for NumPyro + performance comparison

There’s an experimental tool that’s used for benchmarks around Turing and Cmdstan that Rob developed a while ago. Here’s the link:

1 Like