Addressing Stan speed claims in general

If a dataset is simulated you should have many of them, and you should report statistics about the distributions of ESS/sec, RMSE, etc. This gives a lot more information and doesn’t unfairly penalise one method over another just because of bad luck with the simulated data. This type of test also tells you a lot about how well a method does across the range of models that are appropriate to the data.
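As a rough illustration of that kind of reporting, here is a minimal Julia sketch; `simulate_data` and `fit_and_score` are hypothetical stand-ins for whatever data generator and sampler are being compared.

```julia
using Statistics, Random

# Hypothetical stand-ins: simulate_data(rng) draws one dataset from the true model,
# and fit_and_score(data) runs a sampler and returns (ess_per_sec, rmse) for that dataset.
function benchmark_over_datasets(simulate_data, fit_and_score; ndatasets = 100, rng = MersenneTwister(1))
    ess_per_sec = Float64[]
    rmse = Float64[]
    for _ in 1:ndatasets
        data = simulate_data(rng)
        e, r = fit_and_score(data)
        push!(ess_per_sec, e)
        push!(rmse, r)
    end
    # Report the distribution, not a single lucky (or unlucky) draw.
    summarize(x) = (median = median(x), q10 = quantile(x, 0.1), q90 = quantile(x, 0.9))
    return (ess_per_sec = summarize(ess_per_sec), rmse = summarize(rmse))
end
```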

1 Like

Thanks for the feedback. @anon75146577, it’s good to hear from a member of the UofT community. Unfortunately, I had to submit the conference paper already, but I will keep your comments in mind when revising.

Thanks again for your suggestions regarding the Stan implementations we used to evaluate DynamicPPL and Turing. The latest results are not online atm as the respective paper is under review, but we will publish the results and implementations that we used for reproducibility asap.

To clear up a misconception I have read a few times about our paper: the aim of this work was not to misrepresent the efficiency of Stan. Quite the opposite is true; we intentionally evaluated against Stan because we consider it to be the most robust and efficient PPL. This is also mentioned in the paper and is the reason why we tried our best to write efficient Stan models to compare against – but failed to do so in some cases. Fortunately, we now have more efficient implementations thanks to the effort by @stevebronder.

At least the Turing team (including myself) consider Stan to be the most efficient PPL for static models.

Moreover, we don’t think that Turing.jl is faster on static models than Stan. Not sure where this sentiment comes from; our paper clearly states that this is not the case in general. I also don’t think that people use Stan as a benchmark punching bag; if anything, Stan is considered the most competitive baseline that exists.

I can only talk for the Turing project here. In our case, the code base is almost exclusively developed by passionate people in their spare time. No one is developing for Turing during his/her PhD. We are all from different backgrounds (and continents) and are simply passionate about providing a robust PPL for the Julia community. Just like the Stan team, we are dependent on third-party funding to support our efforts.

We had similar discussions in our team meetings and we heavily rely on Google Summer of Code (GSoC) and Google Season of Docs (GSoD) students. Maybe this could be an option for the Stan team too.

You can use R and Python almost seamlessly in Julia (and vice versa) through bridges. Julia is a bit like a glue language that interoperates nicely with all kinds of existing languages.

6 Likes

Just to throw in some opinion as a non-stats, non-numerics astrophysics person: I came to Stan from a world where likelihoods are used improperly (everything is an L2 norm) and ad hoc methods for everything, from two-distribution tests to custom ND-KS-test-based MCMC, get used frequently. No one tests the validity of these approaches in the literature, and no one is particularly worried about whether inferences are proper as long as you get the hype-inducing kick of so-called discovery.

Some of these methods are “fast.” But I’ve messed around a lot with different PPLs/samplers and many of them give you incorrect answers for even the simplest models. I spent some time away from Stan playing with PyMC3. I was not well versed in Stan and thought I could more easily implement some of the data-specific models (instrument response, etc.) there in a Pythonic way. Have you ever tried for loops in those Theano-based PPLs? There were some speed differences, but I was frustrated at PyMC3 not accurately fitting simulated data. I recoded in Stan and it worked. It came down to numerical stability. Maybe that has changed, but it was a big lesson to me. I also didn’t like the “freedom” to combine different MCMC methods on different parameters. HMC this, and MCMC that, and slice this. Where was the assurance any of this would lead to robust inference?

While it may not be universal, I find most of these packages preaching features over fidelity. Speed over understanding. Hype over robustness.

As I focused on learning Stan, I learned to write rather complex models (coming from a C++ background helps) and this is no longer an issue for me, but it does present a barrier to some. I’ve heard people developing some Julia-based HMC frameworks (not Turing) say “you can’t do anything complex in Stan.” This may be an advertising problem. I’m also still rather a novice at what makes efficient Stan code, and while the documentation helps, it is a lot to swallow.

But again, I’m not here for speed, I’m here for robustness, and robustness takes time to understand… hype will NOT solve that. One of the best talks about Stan is @Bob_Carpenter’s talk about the low-level parts of the backend: https://youtube.com/watch?v=6NXRCtWQNMg. It’s this type of care that brings people who want to do things right to Stan.

I would love faster models, but I would rather have Stan focus on being correct and me learn to write better models than have Stan focus on hype. The others who come to Stan are likely coming for the same reasons I did: robustness and a community that is more focused on good model building than on the latest GPU specs that will blow away a TensorFlow model on a toy data set.

Stay away from hype and the user community will be the one you want.

11 Likes

I agree entirely: robustness and reliable inference should be the primary concern of any PPL.

The Turing team has only recently focused on building a more efficient infrastructure, after we got criticism about performance based on sometimes wacky benchmarks by others. Our main concern was, and still is, to provide a robust library that is well embedded in the Julia ecosystem.

That said, speed is a major concern for many people, and it should be taken very seriously—especially given the new applications that are often concerned with large-scale data.

I believe that a joint effort of the different PPL teams can substantially help to keep improving the respective systems in terms of reliability and speed. No one is looking for a wacky PPL just to use new algorithms, but ignoring recent advancements is also not the way forward.

3 Likes

I think what should be addressed there is that Stan is a bit of a closed world, which has its good and bad points. The good point is that it’s much easier to know what the universe of available functions is, which makes it easier to document. The downside is that you don’t have an ecosystem with 10,000 developers contributing functionality. With Julia’s Turing.jl, we from SciML were demonstrating that Turing.jl + Julia’s DifferentialEquations.jl could be much more performant than Stan, which is not an indictment of Stan and its AD practices, but rather an indictment of the ODE solvers that were chosen. From looking at the files, it looks like there are really just two people maintaining that portion, and it’s just calling into Eigen and Sundials. While those are respectable solver libraries, we have shown repeatedly in the SciMLBenchmarks that there are usually alternative solvers that are more efficient than dopri5, adams, and bdf (those are quite old methods! None of them are even A-stable above order 2!), among other issues, like Sundials’ Adams method being quite finicky about reaching tolerance.

So our benchmark demonstrating advantages of Turing+DifferentialEquations.jl over Stan (which BTW was not performed by the Turing developers) is more about demonstrating what happens when you mix a lively developer ecosystem (~20 active developers across SciML at a given time) with a full, lively PPL ecosystem. I don’t think anyone has really sat down and optimized exactly how the ODE solvers of Stan will work on different subsets of problems, but with DifferentialEquations.jl there are >300 solvers covering different cases in optimal ways, along with adaptive high-order implicit methods for SDEs, adaptive high-order implicit methods for DDEs, etc. These are all extremely well tested (the benchmarks are fairly thorough, are updated roughly monthly through automated systems, and the CI takes >50 hours total if you count all of the convergence and regression testing going on). The Turing.jl community gets this for free by integrating with a language-wide AD system.
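For readers who haven’t seen that combination, here is a condensed sketch along the lines of the public Turing.jl Bayesian differential equations tutorial; the Lotka–Volterra setup, priors, and noise model below are illustrative placeholders rather than the benchmark code itself.

```julia
using Turing, DifferentialEquations, LinearAlgebra

# Lotka–Volterra right-hand side (out-of-place form f(u, p, t)).
function lotka_volterra(u, p, t)
    x, y = u
    α, β, γ, δ = p
    return [α * x - β * x * y, δ * x * y - γ * y]
end

prob = ODEProblem(lotka_volterra, [1.0, 1.0], (0.0, 10.0), [1.5, 1.0, 3.0, 1.0])

# Illustrative model: priors and the observation noise are placeholders.
@model function fit_lv(data, prob, ts)
    σ ~ truncated(Normal(0, 1), 0, Inf)
    α ~ truncated(Normal(1.5, 0.5), 0, Inf)
    β ~ truncated(Normal(1.0, 0.5), 0, Inf)
    γ ~ truncated(Normal(3.0, 0.5), 0, Inf)
    δ ~ truncated(Normal(1.0, 0.5), 0, Inf)
    # The ODE solve sits inside the model; AD propagates through it.
    predicted = solve(prob, Tsit5(); p = [α, β, γ, δ], saveat = ts)
    for i in eachindex(ts)
        data[:, i] ~ MvNormal(predicted[i], σ^2 * I)
    end
end

# chain = sample(fit_lv(data, prob, ts), NUTS(0.65), 1000)
```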

When we are timing Stan vs Turing in terms of differential equations, this kernel advantage is what’s demonstrated, and again it’s not an indictment of Stan but rather an issue of focus, expertise, and language: it doesn’t look like a horde of Stan developers wakes up every morning to optimize differential equation solvers and improve their accuracy, but in Julia that has been the case for a fairly long time now.

I think the main issue with Stan in this department is really just that the kernels are all baked in. Seeing an FFI (Foreign Function Interface) in Stan’s future would be nice, because then external developers like us could add new differential equation solvers to Stan by making it call into Julia. There’s nothing wrong with Stan’s design or Stan’s mission; it’s just not a mission that means you have 10 choices of GPU-accelerated preconditioners for a CFD simulation, and so Stan will not benchmark well in those regimes simply due to the implemented kernels. PPLs like Turing have a different mission of allowing a whole general-purpose language to be extended in probabilistic ways and to integrate with language-wide automatic differentiation, which in turn makes it do well on things which are not part of the traditional sphere of Bayesian inference.

It’s not an advertising problem. Unless the docs and the GitHub repo are mistaken, Stan doesn’t have a DDE solver, so External delay differential equation solver is a real issue to somebody. It doesn’t have an SDE solver, so Reverse mode SDE idea is a real issue to somebody. It probably doesn’t intersect with the vast majority of Stan users (Stan users probably pre-select away from being the core differential equation modeling crew), but I think calling widely used and effective models just pure hype is a bit disingenuous to people’s research that doesn’t fall into the currently available Stan kernels.

I think it’s fine to just say that doesn’t fall into the general mission of Stan: Stan does what it does well, and people respect that.

9 Likes

Just to clarify a point. This is the new ODE interface that was added basically two weeks ago and was done by Ben. Many more devs have contributed to the old ODE interface.

4 Likes

A developer of Turing.jl here. I would love to see a community-wide extensive benchmark started by the Stan community that punches back every PPL out there ;) It’s hard to be an expert in optimizing models in every PPL. Naturally, one has to start the benchmarks somewhere. Those benchmarks will by definition make claims about the performance of various PPLs; some will be slower than others. You just can’t make everyone happy!

Now if you think inaccurate claims are being made about your favorite PPL, I think the healthy thing to do in this case is what @stevebronder did in the Turing vs Stan case which is contributing optimizations to the benchmark. I think if everyone has good intentions, those contributions will be warmly received. They certainly were in the Turing case. And at the end of the day, I think it’s important to keep in mind that no one is trying to take anyone down. It’s natural to want the PPL you are developing to be better in every way but that’s just not possible sometimes. So imo the ultimate goal of these benchmarks should be to identify strengths and weaknesses of different PPLs. The weaknesses can be worked on and improved and the strengths can be highlighted to attract users and funding. This is how the game works!

Finally, I can’t stress enough how much respect the Turing devs have for Stan as a PPL and the Stan devs and community. In many ways, your work has revolutionized the field of probabilistic programming and has inspired many people to join the field so for that we are grateful :)

13 Likes

Chris, thanks for following up about the FFI topic. For my present project, Stan is what my collaborators require and is the way to go because of the easy log likelihood calculation, so when time is available, I’d definitely like to contribute to the state of the differential equation solvers in Stan (which would also help me finish my projects and, alas, finally allow me to graduate). There is some additional ongoing discussion about figuring out differential equation development priorities in Stan here: Stan differential equation functionality development priorities.

Because the FFI situation was unknown, I had a growing interest in contributing to an auto-switching ODE solver, which would probably be the algorithm that would be most beneficial to my own projects. However, we did weigh some pros and cons of building solvers versus putting resources instead to forward an FFI. If the Stan devs do decide to put resources toward an FFI after further discussion, efforts towards building native solvers in Stan Math would be a less efficient and redundant goal.

3 Likes

Sorry, I did not mean to imply that there is not principled modeling outside of Stan. Speed of course matters, but my experience is similar to those above that parameterization almost always gains more speed than facilities. Of course, they work together and coordination of the guts of these things is key.

It’s more about the direction of the Stan community, as @betanalpha has pointed out. At some point, there will be something better in terms of ease of coding for speed, coding to the silicon, etc. But I’ve seen in my field that speed-based approaches/methods/languages typically address correctness later. Again, this doesn’t have to be universal, but it’s a trend, and in the academic world it is a very bad one for early students to learn.

The ODE stuff is a relevant point for sure. Even in astro we do not see such a need for that, but this is highly field dependent and that type of stuff is definitely worth investing in. I wonder, as I do not know, whether the approaches in Stan were chosen based on numerical stability, etc. One can certainly autodiff through all of Julia, but does it translate to stability for free? I remember being frustrated that Stan couldn’t do high-D interpolation out of the box… but you begin to understand that this is a very bad thing to do with HMC in the first place. In PyMC3 you can sample integers… not a good thing…

Off topic: how many of the things that come for free in JITted languages do not require some of the carefulness that goes into delicate low-level numerical code? I’m asking out of ignorance.

3 Likes

Just wanted to say I appreciate everyone taking the time to chime in here. Thanks to the Stan users and devs as well as the Julia and Turing folks!

11 Likes

I will take a crack at responding here but I will try to be brief not to derail the original discussion too much. I can speak for Julia and Turing. First of all, nothing in Julia is completely for “free” but the asymptotic cost is kind of free because you simply have to pay the cost of developing a library once. Then this library can be used along with an exponential number of combinations of packages with no additional development cost. This talk describes why this super-synergy happens in Julia but not in a language like C++ for example. What this means is that I can run an ODE solver using a high (or low) precision fancy number type on the GPU and embed it in a Turing model to run MCMC on the ODE parameters or perform MLE or MAP using an optimization package without the developers of the ODE solver, the GPU package, the fancy number type, the optimization package or Turing ever trying to make their packages compatible with each other, in the ideal case. In reality, there are a few things to keep in mind when developing a package to make it sufficiently “generic” but they are not many and can be taught to beginners.

What the above synergy does is create a system where a clear separation of concerns tends to happen. The ODE folks can develop and thoroughly test ODE solvers, the number type folks can develop and test number types, the optimization folks can develop and test optimization algorithms, etc. Everyone needs to make sure their implementations are correct, numerically stable, fast, etc. So the low-level care is there when developing each package, but the package is only developed once and then re-used in more ways than the author ever imagined. I hope that answered your questions.
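A toy illustration of that separation of concerns: a function written once, with no particular number type in mind, works with Float64, BigFloat, and ForwardDiff dual numbers, even though none of those packages know about each other (the only assumption here is that ForwardDiff is installed).

```julia
using ForwardDiff

# A "library" function written once, with no reference to any particular number type.
normsq(v) = sum(x -> x^2, v)

normsq([1.0, 2.0])                        # Float64
normsq(big"1.0" .+ [0.5, 0.25])           # BigFloat: higher precision for free
ForwardDiff.gradient(normsq, [1.0, 2.0])  # dual numbers: gradients for free
```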

This reply turned out longer than I had liked. To the admins, feel free to move this comment to a separate discussion if you think it’s off-topic.

5 Likes

We’ve been using Stan for DDE and SDE problems for years now, and the difficulty is always parametrizing the model for good posterior geometry.

This is quite the same point that you are making in your own SciML and UDE papers, namely that better model structure is the way forward for machine learning.

Having Stan make use of Julia solvers is a matter of plumbing in the software, either at the Stan math library level or in the compiler (e.g. targeting Turing.jl).

1 Like

I’ll second that. Sorry I missed the last month of conversation somehow!

I appreciate long-form responses, so thanks!

The result should still give you the right stationary distribution asymptotically. As far as assurances, I think the best you can do is simulation-based calibration for the algorithm and then posterior predictive checks for the fit.
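For concreteness, here is a rough outline of the simulation-based calibration loop, with `prior`, `simulate_data`, and `fit_posterior` as hypothetical stand-ins for the model and algorithm under test; the check is that the rank statistics come out uniform.

```julia
# Hypothetical pieces: prior (a distribution), simulate_data(θ), and
# fit_posterior(data, ndraws) returning roughly independent posterior draws of θ.
function sbc_ranks(prior, simulate_data, fit_posterior; nreplications = 500, ndraws = 100)
    ranks = Int[]
    for _ in 1:nreplications
        θ_true = rand(prior)                   # draw a parameter from the prior
        data = simulate_data(θ_true)           # simulate data given that parameter
        draws = fit_posterior(data, ndraws)    # run the algorithm being tested
        push!(ranks, count(<(θ_true), draws))  # rank statistic in 0:ndraws
    end
    return ranks  # should be approximately uniform on 0:ndraws if the algorithm is calibrated
end
```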

For example, PyMC lets you alternate discrete sampling with Gibbs with continuous sampling with HMC. That’s something we can’t do in Stan.

Our original design was to do just that, but we finally realized that discrete sampling is very problematic computationally (it doesn’t mix very well).

What would be nice to have is some applications of simulation-based calibration to see if it works. The problem we foresaw is that HMC is very sensitive to tuning step size and mass matrix to the geometry being sampled, but that changes every iteration with alternating discrete sampling.

We don’t know how to get around that.

Writing an efficient Stan program requires writing efficient and numerically stable arithmetic and also thinking about autodiff execution, which is unusual in its two passes, the second of which is interpreted.

I’ve become convinced that the benchmark we want is wall time to a minimum effective sample size of 200 or so. That’s what most inference demands. For those who need larger ESS, the important thing is ESS per unit time after warmup.
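One way to turn that into a number, sketched in Julia under the assumption that you already have per-parameter ESS values (from whatever diagnostics package you use) plus the warmup and sampling wall times; the extrapolation is deliberately simple-minded.

```julia
# Wall time to reach a target ESS, extrapolating linearly from the sampling phase.
# ess_per_param: vector of ESS values for the post-warmup draws actually kept.
function time_to_target_ess(ess_per_param, warmup_time, sampling_time; target = 200)
    worst_ess = minimum(ess_per_param)  # the slowest-mixing parameter dominates
    return warmup_time + sampling_time * (target / worst_ess)
end

# ESS per second after warmup, for people who need much larger effective sample sizes.
ess_per_sec(ess_per_param, sampling_time) = minimum(ess_per_param) / sampling_time
```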

What we find a lot of is people saying their iterations are faster than Stan without accounting for ESS or even whether they get the right answer.

Is there something specific here you think we should be investigating (other than recoding Stan in a different language)?

I’m not sure what you mean here, but some of the PPLs, like Edward, were explicitly motivated as algorithm test beds.

You mean that we require all the functions to be declared in the parser and documented in our documentation?

That’s 10,000 Julia developers overall? How many write the kind of differentiable code you can use in Turing.jl? I really like the basic idea and understand the argument for it.

My understanding is that it’s implemented something like the way “object oriented” code is implemented in R through just overloading functions of the correct signature rather than defining methods on objects directly as in other OO systems like C++ or Java. It’d be super cool if you had a lot of candidates that had robust and efficient autodiff.

How does someone know what’s available as a function in Turing.jl?

There’s the same issue with systems like PyMC3. They’re not really just importing any old Python code any more than Stan can import any old C++ code. It has to be tailored to their Theano back end.

How many people manage to do that? That’s what I was asking when I was asking how many of those 10K developers add robust gradients to their packages? We’ve definitely taken the more Apple-like approach on purpose to avoid shifting this burden to our users and to be able to control the overall experience. We know that becomes limiting in terms of functionality.

In case this wasn’t clear, we’re doing exactly the same thing in using tried and true ODE integrators that others have developed. We’ve evaluated a bunch of systems that people have told us were better than what we’re using, but we haven’t found ones that were better than what we have. I’m not saying Julia doesn’t have better solvers, just that the ones we’ve evaluated in C++ haven’t been great. It’s clear Julia devs think Julia has the best ODE solvers, so one option would be to translate those to C++ for inclusion in Stan. I would think that would result in faster code in the end.

Speaking of which, how many of Julia’s ODE solvers are written in Julia and how many are FFI from C/C++ or Fortran or something else? I have trouble with the docs for packages like Julia’s that list 200 choices of ODE solver—I’m not enough of an expert to wade through all the options.

I guess what I’m saying here is that having too many choices can also overwhelm new users, though many of them just cut-and-paste what their labmates did last year (which means no discovering new functions, but also hopefully not making poor choices).

We have had to think about adding sensitivities, though, since the popular off-the-shelf solvers don’t supply accurate versions of those.

@yizhang has been working on that for PDE solvers, but I’m not sure if he’s looking at a general FFI or just something that’ll allow C++ on the back end.

Right now, we only allow C++ extensions by users.

What’s a CFD simulation?

Right. The stats people don’t do a lot of diff eq solving and the diff eq people don’t do a lot of stats. I work with a bunch of diff eq solving specialists now in my new job.

The Stan developers understand quite well that there’s a vast array of useful models that aren’t so easy to code in Stan. SDEs are one of the things we have a grant at the moment to address. Neural networks are the 800 pound gorilla in this room, and we also don’t have effective ways of fitting those. Systems like TensorFlow Probability really lean heavily on their integration with good NN libraries. (I have no idea where Turing.jl is in that.)

It’s not that, it’s the point you made before—we don’t have 10,000 developers! We’re more than happy to take PRs around all kinds of new models we don’t currently fit efficiently or easily.

It’s come a long way since @syclik and I hacked together the first version that just autodiffed through the RK45 algorithm in Boost!

We’re certainly not trying to do that. We’re not even getting entangled in most of these evaluations because we usually can’t get people to do the evaluations we care about.

My advice is to do what we did for our autodiff paper on arXiv—write to all the developers of those other systems with your benchmark code and ask them if there’s a way to improve it before publishing misleading benchmarks. It’s a lot of work, which is why we usually don’t do it!

I agree completely. The TensorFlow Probability folks did a fairly thorough benchmark recently that didn’t make me cringe (which is saying something, given the number of bad evals I’ve seen in papers and email).

@mans_magnusson has a repo with our current thinking about developing benchmark examples that can be used by other systems. It has the model definitions and 10K draws with an effective sample size of roughly 10K. It’s not yet an official Stan project—it’s still in Mans’s personal repo. Looks like Julia has MCMCBenchmarks.jl. It looks like it is somehow tied to McElreath’s book Statistical Rethinking.

Thanks for letting us know. The paper was very dismissive.

That’s the kind of thing that’d be great to add to Stan. All we need to communicate are parameter values/data in and the solutions and sensitivities out, so an interface wouldn’t be too hard.

We’ve tried to be very careful about this and in some cases trade speed for robustness. For example, @Adam_Haber just did a nice comparison of the correlation matrix Cholesky factor transform and derivatives. It turns out TensorFlow’s bijector used a different approach that was faster and worked fine with a lot of data, but was less robust with small data situations. So we decided not to change how we coded that.

Plumbing is hard! Especially if you want things to be easy to install and use and robust in operation across operating systems and platforms.

I think it helps to work on both fronts. Better algorithms aren’t enough by themselves, but they’re an orthogonal concern to better model parameterization (better for HMC meaning a posterior that’s closer to a standard normal).
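As a concrete (if well-worn) example of what better parameterization means, here is a centered vs. non-centered hierarchical scale sketched in Turing.jl, since this thread already has Julia snippets; the same trick is standard practice in Stan, and the priors here are only illustrative.

```julia
using Turing

# Centered: theta ~ Normal(mu, tau) directly; the posterior can develop a funnel HMC struggles with.
@model function centered(y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Normal(0, 5), 0, Inf)
    theta ~ filldist(Normal(mu, tau), length(y))
    for i in eachindex(y)
        y[i] ~ Normal(theta[i], sigma[i])
    end
end

# Non-centered: sample standardized z and rescale, so the sampled posterior is closer to standard normal.
@model function noncentered(y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Normal(0, 5), 0, Inf)
    z ~ filldist(Normal(0, 1), length(y))
    theta = mu .+ tau .* z
    for i in eachindex(y)
        y[i] ~ Normal(theta[i], sigma[i])
    end
end

# chain = sample(noncentered(y, sigma), NUTS(), 1000)
```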

6 Likes

Computational Fluid Dynamics. What I’ve done for PDEs is a proof of concept; more realistic work has been started by @IvanYashchuk in his PETSc project, Using PETSc library together with Stan. I don’t think it’s meaningful to jump into a large-scale 3rd-party simulator before we can clean up Stan’s MPI framework, as @IvanYashchuk has realized.

That’s a PyMC3/Theano issue, mostly because the “standard library” in that case conflicts with numpy usage. Theano, PyTorch, JAX, and TensorFlow have this issue because they have a sublanguage that is separate from how the vast majority of packages are implemented, and only packages implemented with semantics defined in that sublanguage (which is not even close to Python) are differentiable. That’s not a Turing.jl issue. In general, generic algorithms will work in Turing.jl because differentiability happens on the language, its AST, and its original functions as defined in Base and the standard libraries, not a sublanguage.

DifferentialEquations.jl was found to be compatible because someone tried it, not because someone actively made it work (their success is actually how I got into the field), and it’s not the only example of that occurring. We actually have a paper on this topic where we slap together rootfinders and ODE solvers to get a trebuchet, use a quantum circuit library (Yao.jl), and use the SDE solvers, all directly differentiated. A lot of this universal differential equation work is doing its next demonstrations on things like the Clima climate model, which is differentiable simply by existing. A lot of examples include libraries like BandedMatrices.jl, which were written before the AD libraries and not changed for the AD libraries, but are compatible with AD libraries because of how it all works on generic dispatches.

So yes, you can’t take “any” old code since there will be limits (mutation is the biggest limitation), but “most” libraries we use are autodiff compatible without the authors even knowing it, so yes there are thousands of people working on libraries that you can use in Turing.jl. Try it!
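A tiny concrete case of that: an iterative routine written with no AD in mind differentiates cleanly with ForwardDiff purely because it is generic Julia code (the only assumption is that ForwardDiff is available).

```julia
using ForwardDiff

# Babylonian square root: a generic iterative algorithm, written without AD in mind.
function babylonian_sqrt(x; iters = 10)
    t = (1 + x) / 2
    for _ in 2:iters
        t = (t + x / t) / 2
    end
    return t
end

babylonian_sqrt(2.0)                          # ≈ 1.4142136
ForwardDiff.derivative(babylonian_sqrt, 2.0)  # ≈ 0.3535534, i.e. 1 / (2 * sqrt(2))
```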

That’s a separate story… when features come for free, discoverability is a bit harder. DiffEqFlux.jl for example is basically a library which is just there to define a few helpers and document a ton of things that are compatible with differentiation, like neural graph differential equations, because without having a library to search for no one can find out they can do it (though the library itself did not have to be modified for the example to work). Because of this, probabilistic inference of neural graph differential equations will work in Turing.jl, though nobody has thought to do it, but hey free paper for whoever wants to. It is a very interesting question how to document features that are automatically generated by composability and not one that we really have a great solution to other than to write a billion tutorials.

The vast majority are pure Julia. Sundials.jl, ODEInterfaceDiffEq.jl, and LSODA.jl (and the benchmarking libraries MATLABDiffEq.jl, SciPyDiffEq.jl, and deSolveDiffEq.jl) introduce wrapped methods. The rest of the libraries are Julia libraries, and most of the methods are in OrdinaryDiffEq.jl and StochasticDiffEq.jl (DelayDiffEq.jl and StochasticDelayDiffEq.jl are then generated from those).

That’s why we have a default algorithm chooser: `solve(ODEProblem(f,u0,tspan,p))` will work in autodiff with forward and reverse modes, and it’ll choose the method for you along with stiffness detection and all of that. The rest is details for people who want to optimize.
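Spelled out end to end with a throwaway ODE, that default-chooser usage looks roughly like this (standard DifferentialEquations.jl calls; the decay equation is just an example):

```julia
using DifferentialEquations

# Exponential decay, out-of-place form f(u, p, t).
f(u, p, t) = -p * u

prob = ODEProblem(f, 1.0, (0.0, 5.0), 0.5)
sol = solve(prob)           # no algorithm given: the default chooser picks one, with stiffness detection
sol = solve(prob, Tsit5())  # or pick explicitly when you want to optimize
sol(2.0)                    # dense-output interpolation at t = 2.0
```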

Computational Fluid Dynamics, like https://github.com/CliMA/Oceananigans.jl for uncertainty quantification of large-eddy simulations and climate models.

Turing.jl has high-order adaptive implicit methods because it supports the Julia language. So those (https://diffeq.sciml.ai/stable/solvers/sde_solve/) can be used in Turing. It’s part of the Bayesian Differential Equations tutorial, and it will update in a few days to this new version which includes delay differential equations. Stochastic delay differential equations then of course follow by combining the two, and again that’s back to that discoverability problem when features work by default, but that’s a little niche for a tutorial in a PPL library.

Yeah, the hardest part of the plumbing in that case is that the sensitivity equations for differential equations themselves are a differential equation defined by AD calls, meaning that we cannot treat the ODE/SDE/DDE/DAE/SDDE/… RHS function as a black-box (if we want to do it efficiently). We need some way to swap back to a native AD for that RHS. We’re trying to work that out for integrating with PyTorch right now, but it’ll take some work. And we’ll want to do the same with Stan.
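For concreteness, these are the forward sensitivity equations in question: for $u' = f(u, p, t)$ with sensitivities $s_j = \partial u / \partial p_j$, the augmented system needs the Jacobians of $f$, which is why the RHS cannot stay a black box if you want efficiency.

$$
\frac{ds_j}{dt} = \frac{\partial f}{\partial u}\, s_j + \frac{\partial f}{\partial p_j},
\qquad s_j(t_0) = \frac{\partial u_0}{\partial p_j}.
$$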

3 Likes

+1 to this. Another good example is the data.table benchmarks, where they continuously updated their benchmarks and contacted all the other package developers to make sure the code they were testing is correct/performant. I’d really love to use posteriorDB for something like this.

I really like this. I’ve also harped at folks before about the “hurdle rate” as a good measure (given N warmup and M samples, what is your ESS?). Warmup is the real mind-killer!

For testing autodiff I think you really just need to rip away all the other modeling stuff and just look at it more like microbenchmarks on the forward and reverse pass.
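A bare-bones version of that kind of microbenchmark, assuming ForwardDiff, Zygote, and BenchmarkTools are available; the test function is arbitrary and only the gradient passes are timed.

```julia
using BenchmarkTools, ForwardDiff, Zygote

# An arbitrary test function; what matters is timing the passes, not the model around them.
f(x) = sum(abs2, x) + sum(sin.(x))
x = randn(1000)

@btime ForwardDiff.gradient($f, $x)  # forward mode
@btime Zygote.gradient($f, $x)       # reverse mode (forward sweep + reverse sweep)
```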

3 Likes

Also, I won’t speak for @mans_magnusson, but I really like the idea of reaching out to other PPL developers and pitching posteriorDB as a good solution for testing across languages. idk how ready it is for that, but if we want something everyone uses then I think it would be good to get feedback from other PPLs.

2 Likes

Didn’t expect to see CliMA here. I learned CUDA coding for ocean modeling from Frank Giraldo, who’s also on the CliMA core team. IIRC Oceananigans is just an incompressible CFD solver. It doesn’t come with UQ capability, and atmospheric modeling is handled by other CliMA components. So it’s a stretch to say it does UQ of climate modeling; it is about as relevant to Stan’s capability as saying C++ can do CFD.

Hi!

It is one of the issues I am trying to solve right now. The main problem is that we want to be sure that all models included from different PPLs are exactly the same (give the same/proportional log density for the same dataset), and this needs to be done using continuous integration with Travis.

We started with TFP and PyTorch but there are now some discussions on adding Julia PPLs as well. So, yes, I fully agree with you!

2 Likes