Addressing Stan speed claims in general

I think a page or section linking to Michael Betancourt’s case studies and guides is the way to go as an initial medium. This could be an expansion of the “Case Studies and Notebooks” section of the Documentation page. There could also be a separate “Case Studies and Examples” tab added to the website navigation bar, which I think would make things more accessible. But yes, since things are heading off-topic, I’m going to create a new thread for discussion about website modifications.

There is in fact a case study on Stan’s website that covers the tricks we used and principles we applied, as well as an arXiv preprint with the same content :) See in particular section 3. Mostly it’s a matter of peeking inside the black box and understanding where the computational bottlenecks come from.

Another great resource is StanCon submissions. It’d be great to have some indexing method; that way we could easily pull up papers on ODEs, reparameterization, etc. Maybe we could require keywords on submitted notebooks (tagging @yizhang, who is a StanCon organizer).

+1 to that.

And that!

It can be. The question is whether we run attack ads against all the systems claiming they’re faster, or just stick to our own message. I’m totally OK with the latter.

I think everyone agrees that speed’s important.

This is where we part ways. I think we’ll just look like crybabies or bullies if we try to directly counter claims that system X is faster than Stan.

Yikes. I’m sorry you feel that way. Do you have an alternative for fitting these models? This is how we build the most complex models we fit, and we honestly don’t know of other ways to fit them.

The mixture models chapter of the users’ guide, where these models are discussed, literally says that they can’t be fit.

This is where I’ve disagreed with people in the past. I don’t think it’s worth going after directly. And I also think we run just as much risk of using skewed examples and being unfair to the competition.

That’s where we’re disagreeing. I don’t think we should directly take on these other systems by communicating that Stan is really faster for problem X. It’s a slippery slope.

Why not blog the benchmarks then?

The harder question is how we fare vs. TensorFlow. Which hardware do you use to evaluate? Which models? It’s just a mess.

A blog would be great. We’ve thought about it from time to time, but always bailed because nobody was willing to commit to anything beyond a “hello world” post, and it seems like a moribund blog is worse than no blog at all.

It would have to be funded, because nobody has time to do this kind of thing regularly, unless we could somehow get people to share the writing. If we do try to fund it, the challenge will be finding someone qualified to do it who isn’t already overwhelmed by their work on Stan.

It’s not just that, it’s who they want to ship code to. I can learn Julia, but then who do I work with when everyone around me uses R or Python?

Glad it’s appreciated. We worked very hard on this in the early days.

This is exactly the kind of confusion @betanalpha is talking about. We’ve had endless discussions trying to explain that speed per iteration is meaningless, but it’s a subtle point.

It’d be interesting to run multiple chains and use Stan’s ESS estimator here. It discounts for non-convergence of chains.
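
For instance, here’s a minimal sketch of that kind of check using ArviZ, which implements the same style of rank-normalized split-chain ESS and R-hat diagnostics (Vehtari et al. 2021) that Stan uses; `draws` is just a placeholder for real sampler output:

```python
import numpy as np
import arviz as az

# Placeholder draws shaped (chains, draws); swap in real sampler output.
rng = np.random.default_rng(1)
draws = rng.normal(size=(4, 1000))

# Bulk ESS is discounted when the chains disagree with each other;
# R-hat values much greater than 1 flag non-convergence.
print(az.ess(draws, method="bulk"))
print(az.rhat(draws))
```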

You can also simulate data and measure squared error of estimates and back out the equivalent ESS so you don’t have to rely on the noise in the ESS estimates themselves.

Hey, can we quote you on that? That’s been our target all along.

No. For all the reasons @maedoc pointed out.

Indeed, that’s a great testimonial!

Please do: good engineering in academic software (or in software related to or coming from academia) is a precious commodity!

@Bob_Carpenter I’ve since done a fair bit of work on these comparisons. I now compare models on two datasets and a synthetic dataset, using the ESS calculations from Stan for the comparison software as well. I was running Stan with multiple chains, but the comparison software only runs one chain at a time, so I switched to a single-chain comparison. I have a conference paper submitted with the comparisons (noting that Stan with multiple chains would give roughly (number of cores) × the number of samples in the same wall time).

I also compare VB in Stan against a custom-coded model using the synthetic dataset from the authors I referenced in an earlier post. I can’t reproduce the reported results with their Python code. Stan’s VB does very well against their reported results, though. They use RMSE against the synthetic data as their measure (similar to some others in my field). The issue is that they run their MCMC for 100,000 draws, and NUTS reproduces the results long before that point. In fact, I can very quickly reproduce the synthetic-data results in Stan by running mean-field VB (which matches the mean parameters well) and then using the result as starting values for 2,000 NUTS samples (which improves the match on the covariance parameters).

I got help from @mike-lawrence here and my code is posted here.

That’s true for Stan and other MCMC systems only after all chains have converged. If R-hat >> 1 anywhere, then Stan’s ESS estimate is going to discount the multiple chains.

RMSE can be backed out into ESS, because the error distribution in estimation is normal(0, sd / sqrt(ESS)), where sd is the variable’s posterior standard deviation and ESS its effective sample size. We usually test parameters and predictions this way.
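
To make that arithmetic concrete, here’s a minimal sketch with made-up numbers:

```python
# Error of the posterior-mean estimate is roughly normal(0, sd / sqrt(ESS)),
# so an observed RMSE over many replications implies ESS ~ (sd / RMSE)^2.
posterior_sd = 2.0     # hypothetical posterior standard deviation
observed_rmse = 0.1    # hypothetical RMSE of the estimate vs. the true value

implied_ess = (posterior_sd / observed_rmse) ** 2
print(implied_ess)  # 400.0
```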

We like to report ESS per second after warmup. You can report ESS per iteration, but that’s hard to translate into practice. If the algorithms use the same calculations, you can scale by the number of log density and gradient evaluations (really, the gradient evaluations completely dominate Stan’s computation).
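
In code terms, the two normalizations look like this; all the numbers are hypothetical, and the per-draw leapfrog counts are what CmdStan reports in the n_leapfrog__ column:

```python
ess_bulk = 800.0            # hypothetical post-warmup bulk ESS
sampling_seconds = 12.5     # hypothetical post-warmup wall time
n_leapfrog = [31, 31, 63]   # hypothetical n_leapfrog__ values, one per draw

print(ess_bulk / sampling_seconds)   # ESS per second
# Each leapfrog step costs one log density and gradient evaluation,
# so this approximates ESS per gradient evaluation:
print(ess_bulk / sum(n_leapfrog))
```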

Cool. Thanks for citing and posting.

This whole thread’s full of good quotes. I’ve also found it easier in a lot of cases to just rebuild things in Stan. I always thought part of that is just that I know it really well as a dev, so of course it’s easier for me.

If a dataset is simulated, you should have many of them, and you should report statistics about the distributions of ESS/sec, RMSE, etc. This gives a lot more information and doesn’t unfairly penalise one method over another just because of bad luck with the simulated data. This type of test gives a lot of information about how well a method does across models that are appropriate to the data.
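
Concretely, something like this sketch, where simulate_data and fit_and_score are hypothetical stand-ins for your own data generator and for fitting one method and returning its metrics:

```python
import numpy as np

def run_study(simulate_data, fit_and_score, n_replications=50, seed=0):
    # Fit one method on many simulated datasets and report the
    # distribution of its metrics rather than a single (un)lucky draw.
    # fit_and_score(data) is assumed to return (ess_per_sec, rmse).
    rng = np.random.default_rng(seed)
    results = [fit_and_score(simulate_data(rng)) for _ in range(n_replications)]
    ess_per_sec, rmse = map(np.asarray, zip(*results))
    for name, xs in [("ESS/sec", ess_per_sec), ("RMSE", rmse)]:
        q5, q50, q95 = np.percentile(xs, [5, 50, 95])
        print(f"{name}: median {q50:.3g}, 5-95% range [{q5:.3g}, {q95:.3g}]")
```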

Thanks for the feedback. @anon75146577, it’s good to hear from a member of the UofT community. Unfortunately, I had to submit the conference paper already, but I will keep your comments in mind when revising.

Thanks again for your suggestions regarding the Stan implementations we used to evaluate DynamicPPL and Turing. The latest results are not online atm as the respective paper is under review, but we will publish the results and implementations that we used for reproducibility asap.

To clear up a misconception I’ve read a few times in this thread about our paper: the aim of this work was not to misrepresent the efficiency of Stan. Quite the opposite is true; we intentionally evaluated against Stan because we consider it to be the most robust and efficient PPL. This is also mentioned in the paper, and it’s the reason we tried our best to write efficient Stan models to compare against, though we failed to do so in some cases. Fortunately, we now have more efficient implementations thanks to the effort by @stevebronder.

At least the Turing team (including myself) consider Stan to be the most efficient PPL for static models.

Moreover, we don’t think that Turing.jl is faster on static models than Stan. I’m not sure where this sentiment comes from; our paper clearly states that this is not the case in general. I also don’t think that people use Stan as a benchmark punching bag. If anything, Stan is considered the most competitive baseline that exists.

I can only speak for the Turing project here. In our case, the code base is almost exclusively developed by passionate people in their spare time. No one is developing Turing as part of their PhD. We are all from different backgrounds (and continents) and are simply passionate about providing a robust PPL for the Julia community. Just like the Stan team, we depend on third-party funding to support our efforts.

We had similar discussions in our team meetings and we heavily rely on Google Summer of Code (GSoC) and Google Season of Docs (GSoD) students. Maybe this could be an option for the Stan team too.

You can use R and Python almost seamlessly in Julia (and vice versa) through bridges. Julia is a bit like a glue language that interoperates nicely with all kinds of existing languages.

Just to throw in some opinion as a non-stats, non-numerics astrophysics person. I came to Stan from a world where likelihoods are used improperly (everything is an L2 norm) and ad hoc methods abound, from two-distribution tests to custom ND-KS-test-based MCMC. No one tests the validity of these approaches in the literature, and no one is particularly worried about whether inferences are proper as long as you get the hype-inducing kick of so-called discovery.

Some of these methods are “fast.” But I’ve messed around a lot with different PPLs/samplers, and many of them give you incorrect answers for even the simplest models. I spent some time away from Stan playing with PyMC3. I was not well versed in Stan and thought I could more easily implement some of the data-specific models (instrument response, etc.) there in a Pythonic way. Have you ever tried for loops in those Theano-based PPLs? There were some speed differences, but I was frustrated that PyMC3 wasn’t accurately fitting simulated data. I recoded the models in Stan and they worked. It came down to numerical stability. Maybe that has changed, but it was a big lesson for me. I also didn’t like the “freedom” to combine different MCMC methods on different parameters: HMC this, MCMC that, slice the other. Where was the assurance that any of this would lead to robust inference?

While it may not be universal, I find most of these packages preaching features over fidelity. Speed over understanding. Hype over robustness.

As I focused on learning Stan, I learned to write rather complex models (coming from a C++ background helps), and this is no longer an issue for me, but it does present a barrier to some. I’ve heard people developing some Julia-based HMC frameworks (not Turing) say “you can’t do anything complex in Stan.” This may be an advertising problem. I’m also still rather a novice at what makes for efficient Stan code, and while the documentation helps, it is a lot to swallow.

But again, I’m not here for speed, I’m here for robustness, and robustness takes time to understand… hype will NOT solve that. One of the best talks about Stan is @Bob_Carpenter’s talk about the low-level parts of the backend: https://youtube.com/watch?v=6NXRCtWQNMg. It’s this type of care that brings people who want to do things right to Stan.

I would love faster models, but I would rather have Stan focus on being correct, and learn to write better models myself, than have Stan focus on hype. The others who come to Stan are likely coming for the same reasons I did: robustness and a community more focused on good model building than on the latest GPU specs that will blow away a TensorFlow model on a toy dataset.

Stay away from hype and the user community will be the one you want.

I agree entirely: robustness and reliable inference should be the primary concerns of any PPL.

The Turing team has only recently focused on building a more efficient infrastructure, after we got criticism about performance based on sometimes wacky benchmarks by others. Our main concern was, and still is, to provide a robust library that is well embedded in the Julia ecosystem.

That said, speed is a major concern for many people, and it should be taken very seriously—especially given the new applications that are often concerned with large scale data.

I believe that a joint effort of the different PPL teams can substantially help to keep improving the respective systems in terms of reliability and speed. No one is looking for a wacky PPL just to use new algorithms, but ignoring recent advancements is also not the way forward.

I think what should be addressed there is that Stan is a bit of a closed world, which has its good and bad points. The good point is that it’s much easier to know what the universe of available functions is, which makes it easier to document. The downside is that you don’t have an ecosystem with 10,000 developers contributing functionality. With Julia’s Turing.jl, we at SciML were demonstrating that Turing.jl plus Julia’s DifferentialEquations.jl could be much more performant than Stan, which is not an indictment of Stan and its AD practices, but rather of the ODE solvers that were chosen. From looking at the files, it looks like there are really just two people maintaining that portion, and it’s just calling into Eigen and Sundials. While those are respectable solver libraries, we have shown repeatedly in the SciMLBenchmarks that there are usually alternative solvers more efficient than dopri5, adams, and bdf (those are quite old methods! None of them are even A-stable above order 2!), among other issues, like Sundials’ Adams being quite finicky about reaching tolerance.

So our benchmark demonstrating advantages of Turing+DifferentialEquations.jl over Stan (which, BTW, was not performed by the Turing developers) is more about demonstrating what happens when you mix a lively developer ecosystem (~20 active developers across SciML at a given time) with a fully fledged PPL ecosystem. I don’t think anyone has really sat down and optimized exactly how the ODE solvers of Stan work on different subsets of problems, but DifferentialEquations.jl has >300 solvers covering different cases in optimal ways, along with adaptive high-order implicit methods for SDEs, adaptive high-order implicit methods for DDEs, etc. These are all extremely well tested (the benchmarks are fairly thorough, are updated roughly monthly through automated systems, and the CI takes >50 hours total if you count all of the convergence and regression testing). The Turing.jl community gets this for free by integrating with a language-wide AD system.

When we time Stan vs. Turing on differential equations, this kernel advantage is what’s demonstrated. Again, it’s not an indictment of Stan but rather an issue of focus, expertise, and language: it doesn’t look like a horde of Stan developers wakes up every morning to optimize differential equation solvers and improve their accuracy, but in Julia that has been the case for a fairly long time now.

I think the main issue with Stan in this department is really just that the kernels are all baked in. Seeing an FFI (foreign function interface) in Stan’s future would be nice, because then external developers like us could add new differential equation solvers to Stan by making it call into Julia. There’s nothing wrong with Stan’s design or Stan’s mission; it’s just not a mission that gives you 10 choices of GPU-accelerated preconditioners for a CFD simulation, so Stan will not benchmark well in those regimes simply because of the implemented kernels. PPLs like Turing have a different mission of allowing a whole general-purpose language to be extended in probabilistic ways and integrated with language-wide automatic differentiation, which in turn makes them do well on things outside the traditional sphere of Bayesian inference.

It’s not an advertising problem. Unless the docs and the GitHub repo are mistaken, Stan doesn’t have a DDE solver, so “External delay differential equation solver” is a real issue to somebody. It doesn’t have an SDE solver, so “Reverse mode SDE idea” is a real issue to somebody. These probably don’t intersect with the vast majority of Stan users (Stan users probably pre-select away from being the core differential equation modeling crew), but I think calling widely used and effective models pure hype is a bit disingenuous to people whose research doesn’t fall into the currently available Stan kernels.

I think it’s fine to just say that this doesn’t fall within the general mission of Stan: Stan does what it does well, and people respect that.

Just to clarify a point. This is the new ODE interface that was added basically two weeks ago and was done by Ben. Many more devs have contributed to the old ODE interface.

A developer of Turing.jl here. I would love to see a community-wide extensive benchmark started by the Stan community that punches back at every PPL out there ;) It’s hard to be an expert at optimizing models in every PPL, and naturally, one has to start the benchmarks somewhere. Those benchmarks will by definition make claims about the performance of various PPLs; some will be slower than others. You just can’t make everyone happy!

Now if you think inaccurate claims are being made about your favorite PPL, I think the healthy thing to do is what @stevebronder did in the Turing vs. Stan case, which is contributing optimizations to the benchmark. If everyone has good intentions, those contributions will be warmly received. They certainly were in the Turing case. And at the end of the day, it’s important to keep in mind that no one is trying to take anyone down. It’s natural to want the PPL you are developing to be better in every way, but that’s just not possible sometimes. So imo the ultimate goal of these benchmarks should be to identify strengths and weaknesses of different PPLs. The weaknesses can be worked on and improved, and the strengths can be highlighted to attract users and funding. This is how the game works!

Finally, I can’t stress enough how much respect the Turing devs have for Stan as a PPL and for the Stan devs and community. In many ways, your work has revolutionized the field of probabilistic programming and has inspired many people to join the field, so for that we are grateful :)

Chris, thanks for following up about the FFI topic. For my present project, Stan is what my collaborators require and is the way to go because of the easy log-likelihood calculation, so when time is available I’d definitely like to contribute to the state of the differential equation solvers in Stan (which would also help me finish my projects and, alas, finally allow me to graduate). There is additional ongoing discussion about figuring out differential equation development priorities in Stan here: Stan differential equation functionality development priorities.

Because the FFI situation was unknown, I had a growing interest in contributing an auto-switching ODE solver, which would probably be the algorithm most beneficial to my own projects. However, we did weigh some pros and cons of building solvers versus putting resources toward an FFI instead. If the Stan devs do decide to put resources toward an FFI after further discussion, efforts toward building native solvers in Stan Math would be a less efficient and redundant goal.

Sorry, I did not mean to imply that there is no principled modeling outside of Stan. Speed of course matters, but my experience is similar to those above: reparameterization almost always gains more speed than faster computational facilities. Of course, they work together, and coordinating the guts of these things is key.

It’s more about the direction of the Stan community, as @betanalpha has pointed out. At some point, there will be something better in terms of ease of coding, speed, coding to the silicon, etc. But I’ve seen in my field that speed-focused approaches/methods/languages typically address correctness later. Again, this doesn’t have to be universal, but it’s a trend, and in the academic world it is a very bad one for early students to learn.

The ODE stuff is a relevant point for sure. Even in astro we do not see such a need for it, but this is highly field-dependent, and that type of stuff is definitely worth investing in. I wonder, as I do not know, whether the approaches in Stan were chosen based on numerical stability, etc. One can certainly autodiff through all of Julia, but does that translate to stability for free? I remember being frustrated that Stan couldn’t do high-dimensional interpolation out of the box… but you begin to understand that this is a very bad thing to do with HMC in the first place. In PyMC3 you can sample integers… not a good thing…

Off topic: how much of what comes for free in JITted languages comes without the carefulness that goes into delicate low-level numerical code? I’m asking out of ignorance.

Just wanted to say I appreciate everyone taking the time to chime in here. Thanks to the Stan users and devs as well as the Julia and Turing folks!

I will take a crack at responding here, but I will try to be brief so as not to derail the original discussion too much. I can speak for Julia and Turing. First of all, nothing in Julia is completely “free,” but the asymptotic cost is kind of free because you only have to pay the cost of developing a library once. Then this library can be used in an exponential number of combinations with other packages at no additional development cost. This talk describes why this super-synergy happens in Julia but not in a language like C++, for example. What this means is that I can run an ODE solver using a high- (or low-) precision fancy number type on the GPU and embed it in a Turing model to run MCMC on the ODE parameters, or perform MLE or MAP using an optimization package, without the developers of the ODE solver, the GPU package, the fancy number type, the optimization package, or Turing ever trying to make their packages compatible with each other, in the ideal case. In reality, there are a few things to keep in mind when developing a package to make it sufficiently “generic,” but they are not many and can be taught to beginners.

What the above synergy does is create a system where a clear separation of concerns tends to happen. The ODE folks can develop and thoroughly test ODE solvers, the number-type folks can develop and test number types, the optimization folks can develop and test optimization algorithms, etc. Everyone needs to make sure their implementations are correct, numerically stable, fast, etc. So the low-level care is there when developing each package, but the package is only developed once and then reused in more ways than the author ever imagined. I hope that answers your questions.

This reply turned out longer than I would have liked. To the admins: feel free to move this comment to a separate discussion if you think it’s off-topic.

We’ve been using Stan for DDE and SDE problems for years now, and the difficulty is always parametrizing the model for good posterior geometry.

This is much the same point that you make in your own SciML and UDE papers, namely that better model structure is the way forward for machine learning.

Having Stan make use of Julia solvers is a matter of plumbing in the software, either at the Stan math library level or in the compiler (e.g., targeting Turing.jl).
