Oh yeah, that’s really frustrating when people say that their Gibbs sampler is faster than Stan but then don’t provide effective sample size comparisons.
After seeing the growing library of beautiful writings from stats celebrity @betanalpha, I definitely think some more guides on the Stan website covering more specialized optimization topics would be a great thing. Writing about parameterizing and checking ODE systems would be useful to many, particularly those in fields adjacent to mathematical biology, where model comparison is a very active area of focus these days. For instance, @bbbales2 and @charlesm93’s optimization of the Stan COVID model was a pretty cool piece of work that could be expounded upon in greater detail as a case study on the website.
The challenge in optimizing Stan programs is that there’s only so much one can do within a given program. Most of the optimizations are more statistical: they involve techniques for finding a Stan program compatible with a given analysis circumstance, including the inferential goals and the available domain expertise.
@betanalpha Do you want to link to more of your case studies from the Stan website? They’re really good and it would get more eyes on them if we had links. I’m sure plenty of people (like @xiehw) are finding them already, but we could probably help other people find them more easily. Totally up to you of course.
Despite the promise of big data, inferences are often limited not by the size of data but rather by its systematic structure. Only by carefully modeling this structure can we take full advantage of the data – big data must be complemented with big models and the algorithms that can fit them. Stan is a platform for facilitating this modeling, providing an expressive modeling language for specifying bespoke models and implementing state-of-the-art algorithms to draw the subsequent Bayesian inferences.
Is the implication that Stan-based approaches should be used to fit big data problems? In my experience with Stan, data size is perhaps the single biggest contributor to performance degradation, and the relationship often looks exponential.
This is starting to take the thread off topic, so I suggest a new post if you want to start a separate discussion, but I’ll try to respond quickly so as not to leave the tangent dangling.
Most models will depend on the data size linearly, so the cost of each gradient evaluation will scale linearly. Consequently, a superlinear decrease in performance has to mean that the additional data is leading to nastier posterior density functions that push Hamiltonian Monte Carlo to work harder and use more gradient evaluations.
Adding perfect replications of a component measurement will typically make the likelihood function, and hence the posterior density function, more bell-curved and thus easier to handle. If your “big data” is actually asymptotic data then performance should increase superlinearly with increasing data, and you can verify this with simulations, starting with a model that is close to normal even for small data sets (see the sketch below).
In other words, the rapidly decreasing performance indicates that your data aren’t clean replications. They bring with them their own systematic structure, or the increased data allows you to better resolve systematic structures that were always there and hence need to be addressed. The only way to capture all of these effects and yield reasonable inferences is to incorporate them into the model, hence “big data must be complemented with big models”.
The reality of “big data” is messy, complex, structured data. And the only way to take full advantage of that circumstance is sophisticated models, which require tools like Stan.
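To make the simulation check concrete, here is a rough sketch of the kind of scaling experiment I have in mind. It assumes cmdstanpy and arviz, and a placeholder normal.stan containing a simple iid normal model with data { int<lower=0> N; vector[N] y; } and parameters mu and sigma; none of these names come from the thread, they are just one way to set it up.

```python
# Sketch: check how wall time and effective sample size scale as we add perfect
# replications of a simple measurement. Assumes cmdstanpy and arviz, and a
# placeholder normal.stan with data { int<lower=0> N; vector[N] y; } and
# parameters mu and sigma (y ~ normal(mu, sigma)).
import time

import arviz as az
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="normal.stan")  # hypothetical model file
rng = np.random.default_rng(1)

for n in [100, 1_000, 10_000, 100_000]:
    y = rng.normal(loc=1.0, scale=2.0, size=n)  # clean iid replications
    start = time.time()
    fit = model.sample(data={"N": n, "y": y}, chains=4, show_progress=False)
    elapsed = time.time() - start
    ess = float(az.ess(az.from_cmdstanpy(fit), var_names=["mu"])["mu"])
    print(f"N={n:>7}  wall time={elapsed:7.1f}s  ESS(mu)={ess:7.0f}  "
          f"ESS/sec={ess / elapsed:8.1f}")
```

If the data really are clean replications, ESS/sec should degrade no faster than the linear cost of the gradient evaluations; a much steeper drop points to the kind of unmodeled structure described above.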
Given the increased disagreements about best practices and modeling priorities within the community, I would personally prefer to keep all of my material on my own site so as not to give anyone the impression that it represents the opinions of anyone but myself. I would have no problem, however, with a link to my writing page, https://betanalpha.github.io/writing/, if people were interested.
That said, all of the case studies are BSD-3 and CC BY-NC 4.0 licensed, so reposting them on the Stan site wouldn’t technically require my permission, given proper attribution.
I think a page or section linking to Michael Betancourt’s case studies and guides is the way to go as a first step. This could be an expansion of the “Case Studies and Notebooks” section of the Documentation page. There could also be a separate “Case Studies and Examples” tab added to the website navigation bar, which I think would make things more accessible. But yes, since things are heading off-topic, I’m going to create a new thread for discussion of website modifications.
There is in fact a case study on Stan’s website that covers the tricks we used and the principles we applied, and also an arXiv preprint with the same content :) See in particular section 3. To some extent, it’s mostly a matter of peeking inside the black box and understanding where the computational bottlenecks come from.
Another great resource is StanCon submissions. It’d be great to have some indexing method; that way we could easily pull up papers on ODEs, reparameterization, etc. Maybe we could require keywords on submitted notebooks (tagging @yizhang, who is a StanCon organizer).
It can be. The question is whether we run attack ads against all the systems claiming they’re faster, or just stick to our own message. I’m totally OK with the latter.
I think everyone agrees that speed’s important.
This is where we part ways. I think we’ll just look like crybabies or bullies if we try to directly counter claims that system X is faster than Stan.
Yikes. I’m sorry you feel that way. Do you have an alternative for fitting these models? This is how we build the most complex models we fit, and we honestly don’t know other ways to fit them.
The mixture models chapter of the users’ guide, where these models are discussed, literally says that they can’t be fit.
This is where I’ve disagreed with people in the past. I don’t think it’s worth going after directly. And I also think we run just as much risk of using skewed examples and being unfair to the competition.
That’s where we’re disagreeing. I don’t think we should directly take on these other systems by communicating that Stan is really faster for problem X. It’s a slippery slope.
Why not blog the benchmarks then?
The harder question is how we fare vs. TensorFlow. Which hardware do you use to evaluate? Which models? It’s just a mess.
A blog would be great. We’ve thought about it from time to time, but always bailed because nobody was willing to commit to anything beyond a “hello world” post, and it seems like a moribund blog is worse than no blog at all.
It would have to be funded, because nobody has time to do this kind of thing regularly, unless we could somehow get people to share in the writing. If we do try to fund it, the challenge will be finding someone qualified to do it who isn’t already overwhelmed by their work on Stan.
It’s not just that, it’s who they want to ship code to. I can learn Julia, but then who do I work with when everyone around me uses R or Python?
Glad it’s appreciated. We worked very hard on this in the early days.
This is exactly the kind of confusion @betanalpha is talking about. We’ve had endless discussions trying to explain that speed per iteration is meaningless, but it’s a subtle point.
It’d be interesting to run multiple chains and use Stan’s ESS estimator here. It discounts for non-convergence of chains.
You can also simulate data and measure squared error of estimates and back out the equivalent ESS so you don’t have to rely on the noise in the ESS estimates themselves.
Hey, can we quote you on that? That’s been our target all along.
@Bob_Carpenter I’ve since done a fair bit of work on these comparisons. I now compare models on two datasets and a synthetic dataset, using the ESS calculations in Stan for the comparison software as well. I was running Stan with multiple chains, but the comparison software only runs one chain at a time, so I switched to a single-chain comparison. I have a conference paper submitted with the comparisons (noting that Stan would give roughly #cores times the number of samples in the same wall time).
I also compare VB in Stan against a custom-coded model using the synthetic dataset from the authors I referenced in an earlier post. I can’t reproduce the reported results with their Python code. Stan’s VB does very well against their reported results, though. They use RMSE against the synthetic data as their measure (similar to some others in my field). The issue is that they run their MCMC for 100,000 draws and NUTS reproduces the results long before that point. In fact, I can very quickly reproduce the synthetic-data results in Stan by running mean-field VB (which matches the mean parameters well) and then using the result as starting values for 2,000 NUTS draws (which improves the match on the covariance parameters).
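In case it’s useful to anyone else, here is a minimal sketch of that VB-then-NUTS warm start, assuming cmdstanpy’s variational()/stan_variables() interface; model.stan, data.json, and the parameter names are placeholders rather than the actual model from the paper.

```python
# Sketch: warm-start NUTS from a mean-field VB fit (placeholder model, data,
# and parameter names).
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")   # hypothetical model file
data = "data.json"                             # hypothetical data file

# Step 1: mean-field ADVI to get rough posterior means quickly.
vb = model.variational(data=data, algorithm="meanfield", seed=1)
vb_means = vb.stan_variables()                 # dict: variable name -> estimate

# Keep only variables declared in the parameters block for use as inits.
param_names = ["mu", "L_Sigma"]                # hypothetical parameter names
inits = {k: v for k, v in vb_means.items() if k in param_names}

# Step 2: a short NUTS run started at the VB means (~2,000 post-warmup draws
# across 4 chains) to clean up the covariance parameters that mean-field VB
# tends to get wrong.
fit = model.sample(data=data, inits=inits, chains=4,
                   iter_warmup=500, iter_sampling=500, seed=2)
print(fit.summary())
```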
That’s true for Stan and other MCMC only after all chains have converged. If R-hat >> 1 anywhere, then Stan’s ESS is going to discount multiple chains.
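Here’s a tiny self-contained illustration of that discounting, using arviz, which implements the same rank-normalized split-chain ESS and R-hat estimators Stan reports; the “chains” are just made-up normal draws.

```python
# Sketch: rank-normalized split-chain ESS discounts chains that disagree.
import arviz as az
import numpy as np

rng = np.random.default_rng(0)

# Four well-mixed "chains" drawing from the same distribution.
good = rng.normal(0.0, 1.0, size=(4, 1000))

# Four "chains" stuck around different locations, as in a non-converged run.
bad = rng.normal(loc=[[0.0], [1.0], [2.0], [3.0]], scale=1.0, size=(4, 1000))

print("ESS, mixed chains:  ", float(az.ess(good).x))   # near 4 * 1000
print("ESS, stuck chains:  ", float(az.ess(bad).x))    # heavily discounted
print("R-hat, stuck chains:", float(az.rhat(bad).x))   # well above 1
```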
RMSE can be backed out into ESS, because the error distribution of the estimates is normal(0, sd / sqrt(ESS)), where sd is the variable’s posterior standard deviation and ESS its effective sample size. We usually test parameters and predictions this way.
We like to report ESS / second after warmup. You can do it per iteration, but that’s hard to scale in practice. If the algorithms use the same calculations, you can scale it by the number of log density and gradient evaluations (really, the gradient evaluations completely dominate the computation in Stan).
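As a toy numerical check of that relation (nothing Stan-specific, just simulating an estimator whose Monte Carlo error is sd / sqrt(ESS) and backing the ESS out of the observed RMSE):

```python
# Sketch: error of a posterior-mean estimate ~ normal(0, sd / sqrt(ESS)),
# so ESS can be backed out of an observed RMSE as ESS ~= (sd / RMSE)^2.
import numpy as np

rng = np.random.default_rng(0)
posterior_sd = 2.0   # the parameter's posterior standard deviation
true_ess = 400       # pretend the sampler delivers this many effective draws

# Each simulated run returns an estimate whose error has sd = posterior_sd / sqrt(ESS).
n_runs = 2000
estimates = rng.normal(0.0, posterior_sd / np.sqrt(true_ess), size=n_runs)

rmse = np.sqrt(np.mean(estimates ** 2))      # the simulation truth is 0 here
equivalent_ess = (posterior_sd / rmse) ** 2
print(f"RMSE = {rmse:.4f}, backed-out ESS ~= {equivalent_ess:.0f} (true {true_ess})")
```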
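A rough sketch of pulling both numbers out of a cmdstanpy fit; model.stan, data.json, and the parameter name "mu" are placeholders, and the timing is crude since it wraps warmup plus sampling rather than isolating the post-warmup phase.

```python
# Sketch: normalize ESS by wall time and by the number of gradient evaluations.
import time

import arviz as az
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")    # hypothetical model file
start = time.time()
fit = model.sample(data="data.json", chains=4)  # hypothetical data file
wall_time = time.time() - start                 # crude: includes warmup

ess = float(az.ess(az.from_cmdstanpy(fit), var_names=["mu"])["mu"])

# Each post-warmup leapfrog step is one gradient evaluation of the log density.
n_grad = int(np.sum(fit.method_variables()["n_leapfrog__"]))

print(f"ESS/sec:                     {ess / wall_time:10.1f}")
print(f"ESS per gradient evaluation: {ess / n_grad:10.4f}")
```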
Cool. Thanks for citing and posting.
This whole thread’s full of good quotes. I’ve also found it easier in a lot of cases to just rebuild things in Stan. I always thought part of that is just that I know it really well as a dev, so of course it’s easier for me.
If a dataset is simulated, you should have many of them and you should report statistics about the distributions of ESS/sec, RMSE, etc. This gives a lot more information and doesn’t unfairly penalise one method over another just because of bad luck with a particular simulated dataset. This type of test gives a lot of information about how well a method does across models that are appropriate to the data.
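For concreteness, a sketch of that repeated-simulation design, again assuming cmdstanpy/arviz, a placeholder normal.stan with data { int<lower=0> N; vector[N] y; }, and a parameter mu:

```python
# Sketch: simulate many datasets from known parameters, fit each one, and report
# the distributions of ESS/sec and RMSE rather than a single number.
import time

import arviz as az
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="normal.stan")    # hypothetical model file
rng = np.random.default_rng(7)
true_mu, true_sigma, n_obs, n_reps = 1.0, 2.0, 500, 20

ess_per_sec, sq_err = [], []
for _ in range(n_reps):
    y = rng.normal(true_mu, true_sigma, size=n_obs)
    start = time.time()
    fit = model.sample(data={"N": n_obs, "y": y}, chains=4, show_progress=False)
    elapsed = time.time() - start
    idata = az.from_cmdstanpy(fit)
    ess = float(az.ess(idata, var_names=["mu"])["mu"])
    mu_hat = float(idata.posterior["mu"].mean())
    ess_per_sec.append(ess / elapsed)
    sq_err.append((mu_hat - true_mu) ** 2)

print("ESS/sec quantiles (5%, 50%, 95%):",
      np.quantile(ess_per_sec, [0.05, 0.5, 0.95]).round(1))
print("RMSE of posterior mean for mu:   ",
      round(float(np.sqrt(np.mean(sq_err))), 4))
```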
Thanks for the feedback. @anon75146577 it’s good to hear from a member of the UofT community. Unfortunately, I had to submit the conference paper but will keep your comments in mind when revising.
Thanks again for your suggestions regarding the Stan implementations we used to evaluate DynamicPPL and Turing. The latest results are not online at the moment, as the respective paper is under review, but we will publish the results and the implementations we used for reproducibility as soon as possible.
To clear up a misconception I’ve read a few times about our paper: the aim of this work was not to misrepresent the efficiency of Stan. Quite the opposite is true; we intentionally evaluated against Stan as we consider it to be the most robust and efficient PPL. This is also mentioned in the paper, and it is the reason why we tried our best to write efficient Stan models to compare against – but failed to do so in some cases. Fortunately, we now have more efficient implementations thanks to the effort by @stevebronder.
At least the Turing team (including myself) consider Stan to be the most efficient PPL for static models.
Moreover, we don’t think that Turing.jl is faster than Stan on static models. I’m not sure where this sentiment comes from; our paper clearly states that this is not the case in general. I also don’t think that people use Stan as a benchmark punching bag; if anything, Stan is considered the most competitive baseline that exists.
I can only speak for the Turing project here. In our case, the code base is almost exclusively developed by passionate people in their spare time. No one is developing Turing as part of their PhD. We are all from different backgrounds (and continents) and are simply passionate about providing a robust PPL for the Julia community. Just like the Stan team, we depend on third-party funding to support our efforts.
We had similar discussions in our team meetings and we heavily rely on Google Summer of Code (GSoC) and Google Season of Docs (GSoD) students. Maybe this could be an option for the Stan team too.
You can use R and Python almost seamlessly from Julia (and vice versa) through bridges. Julia is a bit like a glue language that interoperates nicely with all kinds of existing languages.
Just to throw in some opinion as a non-stats, non-numerics astrophysics person: I came to Stan from a world where likelihoods are improperly used (everything is an L2 norm) and ad hoc methods abound, from two-distribution tests to custom N-dimensional KS-test-based MCMC. No one tests the validity of these approaches in the literature, and no one is particularly worried about whether inferences are proper as long as you get the hype-inducing kick of a so-called discovery.
Some of these methods are “fast.” But I’ve messed around a lot with different PPLs/samplers and many of them give you incorrect answers for even the simplest models. I spent some time away from Stan a while back playing with PyMC3. I was not well versed in Stan and thought I could more easily implement some of the data-specific models (instrument response, etc.) there in a Pythonic way. Have you ever tried for loops in those Theano-based PPLs? There were some speed differences, but I was frustrated that PyMC3 wasn’t accurately fitting simulated data. I recoded in Stan and it worked. It came down to numerical stability. Maybe that has changed, but it was a big lesson for me. I also didn’t like the “freedom” to combine different MCMC methods on different parameters: HMC this, MCMC that, slice the rest. Where was the assurance any of this would lead to a robust inference?
While it may not be universal, I find most of these packages preaching features over fidelity. Speed over understanding. Hype over robustness.
As I focused on learning Stan, I learned to write rather complex models (coming from a C++ background helps), and this is no longer an issue for me, but it does present a barrier to some. I’ve heard people developing some Julia-based HMC frameworks (not Turing) say “you can’t do anything complex in Stan.” This may be an advertising problem. I’m also still rather a novice at what makes Stan code efficient, and while the documentation helps, it is a lot to swallow.
But again, I’m not here for speed, I’m here for robustness, and robustness takes time to understand… hype will NOT solve that. One of the best talks about Stan is @Bob_Carpenter’s talk about the low-level parts of the backend: https://youtube.com/watch?v=6NXRCtWQNMg. It’s this type of care that brings people who want to do things right to Stan.
I would love faster models, but I would rather have Stan focus on being correct, and learn to write better models myself, than have Stan focus on hype. The others who come to Stan are likely coming for the same reasons I did: robustness, and a community more focused on good model building than on the latest GPU specs that will blow away a TensorFlow model on a toy data set.
Stay away from hype and the user community will be the one you want.
I agree entirely, robustness and reliable inference should be the primary concern of any PPL.
The Turing team has only recently focused on working on a more efficient infrastructure as we got criticism about the performance based on sometimes wacky benchmarks by others. Our main concern was, and still is, to provide a robust library that is well embedded in the Julia ecosystem.
That said, speed is a major concern for many people, and it should be taken very seriously, especially given the new applications that often involve large-scale data.
I believe that a joint effort of the different PPL teams can substantially help to keep improving the respective systems in terms of reliability and speed. No one is looking for a wacky PPL just to use new algorithms, but ignoring recent advancements is also not the way forward.
I think what should be addressed here is that Stan is a bit of a closed world, which has its good and bad points. The good point is that it’s much easier to know what the universe of available functions is, which makes it easier to document. The downside is that you don’t have an ecosystem with 10,000 developers contributing functionality. With Julia’s Turing.jl, we from SciML were demonstrating that Turing.jl + Julia’s DifferentialEquations.jl could be much more performant than Stan, which is not an indictment of Stan and its AD practices, but rather an indictment of the ODE solvers that were chosen. From looking at the files, it looks like there are really just two people maintaining that portion, and it’s just calling into Eigen and Sundials. While those are respectable solver libraries, we have shown in the SciMLBenchmarks repeatedly that there are usually alternative solvers that are more efficient than dopri5, adams, and bdf (those are quite old methods! None of them is even A-stable above order 2!), among other issues, such as Sundials’ Adams method being quite finicky about reaching tolerance.
So our benchmark demonstrating advantages of Turing + DifferentialEquations.jl over Stan (which, by the way, was not performed by the Turing developers) is more about demonstrating what happens when you mix a lively developer ecosystem (~20 active developers across SciML at a given time) with a fully fledged PPL ecosystem. I don’t think anyone has really sat down and optimized exactly how the ODE solvers of Stan behave on different subsets of problems, whereas DifferentialEquations.jl has >300 solvers covering different cases in optimal ways, along with adaptive high-order implicit methods for SDEs, adaptive high-order implicit methods for DDEs, etc. These are all extremely well tested (the benchmarks are fairly thorough, are updated roughly monthly through automated systems, and the CI takes >50 hours total if you count all of the convergence and regression testing going on). The Turing.jl community gets all of this for free by integrating with a language-wide AD system.
When we time Stan vs. Turing on differential equations, this kernel advantage is what’s being demonstrated, and again it’s not an indictment of Stan but rather an issue of focus, expertise, and language: it doesn’t look like a horde of Stan developers wakes up every morning to optimize differential equation solvers and improve their accuracy, but in Julia that has been the case for a fairly long time now.
I think the main issue with Stan in this department is really just that the kernels are all baked in. Seeing an FFI (foreign function interface) in Stan’s future would be nice, because then external developers like us could add new differential equation solvers to Stan by making it call into Julia. There’s nothing wrong with Stan’s design or Stan’s mission; it’s just not a mission that means you have 10 choices of GPU-accelerated preconditioners for a CFD simulation, and so Stan will not benchmark well in those regimes simply due to the implemented kernels. PPLs like Turing have a different mission of allowing a whole general-purpose language to be extended in probabilistic ways and to integrate with language-wide automatic differentiation, which in turn makes them do well on things that are not part of the traditional sphere of Bayesian inference.
It’s not an advertising problem. Unless the docs and the GitHub repo are mistaken, Stan doesn’t have a DDE solver, so “External delay differential equation solver” is a real issue to somebody. It doesn’t have an SDE solver, so “Reverse mode SDE idea” is a real issue to somebody. It probably doesn’t intersect with the vast majority of Stan users (Stan users probably pre-select away from being the core differential-equation modeling crowd), but I think calling widely used and effective models just pure hype is a bit disingenuous to people whose research doesn’t fall into the currently available Stan kernels.
I think it’s fine to just say that doesn’t fall into the general mission of Stan: Stan does what it does well, and people respect that.