At today’s general meeting (April 2, 2020) Andrew brought up the reputation of Stan as not being fast in comparison to various competing implementations. It was noted that this is often due to badly written models. I’ll start with my contribution but will list what I recall from others as well.
If Stan were a company we would be looking at this as a marketing issue. We are not a company but we are competing with companies, e.g. Google’s TensorFlow Probability, so we should consider actively managing our reputation. I suggest we consider hiring someone to guide this process. The SGB has resources, e.g. $, and the ability to seek out and engage a PR (Public Relations) professional.
Compiler improvements with Stan 3 may be a big help here since optimization can be done for poorly coded models.
We could put up a web page for speed test models that we know are well coded. I’ll add that some SEO (Search Engine Optimization) might make sense so that the page is the #1 hit for relevant queries.
We could publish our own speed tests.
We could chase down and ‘refute’ speed test claims. Not considered a good idea by the originator.
We should pull the poorly coded example models from documentation and the web although it may be too late.
Updating the Stan examples is pretty important imo. As I mentioned in the meeting, there was a paper for a Julia package that reported they were 15x faster for logistic regression. They used the examples in the example repo for benchmarks, and after updating the Stan model code we were about 3% faster for the logistic regression and twice as fast for the high-dimensional Gaussian example (their paper had their package running 30% faster for the high-dimensional Gaussian example).
imo they didn’t do anything wrong. If we are putting that code out there as example model code, it would make sense that in some way we’ve given it a thumbs up.
It was noted that this is often due to badly written models.
If this is true then it’s news to me, which imo does make this issue a PR/documentation problem. (Although I’m not the typical Stan user.)
From my experience there are two issues at play here:
1. Stan is faster/slower than other Bayesian modeling frameworks (e.g. TensorFlow Probability, etc).
2. Stan is faster/slower than popular modeling frameworks (e.g. Spark ML library, etc)
In some industries you’re dealing with (1), namely those industries that have bought into Bayesian methods. Treating this as a PR/documentation problem regarding speed may help.
In other industries you’re dealing with (2). Correct me if I’m wrong but afaik linear regression with Spark on terabytes of data will outperform linear regression with Stan. In this case you’d want to treat this as a problem regarding what additional information Stan has to offer over frequentist methods.
I’m not sure I get what you’re saying, but I’ll say this: as a Stan enthusiast but not someone who’s directly involved with Stan’s development or on NumFOCUS’s payroll, my opinion as a statistician is that of course you get great performance in specific examples if you tailor your implementation towards certain models. So that’s really not a fair comparison. And I don’t think anybody who knows a thing or two about computational statistics is out there saying “Stan sucks for [insert statistical task here], the guys at [insert dank new software package here] are really outshining them”. The way I, as a practicing statistician, see it, Stan is a general “statistical modelling”* language, which works really well for the majority of the models I care about, and I have a wide range of interests.
Maybe the way to show that Stan is fit for purpose in the sense that it accommodates a wide range of models with reasonable sampling efficiency is to write a paper for a statistics journal where one analyses a selection of models as diverse as possible and compares with other modern implementations in other general purpose packages, such as Nimble. One example would be a model which has discrete latent variables which can be written in marginalised form in Stan and “in full” in Nimble.
I don’t think this is something we should be trying to outsource.
The hard part would be agreeing on the set of models and how to compare performance.
This I could go for. I don’t really care about SEO though.
As a developer/user, I would like to be able to know:
How fast is Stan vs. anything else that does NUTS on hierarchical models/things in our wheelhouse
What is the advantage of doing multi-threaded stuff in Stan
What is the advantage of doing GPU stuff in Stan
Maybe there’s a fourth thing just so it could be a page with four plots lol. And maybe a webpage is overkill and I should just ask more people for that info. But it seems like the kind of stuff that gets outdated and changes.
I gave an interview about 6 months ago and told the interviewee that my team and I have written Stan models that handle millions of data points and tens of thousands of parameters and take about an hour to run. It took us a few attempts and we had to rewrite portions. Their response was that this seemed surprising, because when they tried Stan they found it too slow, and their company wrote its own MCMC framework that was faster. I thought this seemed extremely wasteful of their resources, and that it was probably a brute-force framework that could run on many compute nodes/GPUs rather than actually being faster (and potentially much slower on a per-CPU, per-ESS measure).
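To make the per-CPU, per-ESS comparison above concrete, here is a minimal sketch of the calculation. The function name `ess_per_cpu_second` and every number below are hypothetical, chosen only for illustration; they are not measurements of Stan or of any other framework.

```python
def ess_per_cpu_second(ess: float, wall_seconds: float, n_cores: int) -> float:
    """Effective sample size obtained per CPU-second of compute."""
    if wall_seconds <= 0 or n_cores <= 0:
        raise ValueError("wall_seconds and n_cores must be positive")
    return ess / (wall_seconds * n_cores)


# Hypothetical numbers, for illustration only.
single_node = ess_per_cpu_second(ess=4000, wall_seconds=3600, n_cores=4)
brute_force = ess_per_cpu_second(ess=4000, wall_seconds=600, n_cores=256)

# The second run finishes six times sooner on the wall clock...
wall_clock_speedup = 3600 / 600
# ...but yields fewer effective draws per unit of compute.
efficiency_ratio = single_node / brute_force

print(f"wall-clock speedup of the second run: {wall_clock_speedup:.1f}x")
print(f"per-CPU efficiency advantage of the first run: {efficiency_ratio:.1f}x")
```

With these made-up inputs, the run that looks faster by wall-clock time is the less efficient one per CPU-second, which is the distinction a published benchmark would need to report.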
If you take the time to read through the docs there’s a bunch of great material to help with optimization. This is awesome for those of us who take the time to read it. Many people don’t take the time and I don’t see how that’s going to change. One thing I think is worthwhile is showing on a performance page that Stan’s HMC algorithm is state-of-the-art regardless of language implementation. That reiterates something I read on Andrew’s blog about having the best algorithm first and then worrying about scaling, rather than doing things the other way around.
There also seems to be a disconnect when I talk about Stan to people at work about what it looks like and how one actually programs in it. It initially appears “hard” to learn and co-workers seem to gravitate to TensorFlow because they’re familiar with it. But the wealth of models and info from the forums makes this relatively “easy” to learn. I think the SlicStan thing was a great step toward making Stan more similar to other languages, along with some extra sugar like defining multiple similar types in one line, such as
real alpha, beta, gamma ~ std_normal();
Other things that probably will exist in the future and will help:
Basically anything that lets the user write a model in the most straightforward way and then getting out nearly optimized code would really help this cause. I know that’s a huge ask!
OK, I’m replying to everyone at once, so bear with me.
Anything in the user’s guide or the example-models repo. That includes all the BUGS examples, all the Gelman and Hill examples, etc.
Did you not realize that how the model is written makes a difference? How about how it’s parameterized? Did you try to find ways to make models go faster and not find the chapters on efficiency and reparameterization in the user’s guide?
Does anyone have any ideas on how we can get people to read something like an overview of Stan before writing? I find it hard because we have too much doc, not too little. For example, new dev doc is out of control.
Glad someone saw them. But this also brings up the big problem we have—too much doc for anyone to get through. And it’s fragmented by language (R, Python, etc.) and by application area.
Yup. That’s a battle we’re not going to win. Nor are we going to compete with TensorFlow or PyTorch on fitting neural network models. Or against custom genomics models for gene alignment.
I think it depends on where you come from. If you’ve been using NumPy for years in Python, TensorFlow’s a lot more familiar than Stan. Whereas if you’ve been using BUGS or JAGS, Stan will be more familiar. If you come from C++/Java or other typed imperative languages, Stan will seem familiar, whereas those coming from dynamically typed languages like R or Python have more trouble.
Our transforms are “bijectors”. We just need to allow them to compose so we can non-center positive-constrained and probability-constrained hierarchical models.
Automatic reparameterization is very hard. It’s what the ML people call “normalizing flows”. It depends on aspects of the data and prior, not just the model structure.
People are working on automatic marginalization and automatic vectorization, but both are hard to do in general.
Lots of people say things like that, and many of them are well credentialed in computational stats.
That was the goal in building it in the first place. Of course, we want it to be fast, too, which is part of working really well. But it’s not the only part.
Monnahan, Thorson, and Branch did that for some ecology models w.r.t. JAGS. It turns out to be faster to marginalize in JAGS, too. But the real problem is that if you don’t marginalize, it’s very hard to fit something like an HMM by sampling latent states.
Excellent questions and I’d add MPI to that list of parallelizations.
But then you have to break all this down. What about ODE models? Do we compare against dedicated packages like NONMEM and Monolix?
I think the SEO will come for free if we put a page up with good content. We have a lot of internet juice from the user base.
I’m all in favor. Like a lot of these things, like publishing our own speed tests, we just need someone to do it. To do speed tests, we need to be able to do accuracy tests if they’re going to be anything other than log density and gradient speed evaluation, but even then we need arithmetic accuracy for a fair measurement (e.g., 32-bit vs. 64-bit).
I also hear the countersuggestion regularly that we just leave all the badly coded models in the example-models repo because they’re good for testing.
Hi–I just wanted to respond to this one point. It is an interesting and somewhat counterintuitive aspect of Stan that there are many different ways of writing even a fairly simple model, where computation time can change by a factor of 10 or more.
Consider some comparison points:
Math. Many people come into statistics from math or some exact science. In math, if there are equivalent ways of writing the formula, they’re the same thing.
Basic numerical analysis. People such as me with a vague understanding of numerical analysis know that there are some pitfalls (for example, don’t do subtractions such as 34049.34 - 34049.33, or computations such as log(1.0000013)), and we know there are some special tricks for linear algebra. But, again, we would have no idea that a wrong parameterization can destroy you in a Stan model.
R. I don’t know Python, but in R I know not to do loops if they can be avoided and not to do matrix inversions (e.g., I know not to do X-transpose-X-inverse etc.), but, again, that’s about it. As a user, this gives me the general impression that in a computer language I can pretty much write my expressions in a natural way and I just have to avoid a few bad practices.
Writing your own optimizer or Gibbs or whatever. What is efficient here does not necessarily have anything to do with what is efficient in Stan.
The point is that users are coming into Bayes and Stan with the general idea that (a) they can write their model in math and then express it in Stan, and (b) as long as they avoid one or two obvious "don’t"s, they’ll be basically OK.
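The two numerical pitfalls named in the comparison points above can be seen in a few lines. This is a minimal sketch using only Python’s standard `math` module. The subtraction is the one from the text; for the logarithm I have assumed a much smaller offset than the text’s log(1.0000013), purely so the loss of accuracy is easy to see.

```python
import math

# Subtracting nearly equal numbers: each input is stored to about 16
# significant digits, but the difference keeps only the digits in which
# the two inputs differ, so its relative error is far larger.
diff = 34049.34 - 34049.33
diff_rel_error = abs(diff - 0.01) / 0.01

# log(1 + x) for small x: forming 1 + x first rounds away low-order digits
# of x. math.log1p works from x directly and avoids that rounding.
x = 1.3e-12  # assumed value; smaller than the text's 1.3e-6 to make the effect visible
naive = math.log(1.0 + x)
stable = math.log1p(x)
naive_rel_error = abs(naive - stable) / stable

print(f"34049.34 - 34049.33 = {diff!r}")
print(f"relative error of the difference: {diff_rel_error:.1e}")
print(f"relative error of log(1 + x) vs log1p(x): {naive_rel_error:.1e}")
```

Both results are still roughly right, which is the point being made: these are the pitfalls people already know about, and they are mild compared with what a poor parameterization can do to sampling.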
Here are a couple messages that are not so obvious:
You shouldn’t feel too tied to the details of your model. What’s most important is not what you do with your data, but what data you include.
Reparameterizations can make a big difference, and not always in an obvious way. You have to read the user’s guide or a recent, high-quality case study.
This is even before we get to all the details (don’t use noninformative priors, don’t use hard constraints, put parameters on a unit scale where possible, don’t do inv-gamma(epsilon, epsilon), don’t forget that it’s normal(mu, sigma) not sigma^2, don’t get tangled going back and forth between vectors and arrays, Stan uses vector-matrix math rather than componentwise math for vector/matrix multiplication and division, etc etc etc).
So, there’s a lot going on, in that we are confounding people’s expectations all the time! It’s not enough to ask people to read an overview, unless we know what’s gonna be in it!
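One of the details listed above, that it’s normal(mu, sigma) and not sigma^2, is easy to show without Stan. Python’s standard-library `statistics.NormalDist` follows the same convention of taking a standard deviation rather than a variance, so the sketch below uses it as a stand-in. The variance of 4 is an arbitrary value assumed for illustration.

```python
from statistics import NormalDist

intended_variance = 4.0  # arbitrary value, for illustration only

# Mistake: passing the variance where a standard deviation is expected.
wrong = NormalDist(mu=0.0, sigma=intended_variance)

# Correct: pass the square root of the variance.
right = NormalDist(mu=0.0, sigma=intended_variance ** 0.5)

print(f"intended variance: {intended_variance}")
print(f"variance when sigma is given the variance: {wrong.variance}")
print(f"variance when sigma is given the standard deviation: {right.variance}")
```

The mistaken version silently specifies a distribution with four times the intended variance. Nothing fails, which is why this kind of slip is hard to spot.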
Gotcha. I’ll start browsing through to see which might be obviously sped up. I suggest that we might want to have multiple versions of a given model, one that includes all possible optimizations, but also maybe some less-optimized versions as I suspect that the optimized versions may be harder for beginners to grasp (depending on the optimizations applied). If a naming scheme were adopted, such as appending “_fast” or “_optimized”, this would make it obvious to speed-testers which should be used for comparisons.
Honestly I don’t know the details but I know enough to ask for help which means outsourcing. If I was still exec dir I’d go talk to NumFOCUS to see if they knew anyone to consult, then ask around and get some guidance. My assumption is that this will not be cheap to implement and would be a combination of:
Publishing performance benchmarks.
Offering standard models to compare to that are well coded.
Refuting claims as they come up.
Presenting at relevant conferences with speed-oriented presentations.
Getting into the popular press, e.g. the NYTimes, Economist, WSJ, etc. Generally that means writing easily recycled tech articles and sending them to tech writers, something like ‘Stan and the Fight against the Corona Virus’. Make sure speed statements are in there.
Layering in SEO, which may come for free, but we could also buy adwords for ‘tensor flow’. Search for ‘stan bayesian programming’ and a link to TensorFlow comes up; I don’t know if they paid for it on Google.
This is a big job. Whether the SGB/Stan community wants to take it on is another question.
Usually the two different ways are good for different things. One perspective might help with an algebraic proof and another with a geometric one.
That’s still the goal. There may be better or worse ways to write things for efficiency, but any way you get the right definition should work. The don’ts are much more complicated: they are stats related and, at the boundaries, numerical-analysis related.
We thought we’d just translate the models from BUGS verbatim, gamma(epsilon, epsilon) priors and all so we could compare apples-to-apples on sampling speed.
We thought we’d write the simplest possible code for each model (pretty much a direct translation), then provide optimized versions.
But as soon as we provide simple code that’s slow, it’s going to get picked up and reused.
I was trying to ask what kind of help you think we need and what kind of person you’re going to ask for what. From your list of bullet points, the only one that a PR consultant could help with is getting ins with the popular press.
I didn’t realize that the speed efficiency was referring to the efficiency/reparameterization chapters. Yes that info does help and is up to the modeler. I assumed it was referring to some additional functionality to make Stan faster when applying it to big data and when compared with other leading modeling frameworks.
I will say that, when I’ve presented code, I’ve usually tried to only show “optimized” versions publicly (vectorized, non-centered parameterization, etc), since my thought has been that some will want the “default” version of a model to “just work,” and as a result, the first thing they should see should be the best (defined however you want). This may be tougher to read initially, but will also work better, which might be helpful. One caveat might be that there are, say, 3 versions of a model: single thread, multithreaded, and GPU, but that’s a clearer decision for the end user.
posteriordb also aims to make it easy to run speed tests with hundreds of models (this use case is listed as Performance testing in the use cases doc). The idea is also to have tags so that it’s possible to explore which kinds of models are fast and which are not, instead of having, e.g., non-hierarchical logistic regression as the only case for speed tests. Currently, work on posteriordb is supported only by my funding. It would be nice to have more resources, especially for adding well-coded models and reference posterior results (we should not speed test cases where the inference fails). I hope posteriordb can eventually be moved under stan-dev on GitHub.
I hope this could be coordinated with posteriordb so that improved models would end up in posteriordb.
We should encourage others to use posteriordb and run speed tests with hundreds of models. We have support for R and Python and for arbitrary algorithms using Stan as an engine, but with stanc3 we can also get TensorFlow Probability models, and in the future we can support other engine-specific model code, too. But the important starting point would be to have our Stan reference results made.
You are entirely correct in my opinion, and posteriordb would be a big step in the right direction to control the conversation around Stan performance and model quality. PosteriorDB (I liked the Bayes Bench name but posteriordb is more descriptive) has the advantage of addressing many of the suggested responses while fitting well with the style of Stan ecosystem development.
Posteriordb does not address the PR issue in mainstream media, but I doubt that we have the inclination to do something as unacademic as active marketing, like buying adwords or supplying journalists with easy-to-repurpose ‘Stan is the next great thing and fast’ articles.
It will go forward without SGB, but with more support it would be ready sooner. I understand this is a prioritization issue and there are many other important projects in need of support, too. For now, I’ll just try to make SGB and other people aware of posteriordb.