Addressing Stan speed claims in general

Can you come up with a $ cost and send it along to the SGB? Or I am happy to help develop the proposal.

Breck

This is a contentious topic and I’m mostly reluctant to comment, but I do want to make a few important points before running away.

Advertising and hype are almost entirely unproductive for growing the Stan community because the factors that appeal to the majority of the population are misaligned with the goals of Stan. The machine-learning hype machine, and increasingly the academic-statistics one, brings people in with misleading promises only to leave them disappointed afterwards.

The goals of Stan to this point have been to facilitate users building their own models that capture the bespoke structure of their experiments, and then to offer computation that is as fast as possible but, most importantly, as diagnostic as possible, so that users know whether or not their results are faithful to their model. This includes algorithmic work as well as programming work to make Stan as fast as possible on commodity hardware and then take advantage of specialized hardware where possible.

Being faster at simple models like linear regression with millions of data points says nothing about these goals. In fact I’m going to make the claim that this example isn’t even relevant to applied statistics! After tens of thousands of observations one’s uncertainties are so small that the utility of one’s inferences is limited by the bias of the linear modeling assumptions, not the amount of data that one can process. From an applied perspective it’s a red herring.
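To make the scaling explicit: under the usual regularity conditions the posterior standard deviation of a regression coefficient contracts like

$$ \mathrm{sd}(\beta \mid y) = O\!\left(\frac{1}{\sqrt{n}}\right), $$

while the bias from the linear-modeling assumptions stays $O(1)$ in $n$. At $n$ in the millions the sampling uncertainty is orders of magnitude smaller than any plausible misspecification, so the modeling assumptions dominate the error budget long before raw throughput does.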

Being faster at fitting an arbitrary suite of models without testing for computational accuracy is no better. The BUGS examples are terrible because most of them never really fit in BUGS to begin with. Until they are updated in response to HMC’s more precise diagnostics (which would be a useful task, and I highly recommend that anyone interested in contributing give it a try and ask questions on the forums about particular attempts, especially in collaboration with the posteriordb team) they just allow one to compare how quickly one can arrive at the wrong answer.

The posteriordb project is absolutely the right step forward in this direction, but it will always look weak relative to the grandiose claims being made in the academic literature, and there’s nothing we can do to change that unless we fundamentally change (dare I say compromise) the goals of the project.

Truthful advertising is more like “Stan gives you the tools to spend lots of time thinking really hard about your model and how it interacts with your data so that you can build robust inferences.” That will always be unglamorous, and indeed outright anathema to those looking to churn out quick, often sloppy inferences. It is a miracle, however, for those who have been burned by sloppy inferences in the past and are committed to putting in the work to build better analyses.

We can’t win the advertising game when the game is built on false premises and misunderstandings, especially when the only way to move towards more principled conversations is to ask people to spend a ton of time learning statistics. The incentives for other projects are entirely different from our own and we cannot win at their game. No one wants to take the hard path when so many seemingly credentialed people are promising easy solutions.

All we as a project and community can do is keep pushing the technology in Stan as far as possible while putting out case studies and other tutorials/lectures/videos/etc. demonstrating problems and their resolutions in Stan, especially case studies written for particular fields, so that when people do recognize how challenging this all is and are ready to commit the necessary time, they can find us and have the resources that they need.

28 Likes

Is it possible to include other PPLs in posteriordb yet?

Also, it would be cool to use posteriordb for something like sotabench, but for Bayesian models.

Hi, just a few things:

  1. There’s some talk of advertising. I think advertising can be good! Advertising is no substitute for improvements in the software, documentation, and community. But advertising can be useful in making people aware of improvements in the software, documentation, and community. Even serious Stan users can be unaware of everything that Stan can do.

  2. Speed can make a difference. Stan users fit a lot of hierarchical linear and logistic regressions with thousands of data points and hundreds or thousands of predictors. These are real problems. There are ways to code Stan models so they can run fast for these problems, and it’s good to make users, and potential users, aware of this.

I’m one of those users! I had a real applied problem with something like 150,000 data points and somewhere between 70 and 1000 predictors (depending on which model I used). I gave up on fitting it in Stan because it was too slow (this was a few years ago). Then recently Bob showed me how to make it run something like 100 times faster (see the sketch after this list).

  3. A separate and also relevant issue is the advertising that can be necessary to counter misconceptions. This recent discussion got started because people were going around saying that Stan was slower than other software, but it turns out they were running an inefficiently-coded Stan model. There’s no need to have people thinking that Stan is slow for problems where it isn’t!

  4. I disagree with Michael’s statement that the Bugs examples are terrible. They are real models that people have fit, and I’ve seen no evidence that fitting them in Stan gives “the wrong answer.” I guess I wouldn’t be surprised if such examples exist; in that case it would be good for someone to flag them. One issue that we have discussed several times over the years is that many of these models follow what we would now consider poor practice, with things such as inverse-gamma(0.001, 0.001) priors. So there’s always been a tension between coding the models as is (to provide continuity to former Bugs and Jags users) and coding them as we would prefer. Now that the number of active Bugs and Jags users is dwindling (not to zero, but it’s a less important audience than before), maybe we don’t need to worry so much about the old versions of these models. Still, they remain a useful set of example models for many potential users.

  5. I agree with Michael that we should not be trying to “win the advertising game.” The point of any advertising should be to honestly state what Stan, and other statistical software, can and cannot do. I was motivated to get this discussion going because I think there are some misconceptions out there about Stan that we could fix.
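To give a flavor of what a speedup like the one mentioned in point 2 can look like, here is a generic illustrative sketch (not the specific fix Bob made): on a big logistic regression, replacing a per-observation sampling-statement loop with Stan’s fused, vectorized GLM density alone can buy an order of magnitude.

data {
  int<lower=0> N;                       // observations
  int<lower=0> K;                       // predictors
  matrix[N, K] X;                       // design matrix
  array[N] int<lower=0, upper=1> y;     // binary outcomes
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 5);
  beta ~ normal(0, 2);
  // Slow: one sampling statement per observation
  //   for (n in 1:N)
  //     y[n] ~ bernoulli(inv_logit(alpha + X[n] * beta));
  // Fast: a single call to the fused, vectorized GLM density
  y ~ bernoulli_logit_glm(X, alpha, beta);
}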

We have lots of venues for cleaning up confusion, including this forum, our priors wiki, our blog, the future Stan newsletter, the courses we offer at Columbia, Aalto, etc., and Michael’s short courses. Once we have clear results from posteriordb and any other comparisons we do, we have lots of ways of spreading the word.

7 Likes

+1 on SGB supporting posteriordb

IMHO it’s a bad idea to start down the road of defending Stan on speed. I wholly agree with @bbbales2 and @betanalpha; stick to bespoke modelling for science and focus on the distinctions those people find important. I’d imagine modelling flexibility will top speed as a primary concern any day. Personally, I think if a model won’t finish in a reasonable time unless it’s non-centered, reparameterised, Cholesky-decomposed and map_rect-ed, then it’s probably not a good fit for Stan and we should just say that. It’s worse to waste time and be disappointed in the end.
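For anyone who hasn’t met that jargon, “non-centered” means rewriting hierarchical parameters in terms of standardized deviations so HMC doesn’t have to navigate a funnel. A minimal illustrative sketch (the classic hierarchical-normal setup; all names are made up):

data {
  int<lower=1> J;               // groups
  vector[J] y;                  // group-level estimates
  vector<lower=0>[J] se;        // their standard errors
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[J] theta_raw;          // standardized group deviations
}
transformed parameters {
  // Implies theta ~ normal(mu, sigma) without sampling theta directly,
  // which avoids the funnel geometry of the centered version.
  vector[J] theta = mu + sigma * theta_raw;
}
model {
  mu ~ normal(0, 10);
  sigma ~ normal(0, 5);
  theta_raw ~ std_normal();
  y ~ normal(theta, se);
}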

As another example, I think we should be honest about the prospects for non-identifiable models. I wasted weeks trying to fit mixture models, which the documentation makes look perfectly feasible but which are not so in the general case, and I see the same with some people’s fruitless forays into CP and LDA models too.
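To illustrate the mixture point: the standard partial fix from the documentation is an ordering constraint on the component means, which breaks label switching but not the other non-identifiabilities that bite in the general case. A minimal two-component sketch (illustrative names only):

data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  simplex[2] lambda;            // mixing proportions
  ordered[2] mu;                // ordering breaks label switching
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ normal(0, 5);
  for (n in 1:N)
    target += log_mix(lambda[1],
                      normal_lpdf(y[n] | mu[1], sigma[1]),
                      normal_lpdf(y[n] | mu[2], sigma[2]));
}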

The reason I think Stan shouldn’t be compared on speed is that most of the general audience (including the ML crowd) cares about comparisons between substitutable methods rather than comparisons of different implementations of the same thing, and it’s easy to generate comparisons in which Stan fails miserably: just pick a tool with a killer implementation like XGBoost, random forest, PLS, AP, word2vec, etc., and compare it to an attempt to do the equivalent in Stan.

Finally, IMHO, going the route of correcting the wrongs of the world through advertising and PR is a slippery slope and a trap. As @betanalpha said wisely, Stan will lose if it tries to distinguish itself in those places where opinion is made by multitudes and talking heads.

1 Like

Emiruz: I agree that we should be honest. Part of honesty is that Stan is actually not slower than PyMC3 for logistic regression! Stan is relatively fast for some applications and relatively slow for others. That’s fine.

When there are misconceptions out there, I think it’s good for us to fix them. Aki’s work on posteriordb should be a good way of doing this. I don’t think this will cause us to “lose.” We want to provide useful and accurate information to help potential users make decisions.

1 Like

Following is just IMHO as always.

I think it’s a good reason to have our own benchmarks, but does it motivate PR or advertising?

Definitely, yes! But as @bbbales2 said, by ourselves. I.e., just point it out to the author or submit a comment or whatever, and besides that have our own comparisons. This is as opposed to a marketing drive that tries to outmanoeuvre the other marketing departments, which I say with reference to the following point made by @betanalpha:

Emiruz:

If the term “advertising” is too loaded, we can use the term “communication.” If there’s confusion out there, it can help to communicate the truth to clear up misconceptions. This motivates a lot of our documentation. In clearing up misconceptions, we can do better than just pointing things out to individual authors; we can write and publish documentation that reaches wider audiences of people who would otherwise be confused.

It’s fine to ask people to spend a ton of time learning statistics (or, more precisely, to give them resources to allow them to make effective use of the time they have available); this can go on in parallel with our writing documentation and preparing reports that clarify points of confusion. For example, the claim that PyMC3 is twice as fast as Stan for logistic regression is false; that’s a misconception we can clear up. If there are any misconceptions the other way (people thinking Stan is faster when it’s not), we can clear that up too!

4 Likes

@mans_magnusson knows the status better.

Hi!

I’m actually working on this right now, so it will hopefully be done in the next couple of weeks, but probably not before Easter.

/Måns

I think a gentle step toward PR management, without the unpleasant aura of crisis management / chasing down detractors, would be a publicized Stan blog that addresses topics of general interest to the community Stan is targeting (comparative speed between languages, model optimizations, new options) and regularly posts about them with some basic repository code people can work through. Ideally the blog would also take questions and regularly answer one or two of them at the end of posts, to encourage user engagement.

The tone should be light and willing to admit when Stan arguably could do better but also pointing out when supposed differences between languages are illusory or truly model-dependent. These blog posts would be quite different from a case study, but could be used to feature new case studies or arguments under a section that highlights a few interesting new applications of Stan each time, perhaps after answering a user question or two.

I suggest this because many people who are reluctant to wade through a bulletin board will happily look through an interesting blog post over lunch and note to themselves that it reflected well on Stan or taught them something they did not know and might want to study more later. It’s remarkable how many people I have talked to know about and enjoy Stitch Fix’s blog posts about modeling. Funding that included partial responsibility for managing such a blog would be money well spent, in my opinion.

Just my thoughts.

7 Likes

I absolutely agree with @bachlaw. A dedicated Stan blog, separate from statmodeling.stat.columbia.edu, would be very good for communication. As it is, posts about Stan get mixed in and lost among the volume of posts on Stats Modeling.

Anecdotally, it does seem like Stan gets used as the benchmark punching bag by lots of software packages (lately, I’ve also seen people trumpeting the speed of Turing.jl over Stan), which means that the Stan team has certainly done a decent job already of communicating its existence and value. I do think that the R and Python communities, perhaps not through a central effort but rather through the distributed efforts of their users, have set an example of how to remain appealing against solutions that emphasize raw speed and performance: ease of use, versatility, familiarity, well-organized documentation, and a helpful userbase leave people satisfied with something that is fast enough. For instance, Julia is certainly growing in use, and some of its community members predicted a more rapid and meteoric overthrow of R and Python, but enough people are satisfied with R and Python that they haven’t felt the need to jump ship and re-code their projects for speed improvements that don’t make enough of a difference in their day-to-day work.

Stan has a wonderful, unpretentious community associated with it, and I give props to the Stan team for cultivating that from the top down. And I think that is a well-known quantity. In a thread on Reddit the other day discussing the sharing of BDA3, people in the comments spoke about how the Stan community and the helpfulness of the devs in troubleshooting code and teaching statistics were one reason people use Stan. I think the next most effective thing to address, related to what @Bob_Carpenter was saying, is streamlining the documentation and collecting the examples, case studies, and vignette articles in a more concentrated place on the website. I’ve been navigating between the older Stan case studies pages, LOO vignette articles, the documentation, various Github pages, and the Stats Modeling blog, and without some hand-holding from the community, I never would have intuitively found some of those pages.

10 Likes

I agree too. We’ve talked several times in the Stan meeting about having a Stan Newsletter in blog form. Jonah has said he’d bring this up with the SGB so maybe it will happen soon.

1 Like

I don’t know if people still remember this from back in the thick of the browser wars over JavaScript engines. I think it was put up by Firefox people. It’s an auto-generated page that shows equivalent code running on the latest versions of major browsers (it used to be IE, Chrome, FF).
https://arewefastyet.com/

For SEO purposes, the equivalent page should be something like mc-stan.org/speed, and the title of the page should be “Stan Speed Benchmarks” or similar, listing competitors’ results and the configurations used.

Anybody messes with you (writes a paper about you), their name (benchmark) goes on the list!

3 Likes

I think we are on the same wavelength: a site like sotabench where there’s a bunch of Stan models and other people can submit posteriordb schemes to compare against Stan.

1 Like

I am brand new to Stan, so I am wary of jumping into a governance thread, but I thought my reasons for exploring Stan fit with the topic. Our research frequently uses mixed logit choice models, or what might alternatively be classified as hierarchical softmax models. Recently, I have been exploring Bayesian estimation of our models, prompted by two recent papers in our field, both of which comment on Stan computation time (quoted below). I am using a benchmark dataset in our field to compare various methods/packages for estimating a simple mixed logit model. I estimate the model in an R package using maximum simulated likelihood in 7.6 minutes, compared with ~35 minutes in Stan. Of course, the MSL method only gives point estimates. I also tried TensorFlow Probability but found it more challenging to specify and debug the model, as well as quite slow to run.
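For readers outside our field, here is a rough sketch of what a mixed logit looks like in Stan (a simplified illustration with made-up names, not the benchmark specification I actually ran; it assumes independent taste dimensions for brevity, where the full model would use a Cholesky-factored covariance):

data {
  int<lower=1> N;                       // choice observations
  int<lower=2> J;                       // alternatives per choice set
  int<lower=1> K;                       // attributes
  int<lower=1> I;                       // individuals
  array[N] int<lower=1, upper=I> id;    // individual making each choice
  array[N] int<lower=1, upper=J> y;     // chosen alternative
  array[N] matrix[J, K] X;              // attributes of the alternatives
}
parameters {
  vector[K] mu;                         // population-mean tastes
  vector<lower=0>[K] tau;               // taste-heterogeneity scales
  matrix[K, I] z;                       // standardized individual deviations
}
transformed parameters {
  // Non-centered individual-level coefficients: col(beta, i) ~ normal(mu, tau)
  matrix[K, I] beta = rep_matrix(mu, I) + diag_pre_multiply(tau, z);
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 2);
  to_vector(z) ~ std_normal();
  for (n in 1:N)
    y[n] ~ categorical_logit(X[n] * col(beta, id[n]));
}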

Foundations of Stated Preference Elicitation: Consumer Behavior and Choice-based Conjoint Analysis
Moshe Ben-Akiva, Daniel McFadden, and Kenneth Train
They propose using the common method in our field: the Allenby-Train method. This method relies on a normality assumption for the mean priors and an inverse Wishart for the covariance priors. In reference to Stan, they state:

The [Stan] sampler is flexible in specification of priors and likelihoods, and is not limited to the multivariate normal model described above. A drawback of this flexibility is that STAN may run more slowly for this model than procedures that use computationally efficient Gibbs samplers for part of the multivariate normal iteration.

Their code is available online and I can see that they take 11,000 iterations and use a single chain. They provide no measures of convergence.

The other set of papers is by Bansal, Krueger, et al. (https://arxiv.org/pdf/1904.03647.pdf and “Random taste heterogeneity in discrete choice models: Flexible nonparametric finite mixture distributions”). They state:

We also explored the use of Stan as part of the current research study but found that estimation times were prohibitive for the sample sizes considered in the simulation evaluation presented in Section 5. Our experiences with Stan are generally consistent with the literature. Ben-Akiva et al. (2019) contrast NUTS with the Allenby-Train procedure and find that both methods perform equally well at recovering the true parameter values. However, whereas the reported estimation time for the Allenby-Train procedure is 12 minutes, NUTS had to be run “overnight”. Vij and Krueger (2017) attempted to use Stan to estimate a MMNL model on a large dataset containing 30,166 observations from 17,700 individuals but were unable to do so due to memory constraints. A possible avenue for future research is to custom-code a NUTS procedure with analytical gradients to enable fast and scalable posterior inference for MMNL.

The authors end up proposing a VI algorithm (published in our most prestigious journal). I imagine that between Stan, PyMC, TensorFlow, etc., there have been models with 30,000+ observations. I know that changing my NUTS code to VI/VB was as simple as changing a single line (the model runs in ~1.5 minutes for full-rank estimation in RStan). Also, I am very new to Bayesian estimation, but my understanding is that Stan uses automatic differentiation, and it’s unclear to me that custom-coding analytical gradients will help with speed (also factoring in coding time).

I also compare runtime against an implementation of the Allenby-Train algorithm in R. Indeed, it runs faster than Stan. However, the only convergence metric given is Geweke’s statistic. I ran the Allenby-Train algorithm with 1,000 burn-in iterations and 20,000 samples, then calculated the effective sample size. I get ~740 for the parameter with the smallest ESS. Using Stan (same burn-in but only 2,000 samples), I get an ESS of 513 for the same parameter. I also ran 4 chains in Stan, whereas the others only run a single chain.

I am not sure if this is of interest to your team. I thought I would send my thoughts along since I am currently playing around with various software packages for a subset of hierarchical model estimation problems. Stan was by far the easiest to implement the model in, and the forum and debug messages are excellent.

6 Likes

I have a small point to contribute which I didn’t see addressed so far. For context, we have been using Stan since 2.13 and now have an EU-country clinical trial for epilepsy running based on our Stan modeling work (1, 2).

Our forays into Pyro, PyMC3, TFP, Edward, etc. ended with a U-turn (is that a pun?) back to Stan. Our parametrizations, and not Stan’s execution speed, have always been our performance problem. As a computational engineer, I see that Stan’s implementation choices (double precision, Eigen, templated C++) preclude, broadly, high performance in the fashionable HPC sense of just-in-time-generated, highly-SIMD GPU kernels, unrolled forward-backward passes, and other more exotic devices. While such devices are put forward in languages like Julia or Python, they resemble research projects and are of dubious value in the long run for a computational modeling project or group (who’s going to maintain that incomprehensible code once the author moves on from their PhD? Have you ever read the code for the autograd package in Python? There’s a reason they moved on). The value in Stan for us, and the larger community, is a seemingly Pareto-principled approach to implementation [1] with a commitment to getting things right and robust, not just checking off boxes, which we couldn’t find elsewhere.

I think this can figure somewhere within a speed discussion, especially if the scale of resources involved goes beyond a laptop and/or a short-term project, in the same spirit as one hesitantly mentions -ffast-math to a new C/Fortran programmer.

[1] apologies to Stan implementors who might take offense, I mean the best compliment here

13 Likes

Most regularly held courses about Stan, at least the ones I’ve noticed, are on the basics: this is how you code a few models in Stan, how you do basic model checking, and how you interpret the parameters, coupled with some background in probability.

But I haven’t seen short courses in how to optimize Stan code, especially from a computational perspective, given that the modeler has already mathematically identified a plausible model.

5 Likes