"Selling" Stan


#1

Hi all,

Yesterday, I gave a presentation introducing Stan to my colleagues. Someone in the audience asked: “If you were a businessman ‘selling’ Stan, what messages would you use to convince your audience to use your product?”

I am a user because I find Stan useful for the model I have at hand. I am not a developer, so this question is difficult for me to answer.

Do you have any thoughts on why we should consider this new software package instead of JAGS or BUGS?

Thank you,
Trung Dung.


#2

I keep getting asked this question and I haven’t written an answer yet so here:

There are five main advantages to using Stan, particularly with its updated NUTS algorithm:

  • Stan is accurate. We have examples where samples from JAGS and BUGS are demonstrably from the wrong distribution. This is not due to an error in JAGS or BUGS but to the fact that their samplers have trouble dealing with high dimensions and long tails (among some other cases). Stan’s NUTS sampler does much better in these (common) settings at producing samples from the intended distribution. This is slightly technical in the sense that many applied statisticians accept the poor sampling; there are many published papers with complex models that just don’t check and pass review anyway, and the most common effect I see is slightly wrong credible intervals. Still, if researchers are OK with wrong/approximate credible intervals, then they should at least be direct about it and use an approximate Bayes algorithm (e.g., variational inference).

  • Stan is easy to troubleshoot, which saves the user time. The standard workflow in Stan includes useful diagnostics: R-hat for checking for lack of convergence, energy as a measure of how well the distribution is explored, treedepth and divergences to indicate whether _each iteration_ terminates due to memory constraints or due to the termination criterion, etc. These diagnostics are more specific than those available for BUGS/JAGS (or most HMC implementations, for that matter) and they save the user time. While there are diagnostics for mixing in JAGS/BUGS, the lack of specific diagnostics often reduces even highly technical users to guessing at model problems and their solutions. We have case studies that demonstrate how to apply these diagnostics.

  • Stan is fast for large models. For a large model, Stan’s compilation time is practically irrelevant next to the model run time, so the main concern is how efficient the algorithm is. For high dimensions and hard models, no currently available general full-Bayes algorithm is faster than a well-tuned NUTS implementation. An auto-tuned implementation of RHMC is a good candidate for a replacement, but I believe we’re as close to having that implemented reliably and with any generality as anybody else. Stan also doesn’t model discrete variables directly; instead we encourage users to sum out discrete variables, which results in much (sometimes infinitely) faster models (in terms of effective sample size per second).

  • Stan is memory-efficient. It stores the parameters and extra variables for calculating gradient information, but for many large-ish models it only requires a few GB of RAM. JAGS/BUGS store large data structures per node, so models that have many nodes are prohibitively memory-intensive. For example, this thing uses on the order of 100 GB of RAM. I don’t have a direct comparison for that because my Stan version sums out the discrete parameters, so it’s only a few GB of RAM.

  • Stan is scalable. In addition to being fast for large models, “we” (@wds15 and @bgoodri really, among others) have identified computational bottlenecks in Stan that can be addressed using multi-core and cluster-based parallel computation. Both are demonstrated to work and are being merged into the development branch of Stan. BUGS/JAGS lack a usable way to insert within-chain parallelism (AFAIK the only way would be within the component block-samplers, and it may already be there with a parallel BLAS, but the benefits are moderate). The MPI implementation @wds15 is working on scales extremely well with the number of cores (almost linearly?). Essentially, choosing an algorithm that produces a high effective sample size per sample (auto-tuned HMC) but requires a large computational effort per sample opened up room to parallelise MCMC, which is really cool.
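To make the diagnostics point above concrete, here is a minimal split-R-hat sketch in Python. This is an illustrative simplification, not Stan’s actual implementation (which also rank-normalizes the draws):

```python
import numpy as np

def split_rhat(chains):
    """Split-R-hat: split each chain in half and compare within- vs.
    between-chain variance; values near 1.0 are consistent with
    convergence.  `chains` is an (n_chains, n_draws) array."""
    half = chains.shape[1] // 2
    halves = np.concatenate([chains[:, :half], chains[:, half:2 * half]])
    n = halves.shape[1]
    chain_means = halves.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = halves.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))       # four well-mixed chains
stuck = mixed + np.arange(4)[:, None]    # chains centered in different places
print(split_rhat(mixed))  # near 1.0
print(split_rhat(stuck))  # well above 1.1, flagging non-convergence
```

Splitting each chain in half is what lets the statistic catch a chain that drifts over time rather than mixing, not just chains that disagree with each other.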

I could probably keep going but five is already a lot for a business pitch (I guess five slides, one of which says we make your highly skilled labor more efficient isn’t a bad pitch deck).

Krzysztof


#3

Side note about BUGS variants: AFAICT all of them are dead in terms of community involvement in development, except for JAGS. JAGS is actually written in a modern programming language and has source code you can get and understand. Martyn is responsive, and the API has gotten much cleaner than it used to be, so there could conceivably be an effective community-based effort to improve it.

Side note about use-cases for BUGS/JAGS:

  • Small models that are clear when written in the BUGS language and known to work with JAGS samplers.
  • Computers that don’t have enough RAM to compile Stan models (e.g., my HP Stream can’t compile Stan models, although it runs them fine). This is a real issue if, like me, you do work in resource-limited settings.
  • Large-ish models where JAGS has a particularly appropriate block sampler available but something like rstanarm is not flexible enough.

There are probably a few more.


#4

I have been told that Martyn came up with JAGS in the first place because he wanted a community to develop it. Hence he chose C as the basis to make it easy for the majority of people to join the effort… but apparently this has never really come to fruition. As I perceive it, he is still the only maintainer.

Count the # of devs for Stan in comparison to that…


#5

Yes. I looked into contributing to JAGS early on (I had a specific block sampler I wanted to add for time series with t-distributed unobserved innovations). The deal-breaker was a lack of documentation on the API and no public roadmap for its evolution. That said, I think the stage of JAGS where Martyn rapidly changes the API is over, so if you had a specific block sampler to add, it would be a better choice than just publishing your own buggy hand-rolled implementation in some obscure code repo.


#6

NIMBLE is still going, the R embedding of a BUGS-like language.

Depends who I’m selling it to. There are a lot of models you can fit in Stan (using Hamiltonian Monte Carlo) that can’t be fit in BUGS/JAGS/NIMBLE because Gibbs doesn’t scale with dimension well. There are lots of models that fit much faster and more scalably in Stan. There are multivariate models you can’t even express in BUGS or JAGS that you can fit in Stan. Stan can be run in R or Python.


#7

As @Bob_Carpenter notes, any successful sell of Stan is going to require a bespoke argument for the particular audience, which is especially hard given just how misinformed people are about so many things in statistics.

For example, people who regularly use BUGS/JAGS/NIMBLE often center the conversation on irrelevant performance metrics like total run time or time per iteration instead of the metrics that matter, like effective sample size per unit time. Consequently, to properly argue that Stan is an improvement, you first have to convince them to think about performance in the correct way, which is itself a challenging task.
And even that discussion assumes that all of the samplers are converging fast enough for effective sample size to be meaningful. Most people aren’t even aware of this and can become very defensive when you bring up the possibility of sampler bias and the need for diagnostics!
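To see why effective sample size per unit time is the right metric, compare two chains with the same number of iterations: near-independent draws (what well-tuned HMC approaches) versus a highly autocorrelated chain (typical of Gibbs or random-walk Metropolis on a hard posterior). Here is a crude illustrative ESS estimator in Python (not the estimator Stan actually uses):

```python
import numpy as np

def ess(x):
    """Crude effective sample size: n / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first non-positive autocorrelation.
    (A simplification of the estimator Stan actually uses.)"""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)
    rho_sum = 0.0
    for lag in range(1, n):
        if acf[lag] <= 0:
            break
        rho_sum += acf[lag]
    return n / (1 + 2 * rho_sum)

rng = np.random.default_rng(1)
iid = rng.normal(size=5000)          # stands in for well-mixing HMC draws
ar = np.empty(5000)                  # sticky AR(1) chain, lag-1 corr 0.95
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = 0.95 * ar[t - 1] + rng.normal()

# Same iteration count, but the sticky chain carries far less
# information per draw:
print(ess(iid) / ess(ar) > 10)
```

Both chains take similar time per iteration, so comparing time per iteration would call them equally fast; ESS per unit time reveals a better-than-tenfold gap in what you actually get out.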

On the other hand, when talking with a hardcore machine learning practitioner, you often have to deal with moving goalposts as they keep changing what they claim to be interested in computing, so that they can maintain that they use all of the data, or use so many cores in a cluster, or have their algorithm terminate fast enough. Here, if you want to argue why Stan can be an improvement, you have to convince them that accurate uncertainty quantification is the priority (many people agree to this superficially) and that accuracy should hence be the most important factor in algorithm choice (this is where people start to disagree quickly).

Ultimately I have found little success in trying to convince people to use Stan who aren’t already skeptical of their current tools and looking for a better solution. I’ve seen every bad excuse and honestly I rarely find it worth the discussion anymore. Fortunately there are many people who are looking for a better answer who are receptive to proper discussions and hence more than enough people to spend our time helping!

Long story short, if you have a particular audience in mind let us know and we might be able to provide more specific recommendations.


#8

My 2 cents:

  1. If someone is already fitting nontrivial Bayesian models and has enough knowledge about the process to interpret convergence statistics like Rhat and ESS, it will be an instant sell. The moment you demo Stan, they will start using it.

  2. The rest of the universe will not care, no matter what you do.


#9

This is all getting pretty cynical. We actually do fine recruiting users, even those who start with little knowledge beyond BUGS. Would be nice to survey users about where they were at when they started using Stan…


#10

Sorry, I did not mean to be cynical, just realistic. I just think that in order to appreciate the comparative advantages of Stan, a certain amount of knowledge/experience is required, especially with models that Gibbs/RWMH-based toolboxes struggle with.

I think that people stick to tools they know unless they accumulate enough frustration/challenges to switch. Just like with programming languages, people will not switch from language A to B solely because of B’s technical merits; it is much more likely that they grow aware of the limitations of A (which happens by using it a lot), then start looking for a new one, hopefully finding B.


#11

It’s a fine line :)

I’ve had pretty good luck getting people to switch if they had the resources and need. For biostats/ecology/biology the need is often there so it’s mostly a question of resources (commitment from supervisor or staff, time, money to travel to workshops, understanding enough math to write a likelihood, etc…)


#12

I was talking to a colleague last week and we both agreed that two very big reasons we like Stan so much are (1) the quality of the Stan community and (2) the quality of the documentation (both the manual and the curated collection of case studies).

We both really appreciate having this forum where we can ask questions that are sometimes dumb and still get helpful answers that explain where we are wrong-headed and suggest better ways to think about our problems, without being nasty or making us feel unwelcome.

The quality of a tool is almost completely irrelevant if you don’t have access to the resources you need to learn how to use it properly.


#13

Thank you so much, @sakrejda, for the list of Stan’s advantages compared to JAGS and BUGS.

Thanks to the others for your contributions to my question. I have learned something from this discussion:

  • I see that Stan has advantages for my model at hand, but that does not mean the same holds for others. So if someone is still happy (in terms of modeling or time) with JAGS or BUGS, it is difficult to convince them to change.

  • If choosing one software package for students to learn in a biostatistics course, i.e., students who do not yet know any Bayesian software package, I think Stan is a very good option to start learning. The reasons are: a big supporting team, detailed documentation, a large discussion group, and a clear program structure (block components).

  • As with programming languages, there is no winner in general. However, if Stan can later deal with discrete parameters without integrating them out, then I think Stan can replace JAGS and BUGS.


#14

That’s one of the reasons we’re trying very hard to reach out to grad students and new users among the scientific community. Which in itself plays into the next comment.

I can’t stress that enough. It helps that a lot of people besides the core Stan developers are publishing books, tutorials, and papers.

If you do have the resources to learn it, the tool itself becomes important. It plays into things like being able to debug, evaluate, etc. Lots of that is tied into the ecosystem in R, Python, etc. We still feel like we’re playing catch-up in getting all of our tooling up to the standards we’d like for ourselves in our applied work.

Lots of things, like developing more than one model in sequence, are pretty painful with naming, cut-and-paste, etc.

We’ve been pushing pretty hardline evaluation of models plus software. We’re going to have a public paper on how to do all this soon based on scaling up the Cook-Gelman-Rubin diagnostics and making them more robust while retaining sensitivity.

In higher dimensions, neither Gibbs nor Metropolis is going to mix well.

It’s not something we’ve been thinking about yet, but we should come back to it with Stan 3. The problem we have is that we don’t require users to write directed graphical models, so it’s hard for us to infer structure from a Stan program. That makes it hard to do discrete Gibbs efficiently. The reason we’re not motivated to literally add discrete sampling is that it’s horribly inefficient, and it usually can’t recover the parameters from simulated data. What we would very much like to be able to do is automatically marginalize them out of a model (we could add samples if people want to do inefficient inference, or we can calculate expectations organically if people want to do efficient inference).
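Marginalizing a discrete parameter by hand is mechanical: wherever the model would branch on a sampled indicator, you instead log-sum-exp over its possible values. A small illustrative sketch in Python (plain math, not Stan code) for a two-component normal mixture:

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of y under Normal(mu, sigma)."""
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def mixture_lpdf(y, theta, mu1, mu2, sigma):
    """Instead of sampling an indicator z in {1, 2}, sum it out:
    log p(y) = log(theta * N(y|mu1, sigma) + (1 - theta) * N(y|mu2, sigma))."""
    return log_sum_exp(
        math.log(theta) + normal_lpdf(y, mu1, sigma),
        math.log1p(-theta) + normal_lpdf(y, mu2, sigma),
    )

# The marginalized log density matches the direct mixture density:
direct = (0.3 * math.exp(normal_lpdf(1.0, 0.0, 1.0))
          + 0.7 * math.exp(normal_lpdf(1.0, 5.0, 1.0)))
print(abs(math.exp(mixture_lpdf(1.0, 0.3, 0.0, 5.0, 1.0)) - direct) < 1e-12)
```

The continuous sampler then only has to explore (theta, mu1, mu2, sigma), and per-draw membership probabilities can still be recovered afterwards from the two summands, so the marginalized model loses no information.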


#15

I can add how I “sell” Stan to stakeholders and my managers at Zapier. The company collects a lot of data. When we have a problem or think there may be an opportunity, we start exploring the data. We’re looking for “insights,” a.k.a. parameter estimates (!). Stan is an amazing insight generator. In fact, it’s the best such generator I’ve seen. The operator learns the syntax and the statistics, data goes in, and insights come out. Not only do insights come out, but predictions can come out too. We can also use it to analyze our more important A/B tests. Stan’s insights (parameter estimates) can also be saved. Encoding the likelihood is a “white box.” Insights can be extracted from ML (machine learning! not MLE :laugh:), but it’s typically more of a pain (à la random forest marginal estimates in terms of computation, or thin-in-value VIMP).

My company’s main interest in Stan is the problems it can solve. :-D

  • Anomaly detection (digital radar using models)
  • Forecasting
  • Insight generation
  • Prediction

Obviously we rarely put things in these terms. We’re a terse bunch of scientist-statisticians. Take it from an early “data scientist” — translating the language matters!