Addressing Stan speed claims in general

I’ll second that. Sorry I missed the last month of conversation somehow!

I appreciate long-form responses, so thanks!

The result should still give you the right stationary distribution asymptotically. As far as assurances go, I think the best you can do is simulation-based calibration for the algorithm and then posterior predictive checks for the fit.
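
In case it helps to make that concrete, here's a minimal sketch of the simulation-based calibration loop in Python. Everything in it is a toy: the prior, the likelihood, and the `fit_posterior` callback standing in for whatever sampler is being checked. If the sampler is calibrated, the rank of the true parameter among the posterior draws should be roughly uniform.

```python
import numpy as np

def sbc_ranks(fit_posterior, n_sims=1000, n_draws=100, seed=0):
    """Simulation-based calibration for a toy normal-location model.

    fit_posterior(y, n_draws) is a stand-in for the sampler under test;
    it should return n_draws roughly independent posterior draws of theta.
    """
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        theta = rng.normal(0.0, 1.0)            # draw parameter from the prior
        y = rng.normal(theta, 1.0, size=10)     # simulate data given theta
        draws = np.asarray(fit_posterior(y, n_draws))  # sampler's posterior draws
        ranks[s] = np.sum(draws < theta)        # rank statistic in 0..n_draws
    return ranks  # histogram these; uniformity indicates calibration
```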

For example, PyMC lets you alternate Gibbs sampling of the discrete parameters with HMC sampling of the continuous parameters. That’s something we can’t do in Stan.

Our original design was to do just that, but we finally realized that discrete sampling is very problematic computationally (it doesn’t mix very well).

What would be nice to have are some applications of simulation-based calibration to see whether it works. The problem we foresaw is that HMC is very sensitive to how the step size and mass matrix are tuned to the geometry being sampled, and that geometry changes every iteration when you alternate with discrete sampling. We don’t know how to get around that.
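
To make the concern concrete, here's a self-contained toy in Python (not PyMC's or Turing.jl's actual machinery) that alternates a Gibbs update of discrete mixture assignments with a fixed-tuning HMC update of the component means. Every Gibbs update changes the conditional density the HMC step sees, so a step size and mass matrix adapted for one configuration of the discrete variables can be poorly matched to the next.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy mixture: y_i ~ normal(mu[z_i], 1), z_i in {0, 1} w.p. 1/2, mu_k ~ normal(0, 5)
y = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])

def log_post_and_grad(mu, z):
    """Log density of p(mu | z, y) up to a constant, plus its gradient."""
    resid = y - mu[z]
    logp = -0.5 * np.sum(resid ** 2) - 0.5 * np.sum(mu ** 2) / 25.0
    grad = np.array([np.sum(resid[z == k]) for k in (0, 1)]) - mu / 25.0
    return logp, grad

def hmc_step(mu, z, step_size, n_leapfrog=20):
    """One HMC transition for mu with a fixed step size and identity mass matrix."""
    p = rng.normal(size=mu.shape)
    logp0, grad = log_post_and_grad(mu, z)
    h0 = logp0 - 0.5 * p @ p
    q = mu.copy()
    p = p + 0.5 * step_size * grad              # leapfrog integration
    for l in range(n_leapfrog):
        q = q + step_size * p
        logp, grad = log_post_and_grad(q, z)
        if l < n_leapfrog - 1:
            p = p + step_size * grad
    p = p + 0.5 * step_size * grad
    h1 = logp - 0.5 * p @ p
    return q if np.log(rng.uniform()) < h1 - h0 else mu   # Metropolis accept/reject

def gibbs_step(mu):
    """Resample each z_i from its discrete full conditional given mu."""
    logw = -0.5 * (y[:, None] - mu[None, :]) ** 2
    prob1 = 1.0 / (1.0 + np.exp(logw[:, 0] - logw[:, 1]))
    return (rng.uniform(size=y.size) < prob1).astype(int)

mu = np.array([0.0, 0.1])
z = rng.integers(0, 2, size=y.size)
for _ in range(1000):
    z = gibbs_step(mu)                      # discrete update changes the target ...
    mu = hmc_step(mu, z, step_size=0.05)    # ... that this fixed-tuning HMC step sees
print(np.sort(mu))                          # should be near [-2, 2] if mixing is OK
```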

Writing an efficient Stan program requires writing efficient and numerically stable arithmetic and also thinking about how autodiff executes, which is unusual in having two passes, the second of which is interpreted.
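
For anyone who hasn't looked inside a reverse-mode autodiff library, here's a toy Python sketch of that two-pass structure (the general idea, not Stan Math's actual C++ implementation): the forward pass evaluates the function while recording an expression graph, and the second pass walks that recorded graph in reverse, which is the part that's effectively interpreted.

```python
class Var:
    """Toy reverse-mode autodiff variable that records onto a global tape."""
    tape = []

    def __init__(self, value, parents=()):
        self.value = value
        self.adjoint = 0.0
        self.parents = parents          # pairs of (parent Var, local partial)
        Var.tape.append(self)

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def grad(output):
    """Second (interpreted) pass: walk the tape backward, accumulating adjoints."""
    output.adjoint = 1.0
    for node in reversed(Var.tape):
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial

x, y = Var(2.0), Var(3.0)
f = x * y + x                 # forward pass builds the tape while computing f = 8
grad(f)                       # reverse pass propagates adjoints
print(x.adjoint, y.adjoint)   # 4.0 (= y + 1) and 2.0 (= x)
```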

I’ve become convinced that the benchmark we want is wall time to a minimum effective sample size of 200 or so. That’s what most inference demands. For those who need larger ESS, the important thing is ESS per unit time after warmup.
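
Here's the kind of harness I have in mind for that benchmark, sketched with CmdStanPy and ArviZ; the model file, data file, and sampler settings are placeholders. The point is that the reported number is minimum ESS per second of wall time (or wall time to reach ESS = 200), not raw iterations per second.

```python
import time
import arviz as az
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")   # placeholder model

start = time.perf_counter()
fit = model.sample(data="data.json", chains=4, seed=1234,
                   iter_warmup=1000, iter_sampling=1000)
wall_time = time.perf_counter() - start        # includes warmup, excludes compilation

# The minimum bulk ESS across parameters is the bottleneck for inference.
min_ess = float(az.ess(az.from_cmdstanpy(posterior=fit)).to_array().min())

print(f"wall time: {wall_time:.1f} s")
print(f"min ESS:   {min_ess:.0f}")
print(f"min ESS/s: {min_ess / wall_time:.2f}")
print(f"time to ESS = 200: {wall_time * 200.0 / min_ess:.1f} s")
```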

What we find a lot is people saying their iterations are faster than Stan’s without accounting for ESS or even whether they get the right answer.

Is there something specific here you think we should be investigating (other than recoding Stan in a different language)?

I’m not sure what you mean here, but some of the PPLs, like Edward, were explicitly motivated as algorithm test beds.

You mean that we require all the functions to be declared in the parser and documented in our documentation?

That’s 10,000 Julia developers overall? How many write the kind of differentiable code you can use in Turing.jl? I really like the basic idea and understand the argument for it.

My understanding is that it’s implemented something like the way “object-oriented” code is implemented in R: by overloading functions with the correct signature rather than defining methods on objects directly, as in OO systems like C++ or Java. It’d be super cool if you had a lot of candidate packages with robust and efficient autodiff.

How does someone know what’s available as a function in Turing.jl?

The same issue comes up with systems like PyMC3. They’re not really importing any old Python code any more than Stan can import any old C++ code; the code has to be tailored to their Theano back end.

How many people manage to do that? That’s what I was getting at when I asked how many of those 10K developers add robust gradients to their packages. We’ve definitely taken the more Apple-like approach on purpose, both to avoid shifting that burden to our users and to be able to control the overall experience. We know that becomes limiting in terms of functionality.

In case this wasn’t clear, we’re doing exactly the same thing in using tried-and-true ODE integrators that others have developed. We’ve evaluated a bunch of systems that people told us were better than what we’re using, but we haven’t found one that actually was. I’m not saying Julia doesn’t have better solvers, just that the ones we’ve evaluated in C++ haven’t been great. It’s clear Julia devs think Julia has the best ODE solvers, so one option would be to translate those to C++ for inclusion in Stan. I would think that would result in faster code in the end.

Speaking of which, how many of Julia’s ODE solvers are written in Julia, and how many are FFI wrappers around C/C++ or Fortran or something else? I have trouble with docs for packages like Julia’s that list 200 choices of ODE solver; I’m not enough of an expert to wade through all the options.

I guess what I’m saying here is that having too many choices can also overwhelm new users, though many of them just cut and paste what their labmates did last year (which means they don’t discover new functions, but also hopefully don’t make poor choices).

We have had to think about adding sensitivities, though, since the popular off-the-shelf solvers don’t supply accurate versions of those.
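
To spell out what I mean by sensitivities: they’re the derivatives of the ODE solution with respect to the parameters, which satisfy their own forward sensitivity system alongside the original ODE (assuming here that the initial state doesn’t depend on the parameters):

$$
\frac{dy}{dt} = f(y, t, \theta),
\qquad
\frac{d}{dt}\!\left(\frac{\partial y}{\partial \theta}\right)
  = \frac{\partial f}{\partial y}\,\frac{\partial y}{\partial \theta}
  + \frac{\partial f}{\partial \theta},
\qquad
\left.\frac{\partial y}{\partial \theta}\right|_{t_0} = 0 .
$$

Solving that system accurately is what lets HMC get gradients of the log density through the ODE solution.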

@yizhang has been working on that for PDE solvers, but I’m not sure if he’s looking at a general FFI or just something that’ll allow C++ on the back end.

Right now, we only allow C++ extensions by users.

What’s a CFD simulation?

Right. The stats people don’t do a lot of diff eq solving and the diff eq people don’t do a lot of stats. I work with a bunch of diff eq solving specialists now in my new job.

The Stan developers understand quite well that there’s a vast array of useful models that aren’t so easy to code in Stan. SDEs are one of the things we currently have a grant to address. Neural networks are the 800-pound gorilla in the room, and we also don’t have effective ways of fitting those. Systems like TensorFlow Probability really lean heavily on their integration with good NN libraries. (I have no idea where Turing.jl is on that.)

It’s not that, it’s the point you made before: we don’t have 10,000 developers! We’re more than happy to take PRs for all kinds of new models we don’t currently fit efficiently or easily.

It’s come a long way since @syclik and I hacked together the first version that just autodiffed through the RK45 algorithm in Boost!

We’re certainly not trying to do that. We’re not even getting entangled in most of these evaluations because we usually can’t get people to do the evaluations we care about.

My advice is to do what we did for our autodiff paper on arXiv: write to the developers of the other systems with your benchmark code and ask them if there’s a way to improve it before you publish, so you don’t put out misleading benchmarks. It’s a lot of work, which is why we usually don’t do it!

I agree completely. The TensorFlow Probability folks did a fairly thorough benchmark recently that didn’t make me cringe (which is saying something, given the number of bad evals I’ve seen in papers and email).

@mans_magnusson has a repo with our current thinking about developing benchmark examples that can be used by other systems. It has the model definitions and 10K draws with an effective sample size of roughly 10K. It’s not yet an official Stan project; it’s still in Mans’s personal repo. Julia appears to have MCMCBenchmarks.jl, which looks like it’s somehow tied to McElreath’s book Statistical Rethinking.

Thanks for letting us know. The paper was very dismissive.

That’s the kind of thing that’d be great to add to Stan. All we need to communicate are parameter values/data in and the solutions and sensitivities out, so an interface wouldn’t be too hard.
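
As a sketch of the contract I mean, here's a toy in Python using SciPy's `solve_ivp` on a one-parameter exponential-decay model, with the forward sensitivity equation appended to the state. None of this is Stan's actual interface; it just shows that the only things that need to cross the boundary are parameters and data going in and the solution plus its parameter sensitivities coming out.

```python
import numpy as np
from scipy.integrate import solve_ivp

def augmented_rhs(t, state, theta):
    """State is [y, s] with s = dy/dtheta, for the toy ODE dy/dt = -theta * y.
    The sensitivity obeys ds/dt = (df/dy) s + df/dtheta = -theta * s - y."""
    y, s = state
    return [-theta * y, -theta * s - y]

def solve_with_sensitivities(theta, y0, ts):
    """Hypothetical boundary: parameters and data in, solution and sensitivities out."""
    sol = solve_ivp(augmented_rhs, (0.0, ts[-1]), [y0, 0.0],
                    t_eval=ts, args=(theta,), rtol=1e-8, atol=1e-10)
    return sol.y[0], sol.y[1]        # y(t) and dy/dtheta(t) at the requested times

ts = np.linspace(0.1, 2.0, 5)
y, dy_dtheta = solve_with_sensitivities(theta=0.7, y0=2.0, ts=ts)
# Closed form for this toy: y = y0 * exp(-theta * t), dy/dtheta = -t * y.
print(np.allclose(y, 2.0 * np.exp(-0.7 * ts)), np.allclose(dy_dtheta, -ts * y))
```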

We’ve tried to be very careful about this and in some cases trade speed for robustness. For example, @Adam_Haber just did a nice comparison of the correlation matrix Cholesky factor transform and its derivatives. It turns out TensorFlow’s bijector used a different approach that was faster and worked fine with a lot of data, but was less robust in small-data situations. So we decided not to change how we coded that.

Plumbing is hard! Especially if you want things to be easy to install, easy to use, and robust in operation across operating systems and platforms.

I think it helps to work on both fronts. Better algorithms aren’t enough by themselves, but they’re an orthogonal concern to better model parameterization (where “better” for HMC means a posterior that’s closer to standard normal).
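
The textbook example of that kind of reparameterization is non-centering a hierarchical model (my illustration here, not something from the thread): instead of sampling the group-level effects directly, you sample standardized effects and scale them, so the density HMC sees is much closer to an isotropic normal when the data only weakly identify the individual effects.

$$
\theta_j \sim \textsf{normal}(\mu, \tau)
\quad\Longleftrightarrow\quad
\theta_j = \mu + \tau\,\theta^{\mathrm{raw}}_j,
\qquad
\theta^{\mathrm{raw}}_j \sim \textsf{normal}(0, 1).
$$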
