Edward vs. Stan performance

Hey,

@Matthijs and I were just talking about Edward’s performance numbers in the Deep Probabilistic Programming paper, and I was wondering whether we have a response to them. Do we agree that Edward is faster when both are run on a single CPU? I know some issues have been raised with the comparison, but I haven’t seen them published anywhere:

  • NUTS is not implemented, though apparently Matt Hoffman is working on that for Edward now. From what I understand, that means Edward has pretty much no practical statistical modeling usefulness until that is released.
  • Edward is single-precision (because TensorFlow is single-precision), which could affect HMC’s stability and usefulness (but by how much?).
  • Stan now has specialized GLM functions that provide speedups that likely bring it within range of (or past) Edward for these kinds of models. In some sense that’s not a fair comparison, but the original benchmark was arguably an Edward-friendly one chosen for that paper, so our coding to the benchmark to catch up seems reasonable in that light. It would be interesting to see a model for which neither side has specialized code.
  • Obviously GPU and MPI stuff coming down the line will allow us to compete with their multi-CPU and GPU benchmarks, but for now I’m just wondering about the single CPU case.

Anyone have further thoughts here? Have we done our own benchmarks?

  • To back out a single-CPU comparison, you’d have to know how Edward scales with the number of CPUs.

  • Matt Hoffman explains in his JMLR paper on NUTS why static HMC will be a struggle for full Bayesian inference. That doesn’t mean Edward has no practical usefulness; my understanding from talking to Matt and Dustin Tran is that they’re more focused on deep belief nets and variational approximations than full Bayesian inference.

  • Last time I looked, Edward was all single precision. I don’t know how fundamentally this is built into TensorFlow. Double-precision support is one of the advantages I see cited for MXNet over TensorFlow; operations like the big Cholesky factorizations required for Gaussian process models become very challenging, if not impossible, in single precision.

  • For users on desktops, laptops, etc., multiple cores can be used to run parallel chains (at least up to 4 or so), so you don’t need anything fancier with only a handful of cores. This is already built into Stan’s interfaces (see the toy sketch at the end of this list).

  • My understanding is that Stan’s specialized GLM functions (which have been merged into develop and will show up in Stan 2.18) will speed up Stan’s performance by a factor of 8 or so for logistic regression. That would make their hardware-intensive solution only 5 times faster than Stan on a single core. It’d be nice to see the benchmark on a single CPU with no GPU support.

  • Yes, we should close the gap when we get GPU (OpenCL) and multi-core (MPI) integration, but that’s the future (hopefully not too far out, given that prototypes are passing unit tests on branches). GPU speedups will only be available for models with a lot of data parallelism, like the big matrix-vector multiply in the logistic regression example. Multi-core (within a box and over fast networks) is going to be more useful than GPU for most of the models we deal with, such as differential equation models, because almost every model has an embarrassingly parallelizable likelihood.
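To make the parallel-chains bullet concrete, here’s a toy C++ sketch; it’s not Stan’s sampler and not how the interfaces actually launch chains, just a stand-in (random-walk Metropolis on a standard normal) to show the shape of the computation. Four chains run on four std::threads, each with its own RNG and its own storage for draws, and they never need to communicate:

```cpp
// Toy illustration only: NOT Stan's sampler. Each chain is an independent
// random-walk Metropolis sampler targeting a standard normal, with its own
// RNG and its own vector of draws, so memory grows linearly in the number
// of chains while the chains themselves never communicate.
#include <cmath>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

std::vector<double> run_chain(unsigned seed, int n_draws) {
  std::mt19937 rng(seed);
  std::normal_distribution<double> proposal(0.0, 1.0);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::vector<double> draws;
  draws.reserve(n_draws);
  double x = 0.0;
  for (int i = 0; i < n_draws; ++i) {
    double x_prop = x + proposal(rng);
    // standard normal log density up to a constant: -0.5 * x^2
    double log_accept = -0.5 * x_prop * x_prop + 0.5 * x * x;
    if (std::log(unif(rng)) < log_accept)
      x = x_prop;
    draws.push_back(x);
  }
  return draws;
}

int main() {
  const int n_chains = 4;  // one chain per core
  const int n_draws = 10000;
  std::vector<std::vector<double>> chains(n_chains);
  std::vector<std::thread> workers;
  for (int c = 0; c < n_chains; ++c)
    workers.emplace_back([&chains, c, n_draws] {
      chains[c] = run_chain(1234 + c, n_draws);
    });
  for (auto& w : workers)
    w.join();
  for (int c = 0; c < n_chains; ++c) {
    double sum = 0.0;
    for (double d : chains[c])
      sum += d;
    std::cout << "chain " << c << " mean: " << sum / n_draws << "\n";
  }
  return 0;
}
```

The point is just that independent chains parallelize for free, and that each chain carries its own state and draws, which is where the memory concern in the next post comes from.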


I agree with everything Bob said except this. For problems with big latent ODEs or big latent Gaussians, using multicore ODE or linear algebra libraries can lead to better performance than just running 4 parallel chains. Basically, performance on these operations typically scales sub-linearly over cores, but memory usage scales linearly.

The problem Dan brings up is that one or two chains might consume all of your memory. If you have a small computer, a lot of data, or a model that builds a very large expression graph during autodiff, you can run out of memory before firing up four parallel chains. In these cases, running multi-threaded or even multi-process can be a win because threads can share memory and the spawned processes may only have to deal with a single shard of the data (be it a block of a matrix in a multiplication or a fold of data in a distributed likelihood calculation).
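Here’s a minimal sketch of that shared-memory idea; it has nothing to do with Stan’s actual threading or MPI design, and the normal likelihood is just a placeholder. All the worker threads read the same data vector, so there’s only one copy of the data in memory, and each thread folds the log likelihood over its own shard before the partial sums are combined:

```cpp
// Sketch of the shared-memory idea above, not Stan's actual threading or
// MPI code: every worker thread reads the same data vector (one copy in
// memory) and folds the log likelihood over its own shard; the partial
// sums are combined at the end.
#include <iostream>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// normal(mu, 1) log density for one observation, dropping constants
double normal_log_density(double y, double mu) {
  double z = y - mu;
  return -0.5 * z * z;
}

int main() {
  // fake data, shared read-only by every thread
  std::mt19937 rng(42);
  std::normal_distribution<double> gen(2.0, 1.0);
  std::vector<double> y(1000000);
  for (double& yi : y)
    yi = gen(rng);

  const double mu = 2.0;
  const int n_shards = 4;
  const std::size_t shard_size = y.size() / n_shards;
  std::vector<double> partial(n_shards, 0.0);
  std::vector<std::thread> workers;

  for (int s = 0; s < n_shards; ++s)
    workers.emplace_back([&, s] {
      std::size_t begin = s * shard_size;
      std::size_t end = (s == n_shards - 1) ? y.size() : begin + shard_size;
      double sum = 0.0;
      for (std::size_t i = begin; i < end; ++i)
        sum += normal_log_density(y[i], mu);
      partial[s] = sum;  // each worker writes only its own slot
    });
  for (auto& w : workers)
    w.join();

  double log_lik = std::accumulate(partial.begin(), partial.end(), 0.0);
  std::cout << "log likelihood (up to a constant): " << log_lik << "\n";
  return 0;
}
```

A multi-process version would look much the same, except each process would only ever be handed its own shard of the data.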

This isn’t usually a concern for someone with 8GB or 16GB of memory and four cores fitting models that consume all of their local machine’s power. I’m probably not going to try to fit a model with 2GB of data on my notebook. But I do realize other people have smaller notebooks. Dan had a MacBook Air when I saw him last week; that might not even be powerful enough to compile Stan without swapping.

On the other end, we also run into scalability issues. If you have sixteen cores on your local machine, you won’t get much additional benefit in wall time to a usable answer. The problem there is that we want to run four chains for diagnostic purposes; running sixteen just increases our marginal effective sample size per unit time.

If the goal is a large effective sample size, MCMC is embarrassingly parallel. If the goal is to find a single effective draw from the posterior (an effective sample size of one), running in parallel doesn’t help (other than the critical role of diagnosing cases where you don’t have a single effective draw).
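A rough back-of-the-envelope way to see both points at once (the symbols here are mine, nothing official): if each of $C$ chains produces effective draws at roughly $r$ per second once warmup (taking $T_{\text{warmup}}$ seconds) is done, then the wall time to a target effective sample size $N_{\text{eff}}$ is about

$$
T(N_{\text{eff}}) \approx T_{\text{warmup}} + \frac{N_{\text{eff}}}{C\,r}.
$$

For large $N_{\text{eff}}$, the second term dominates and extra chains help almost linearly; for $N_{\text{eff}} = 1$, the warmup term dominates and no amount of parallelism shrinks it.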

P.S. Dan’s not disagreeing about what’s built into Stan. The capability to run parallel chains is built into RStan (and presumably PyStan; it’s easy to run in CmdStan).

The case I’ve got in mind is GPs where those big matrices suck up memory.

And my computer has 8GB of RAM. I call her Twiggy.


Last time I looked, Edward was all single precision. I don’t know how fundamentally this is built into TensorFlow.

Single precision is AFAIK important for computing on the GPU, regardless of the library you use. The GPUs I have worked with (relatively cheap ones, though) have terrible performance in double precision: while in single precision my computation would run some 20 times faster on the GPU than on the 8-core CPU, in double precision the total time became comparable. It might be different for some of the expensive GPU models, though (the Tesla P100 seems to have decent double-precision performance).


Yes, GPs easily use a lot of memory, and even though I have 32GB of RAM in my laptop, right now other processes are using 22GB, so Stan doesn’t have that much to work with.

I’m pretty sure the people working on this for Stan have reported speedups on the order of 10–20 for things like 1000-dimensional Cholesky decompositions in double precision on inexpensive ($200) desktop GPUs. Cholesky is the bottleneck for scaling Gaussian processes.

The really big benefit we’re going to get is from multiple CPU computations, which aren’t bound by data shape and let us pretty much embarrassingly parallelize log likelihood plus gradient calculations, which are the big bottleneck now (this is also what GPUs speed up). These have already led to massive speedups on dev branches for things like individual-patient ODE models.

Here’s the link to the GPU performance:

8–10 times for N=1000 and 20 times for N=4000.

It is true that NVIDIA has sadly gone in the wrong direction in terms of double-precision support in the last year or so (specifically with the GTX series). Single-precision performance is around 32:1 compared to double-precision performance on these GPUs; on GPUs for scientific use the ratio is around 3:1. This boost in single precision is mostly due to their focus on deep learning/tensors, as those only require single precision. On some GPU models, you can convince the driver to use the single-precision units to assist the double-precision ones, getting a 4:1 ratio or so. It usually requires that the GPU not be used for display purposes, but this is generally not a problem nowadays.

The performance of double precision is still not terrible. It may not be as impressive as it once was, but you can usually get a 5-10 fold speedup from an NVIDIA GPU that is similarly priced to the CPU you are comparing it with. I do agree that getting those speedups is not as straightforward as it used to be on previous NVIDIA architectures.

On the other hand, AMD has far better performance ratios for double precision; the ratio on the majority of their gaming and non-gaming GPUs is between 4:1 and 8:1. AMD also recently (2-3 weeks ago) announced a partnership with Intel to put AMD GPUs on the same chip in Intel’s next generation. This is probably quite some time away, but it sounds promising.

We will be doing some more extensive analysis of the performance of the Cholesky decomposition in the next week (different GPUs/CPUs). I will give a heads-up here if you are interested.

In general, it seems that we need double precision. But it might be worth investigating whether we can get away with casting to single precision when doing an LDL’ Cholesky decomposition. Without the square roots, it may be similar enough to the LL’ Cholesky decomposition in double precision.

There are potentially lots of places we might be able to get away with single precision. The problem is that these are often problem-specific in the sense of working for some shapes/sizes of inputs and not others. One alternative we could have would be single-precision versions of things that would be up to the user to control.

To be clear, I doubt we should give users the ability to control the overall precision (although maybe we should have an undocumented option for testing). I was saying that if the whole Stan program is using vars backed by doubles, since the LDL’ Cholesky factorization is more numerically accurate than the LL’ Cholesky factorization, we might be able to static_cast<float>(value_of(A)) internally in the LDL’ function and get about the same precision as doing value_of(A) in the LL’ function.
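If anyone wants to poke at that, here’s the kind of experiment I have in mind as a standalone Eigen sketch (plain double/float matrices, no autodiff, no Stan code, so it only gestures at the actual proposal): solve the same SPD system with an LL’ factorization in double and with a pivoted LDL’ factorization after casting everything to float, then compare relative residuals back in double precision.

```cpp
// Standalone Eigen sketch of the experiment (no autodiff, no Stan code):
// solve the same SPD system with LL' in double and with pivoted LDL' after
// casting to float, then compare relative residuals back in double.
#include <Eigen/Dense>
#include <iostream>

int main() {
  const int n = 500;

  // build a symmetric positive-definite matrix and a right-hand side in double
  Eigen::MatrixXd X = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd A = X * X.transpose()
                      + 1e-3 * Eigen::MatrixXd::Identity(n, n);
  Eigen::VectorXd b = Eigen::VectorXd::Random(n);

  // reference: LL' Cholesky solve in double precision
  Eigen::VectorXd x_llt_double = A.llt().solve(b);

  // variant under discussion: cast to float, solve via (pivoted) LDL'
  Eigen::MatrixXf A_float = A.cast<float>();
  Eigen::VectorXf b_float = b.cast<float>();
  Eigen::VectorXf x_float = A_float.ldlt().solve(b_float);
  Eigen::VectorXd x_ldlt_float = x_float.cast<double>();

  // compare relative residuals, computed in double precision
  std::cout << "relative residual, LL' in double: "
            << (A * x_llt_double - b).norm() / b.norm() << "\n"
            << "relative residual, LDL' in float: "
            << (A * x_ldlt_float - b).norm() / b.norm() << "\n";
  return 0;
}
```

In the actual proposal the cast would happen inside the reverse-mode LDL’ function on value_of(A), and whether the float LDL’ result is close enough will presumably depend on how well conditioned the matrices a given model produces are.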

I had an email conversation with Dustin and Matt when the paper was first out. I wanted to clarify that it wasn’t a sampling improvement and what was actually being improved was computation of the log probability function.

I haven’t seen the comparison code opened up, but I’d be curious to see it. I asked to compare the trajectories (since it’s double vs. single precision), but I think Matt just compared the energies at the end. I’m really curious if the implementations are correct by comparing where trajectories end up from the same spot, same momentum, same stepsize, same number of steps.
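Something like the following would do as a standalone version of that check (toy isotropic Gaussian target, not code from either system): a leapfrog integrator templated on the floating-point type, run from the same position, momentum, step size, and number of steps in double and in float, comparing where the trajectories end up.

```cpp
// Toy version of the trajectory check: a leapfrog integrator for an
// isotropic Gaussian target, templated on the floating-point type, so the
// same start point, momentum, step size, and number of steps can be run in
// double and in float and the endpoints compared. Not either system's code.
#include <iostream>
#include <utility>
#include <vector>

// gradient of the potential U(q) = 0.5 * ||q||^2 (standard normal target)
template <typename T>
std::vector<T> grad_U(const std::vector<T>& q) {
  return q;
}

template <typename T>
std::pair<std::vector<T>, std::vector<T>> leapfrog(std::vector<T> q,
                                                   std::vector<T> p,
                                                   T eps, int n_steps) {
  std::vector<T> g = grad_U(q);
  for (int s = 0; s < n_steps; ++s) {
    for (std::size_t i = 0; i < q.size(); ++i)
      p[i] -= T(0.5) * eps * g[i];  // half step for momentum
    for (std::size_t i = 0; i < q.size(); ++i)
      q[i] += eps * p[i];           // full step for position (unit mass)
    g = grad_U(q);
    for (std::size_t i = 0; i < q.size(); ++i)
      p[i] -= T(0.5) * eps * g[i];  // half step for momentum
  }
  return {q, p};
}

int main() {
  std::vector<double> q0 = {1.0, -0.5, 0.25};
  std::vector<double> p0 = {0.3, 0.7, -1.1};
  double eps = 0.1;
  int n_steps = 100;

  auto [q_d, p_d] = leapfrog(q0, p0, eps, n_steps);

  std::vector<float> q0_f(q0.begin(), q0.end());
  std::vector<float> p0_f(p0.begin(), p0.end());
  auto [q_f, p_f] = leapfrog(q0_f, p0_f, static_cast<float>(eps), n_steps);

  for (std::size_t i = 0; i < q0.size(); ++i)
    std::cout << "q[" << i << "]  double: " << q_d[i]
              << "  float: " << q_f[i]
              << "  diff: " << q_d[i] - static_cast<double>(q_f[i])
              << "  p diff: " << p_d[i] - static_cast<double>(p_f[i]) << "\n";
  return 0;
}
```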

And I wonder what the real timing difference would be just running HMC in Edward and NUTS in Stan (defaults) and comparing n_eff / time.

We don’t even get that degree of stability across compiler optimization settings in a single compiler on a single piece of hardware. You definitely won’t get the exact same answers using single precision as double precision. Whether the single-precision version is “close enough” depends on sampling and convergence properties, which I have to imagine differ wildly across model/data-set combos. Matt Hoffman thought it would be close enough; Michael Betancourt thought it wouldn’t be. I’m guessing who’s right depends on the problem and data set.


Seeing as I’ve had problems with gamma_p under double precision, I have zero enthusiasm for anything complicated getting implemented in single precision (yes, I’ll finish the PR some day). … It would be fascinating to compare those methods under single vs. double precision, though…

My opinion on this matter is that there is no good C++ library that combines good and flexible model building, efficient math computations, and learning algorithms.

For me, TensorFlow is a good engine for model building. But its C++ API is disgusting and has no features for building models. And there is no way to implement an efficient XHMC in Python.

On the other hand, Stan has insane C++ code for MCMC, but I find its language very bad and unusable for efficient computations.

Thus I’m actually looking forward to incorporating some high-level library like TensorFlow/Torch to build models and using Stan for learning. To me that seems the most reasonable approach, because it’s actually a huge amount of work to build a good model-building engine in Stan (I miss a lot of matrix-operation and vector-operation magic in Stan; it supports them, but in a somewhat uncomfortable way).

Could you be more specific? Which operations? In the Stan domain-specific language or in the C++ API?