Any way to make Stan competitive with Tensorflow for maximum likelihood?

I decided to dig a bit more into this. I ran the tests on my Ubuntu machine with an Intel Core i7-4790 @ 3.60GHz, a Radeon VII GPU, and an Intel SSD.

I timed the execution of the following:

  • tensorflow on the CPU (I am currently using the Radeon GPU, which tf does not support; I will report tf GPU times once I switch to NVIDIA)
  • rstan 2.19
  • rstan 2.19 with glm
  • cmdstan 2.20
  • cmdstan 2.20 with glm
  • cmdstan with the experimental Stan Math branch that has the bernoulli GPU GLM
  • pystan 2.19
  • pystan 2.19 with glm

First, the really good part - the execution times without I/O:

  • tf CPU - 15.86s
  • tf GPU - 2.1s
  • tf glm GPU - 0.5s
  • cmdstan - 27s
  • cmdstan glm - 8.6s
  • cmdstan gpu glm - 0.6s (on both the Radeon VII and Titan XP)
  • rstan - 36s
  • rstan glm - 15.7s

Not so sure about these timings:

  • pystan - 76s
  • pystan glm - 73s

And now for the mixed bag of good and bad - the I/O:

The x.csv is 6.5 GB and the y.csv is 25 MB. For cmdstan I converted them to a 1.9 GB JSON file.
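
Roughly what I mean by the conversion, as a sketch rather than the exact script (I am assuming numpy plus ujson here; the file name is illustrative, and the data names N, P, x, y match the model's data block):

import numpy as np
import ujson  # any JSON writer works, ujson is just fast

x = np.loadtxt("x.csv", delimiter=",")  # ~6.5 GB of predictors
y = np.loadtxt("y.csv", delimiter=",")  # ~25 MB of 0/1 outcomes

data = {
    "N": x.shape[0],
    "P": x.shape[1],
    "x": x.tolist(),              # JSON needs plain lists, not numpy arrays
    "y": y.astype(int).tolist(),
}

with open("bernoulli_glm.data.json", "w") as f:
    ujson.dump(data, f)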

Rstan:
For Rstan it's actually great. I am reading the 2 csv files here, not the json file. R reads those 2 files in about 20 seconds.

Cmdstan:
In order to get the optimize() execution time I added a chrono timer in C++ around just the optimize call in command.cpp; those are the results shown above.

The I/O for cmdstan is actually 5.5 to 6 minutes, which seems excessive for reading a 2GB .json file.
I was not able to stan_rdump the data to test the .R format as I ran out of RAM generating it.

For Pystan I used the tf-generated values as input (the same way as the blog post), so I didn't time those.

I am going to replace the GPU with an NVIDIA one to also test that with TF and cmdstan. Not sure if I will manage that today, since switching from AMD to NVIDIA drivers on an Ubuntu machine is a bit stressful.

EDIT: note that pystan should not be that much slower compared to rstan.
EDIT2: updated pystan timings, added GLMs.


I just want to say that I am really proud of how the GPU core of Stan Math is shaping up, and judging by the results above we are on the right path. So a huge thanks and thumbs up to @stevebronder @tadej @seantalts @Erik_Strumbelj etc.

The GLMs from @Matthijs that were finished by @tadej also seem to be a huge win.


Looks really good.

For pystan, compile the model first, then time just the sampling step:

import pystan
import time

model = pystan.StanModel(...)    # compile once, outside the timed region
t0 = time.time()
fit = model.sampling(data=data)  # time only the sampling call
duration = time.time() - t0
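
Since the benchmark here is maximum likelihood rather than sampling, the same pattern works for the optimizing call (a sketch, reusing model and data from above):

t0 = time.time()
mle = model.optimizing(data=data)  # time only the optimization itself
duration = time.time() - t0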

Ok, yeah. That should do it :) Sorry…

I updated the post above. It's weird to me that there is only a 3s difference between the GLM and the basic bernoulli model in Pystan. I am probably missing something there.

That probably means there is still some I/O that causes the difference. Is the RStan timing measured the same way, or are those timings "Stan" timings?

That is what I thought, yes - there must be some I/O going on there.

For Rstan I took the same approach as the blog post:

tic("rstan")
fit <- optimizing(model, data = stan_data, hessian = FALSE)
toc()

My hunch is that we might be transforming the data from numpy arrays to lists, which probably takes a long time with huge data.
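
Just to illustrate what I mean, a toy example (the sizes are made up; this only shows where the time would go):

import numpy as np

x = np.random.rand(10_000, 250)  # stand-in for the real predictor matrix
x_list = x.tolist()              # element-by-element conversion to nested Python
                                 # lists is the step that gets slow for huge arrays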

Good to know we can/need to improve this.


Agreed, I made an issue for it in the Stan repo (where the JSON parser lives): stan::io::json parser is 10x slower than python's ujson · Issue #2776 · stan-dev/stan · GitHub

[edit] You can also use the weird rdump format with CmdStan and it takes half as much time as its json parser, though that's still not great.

@rok_cesnovar did you try TF with XLA?

Thanks!

I tried doing that, but my R session crashed on the stan_rdump call. I have 16GB of RAM, which should be enough. I will run that again to see whether it uses all the RAM or whether I have R configured wrong.

No, I ran this with what the blog post author used. I plan to try TF with the Titan XP GPU this afternoon, once I run out of steam for coding/writing :) Will try XLA as well, once I figure out how to incorporate it.

Speaking of XLA, would there be some way we could let the user pass data (or information about the data) to the new Stan compiler? I could imagine training a model with a fixed amount of data. Letting Stan/Eigen know about those fixed sizes would probably give a lot of advantages.

The Math library isn't general enough to deal with Eigen's fixed-size matrices/vectors… we really need to do that rewrite to use either Eigen Refs or Eigen's templated base classes (maybe EigenBase is most general?).

But we could use that information very nicely for optimization in stanc3. You can currently do this by just defining these as constants in your Stan model, though it's very frustrating you can't define e.g. N=20 for use in the data block (perhaps an argument to collapse the transformed data and data blocks into a single block?). Otherwise we could have the compiler take that information in at compile time and generate a model that only works with those sizes and throws an exception if any other sizes are passed in. Sort of a weird UX though. If we ultimately ship an interpreter, that distinction totally blurs and we can do optimization at runtime based on all of the actual data.


I've not made my way through the stan GPU stuff yet, but I've run a short test (using the same data as before) using the GLM functions in both tensorflow-probability and stan and noted them here:

TLDR:

tf (nvidia 1060 GPU, float32): about 2 seconds
stan (i7-8750H CPU, 1 core, float64): about 11 seconds

IMO the stan performance is really impressive, and a big improvement from my last try!


I also wasn't able to get rstan::stan_rdump to work, but not because I ran out of RAM (32GB):

> N <- stan_data$N
> P <- stan_data$P
> x <- stan_data$x
> y <- stan_data$y
> rstan::stan_rdump(names(stan_data), file="bernoulli-glm.data")
Error in paste(as.vector(vv), collapse = ", ") : 
  result would exceed 2^31-1 bytes

Will try with json tomorrow.

I've got this working on the GPU now; adding some timings around the stan::services::optimize::lbfgs call, I get a runtime of about 1.2 seconds - very impressive!

What unfortunately wasn't so impressive was the amount of time the whole thing took, which was about 7 minutes :-(

I wrote the data to JSON using python-rapidjson, which writes the JSON in about 140 seconds and can read it in about 120 seconds. Perhaps you could consider using rapidjson in Stan?
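
For reference, I measured the read time roughly like this (a sketch; the file name is just where I dumped the data):

import time
import rapidjson  # from the python-rapidjson package

t0 = time.time()
with open("bernoulli-glm.json") as f:
    data = rapidjson.load(f)
print("read took", time.time() - t0, "seconds")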


Have you tried ujson?

Just tried it now - ran it a few times in disbelief, to be honest. It reads the data in about 25 seconds. I guess rapidjson isn't so rapid anymore!


Thank you for going that deep and also confirming my tests on different hardware!

Trying to use ujson or rapidjson or whichever library is best in terms of speed, licensing, etc. is something we are exploring, and it will hopefully be available soon, at least on the develop branch on GitHub. It won't be part of an official release before October, when 2.21 is released.


What are you trying to move the data in from? It takes a smallish C++ class to read directly from whatever you have.

I wrote the data to a file (bernoulli-glm.json) and ran the command ./bernoulli-glm optimize init=0 data file=bernoulli-glm.json. I've never used cmdstan before, so please forgive me if there is a better way to get data into it (or if I have misunderstood your question).

I got from the thread that you were converting from something to JSON to get it into CmdStan. I'm curious what that something is, because writing a shim for CmdStan to read from memory isn't a big deal, so all this effort going into reading from text feels like it might be unnecessary.