Any way to make Stan competitive with Tensorflow for maximum likelihood?

I decided to dig a bit more into this. I ran the tests on my Ubuntu machine with an Intel Core i7-4790 @ 3.60GHz, a Radeon VII GPU, and an Intel SSD.

I timed the execution of the following:

  • tensorflow on the CPU (I am currently using the Radeon GPU, which tf does not support; I will report tf GPU times once I switch to NVIDIA)
  • rstan 2.19
  • rstan 2.19 with glm
  • cmdstan 2.20
  • cmdstan 2.20 with glm
  • cmdstan with the experimental Stan Math branch that has the bernoulli GPU GLM
  • pystan 2.19
  • pystan 2.19 with glm

First, the really good part - the execution times without I/O:

  • tf CPU - 15.86s
  • tf GPU - 2.1s
  • tf glm GPU - 0.5s
  • cmdstan - 27s
  • cmdstan glm - 8.6s
  • cmdstan gpu glm - 0.6s (on both the Radeon VII and Titan XP)
  • rstan - 36s
  • rstan glm - 15.7s

Not so sure about these timings:

  • pystan - 76s
  • pystan glm - 73s

And now for the mixed bag of good and bad - the I/O:

The x.csv is 6.5 GB and the y.csv is 25 MB. For cmdstan I converted them to a 1.9 GB JSON file.
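
Roughly what I mean by the conversion, as a sketch rather than the exact script (I am assuming numpy plus ujson here; the file name is illustrative, and the data names N, P, x, y match the model's data block):

import numpy as np
import ujson  # any JSON writer works, ujson is just fast

x = np.loadtxt("x.csv", delimiter=",")  # ~6.5 GB of predictors
y = np.loadtxt("y.csv", delimiter=",")  # ~25 MB of 0/1 outcomes

data = {
    "N": x.shape[0],
    "P": x.shape[1],
    "x": x.tolist(),              # JSON needs plain lists, not numpy arrays
    "y": y.astype(int).tolist(),
}

with open("bernoulli_glm.data.json", "w") as f:
    ujson.dump(data, f)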

Rstan:
For Rstan it's actually great. I am reading the 2 csv files here, not the json file. R reads those 2 files in about 20 seconds.

Cmdstan:
In order to get the optimize() execution time I added a chrono timer in C++ around just the optimize call in command.cpp; those are the results shown above.

The I/O for cmdstan is actually 5.5 to 6 minutes, which seems excessive for reading a 2GB .json file.
I was not able to stan_rdump the data to test the .R format as I ran out of RAM generating it.

For Pystan I used the tf-generated values as input (the same way as the blog post), so I didn't time those.

I am going to replace the GPU with an NVIDIA one to also test that with TF and cmdstan. Not sure if I will manage that today, since switching from AMD to NVIDIA drivers on an Ubuntu machine is a bit stressful.

EDIT: note that pystan should not be that much slower compared to rstan.
EDIT2: updated pystan timings, added GLMs.


I just want to say that I am really proud of how the GPU core of Stan Math is shaping up, and judging by the results above we are on the right path. So a huge thanks and thumbs up to @stevebronder @tadej @seantalts @Erik_Strumbelj etc.

The GLMs from @Matthijs that were finished by @tadej also seem to be a huge win.


Looks really good.

For pystan, compile the model first, then time just the sampling step:

import pystan
import time

model = pystan.StanModel(...)    # compile once, outside the timed region
t0 = time.time()
fit = model.sampling(data=data)  # time only the sampling call
duration = time.time() - t0
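
Since the benchmark here is maximum likelihood rather than sampling, the same pattern works for the optimizing call (a sketch, reusing model and data from above):

t0 = time.time()
mle = model.optimizing(data=data)  # time only the optimization itself
duration = time.time() - t0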

Ok, yeah. That should do it :) Sorry…

I updated the post above. It's weird to me that there is only a 3s difference between the GLM and the basic bernoulli model in Pystan. I am probably missing something there.

That probably means there is still some I/O that causes the difference. Is the RStan timing measured the same way, or are those timings "Stan" timings?

That is what I thought, yes - there must be some I/O going on there.

For Rstan I took the same approach as the blog post:

tic("rstan")
fit <- optimizing(model, data = stan_data, hessian = FALSE)
toc()

My hunch is that we might be transforming the data from numpy arrays to lists, which probably takes a long time with huge data.
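
Just to illustrate what I mean, a toy example (the sizes are made up; this only shows where the time would go):

import numpy as np

x = np.random.rand(10_000, 250)  # stand-in for the real predictor matrix
x_list = x.tolist()              # element-by-element conversion to nested Python
                                 # lists is the step that gets slow for huge arrays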

Good to know we can/need to improve this.


Agreed, I made an issue for it in the Stan repo (where the JSON parser lives): stan::io::json parser is 10x slower than python's ujson · Issue #2776 · stan-dev/stan · GitHub

[edit] You can also use the weird rdump format with CmdStan and it takes half as much time as its json parser, though that's still not great.

@rok_cesnovar did you try TF with XLA?

Thanks!

I tried doing that, but my R session crashed on the stan_rdump call. I have 16GB of RAM, which should be enough. I will run that again to see whether it uses all the RAM or whether I have R configured wrong.

No, I ran this with what the blog post author used. I plan to try TF with the Titan XP GPU this afternoon, once I run out of steam for coding/writing :) Will try XLA as well, once I figure out how to incorporate it.

Speaking of XLA, would there be some way we could let the user pass data (or information about the data) to the new Stan compiler? I could imagine training a model with a fixed amount of data. Letting Stan/Eigen know about those fixed sizes would probably give a lot of advantages.

The Math library isn't general enough to deal with Eigen's fixed-size matrices/vectors… we really need to do that rewrite to use either Eigen Refs or Eigen's templated base classes (maybe EigenBase is most general?).

But we could use that information very nicely for optimization in stanc3. You can currently do this by just defining these as constants in your Stan model, though it's very frustrating you can't define e.g. N=20 for use in the data block (perhaps an argument to collapse the transformed data and data blocks into a single block?). Otherwise we could have the compiler take that information in at compile time and generate a model that only works with those sizes and throws an exception if any other sizes are passed in. Sort of a weird UX though. If we ultimately ship an interpreter, that distinction totally blurs and we can do optimization at runtime based on all of the actual data.


I've not made my way through the stan GPU stuff yet, but I've run a short test (using the same data as before) using the GLM functions in both tensorflow-probability and stan and noted them here:

TLDR:

tf (nvidia 1060 GPU, float32): about 2 seconds
stan (i7-8750H CPU, 1 core, float64): about 11 seconds

IMO the stan performance is really impressive, and a big improvement from my last try!


I also wasn't able to get rstan::stan_rdump to work, but not because I ran out of RAM (32GB):

> N <- stan_data$N
> P <- stan_data$P
> x <- stan_data$x
> y <- stan_data$y
> rstan::stan_rdump(names(stan_data), file="bernoulli-glm.data")
Error in paste(as.vector(vv), collapse = ", ") : 
  result would exceed 2^31-1 bytes

Will try with json tomorrow.

I've got this working on the GPU now; adding some timings around the stan::services::optimize::lbfgs call, I get a runtime of about 1.2 seconds - very impressive!

What unfortunately wasn't so impressive was the amount of time the whole thing took, which was about 7 minutes :-(

I wrote the data to JSON using python-rapidjson, which writes the JSON in about 140 seconds and can read it in about 120 seconds. Perhaps you could consider using rapidjson in Stan?
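
For reference, I measured the read time roughly like this (a sketch; the file name is just where I dumped the data):

import time
import rapidjson  # from the python-rapidjson package

t0 = time.time()
with open("bernoulli-glm.json") as f:
    data = rapidjson.load(f)
print("read took", time.time() - t0, "seconds")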


Have you tried ujson?

Just tried it now - ran it a few times in disbelief, to be honest. It reads the data in about 25 seconds. I guess rapidjson isn't so rapid anymore!


Thank you for going that deep and also confirming my tests on different hardware!

Trying to use ujson or rapidjson or whichever library is best in terms of speed, licensing, etc. is something we are exploring, and it will hopefully be available soon, at least on the develop branch on GitHub. It won't be part of an official release before October, when 2.21 is released.


What are you trying to move the data in from? It takes a smallish C++ class to read directly from whatever you have.

I wrote the data to a file (bernoulli-glm.json) and ran the command ./bernoulli-glm optimize init=0 data file=bernoulli-glm.json. I've never used cmdstan before, so please forgive me if there is a better way to get data into it (or if I have misunderstood your question).

I got from the thread that you were converting from something to JSON to get it into CmdStan. I'm curious what that something is, because writing a shim for CmdStan to read from memory isn't a big deal, so all this effort going into reading from text feels like it might be unnecessary.