Any way to make Stan competitive with TensorFlow for maximum likelihood?

That probably means there is still some I/O that causes the difference. Is the RStan timing measured the same way, or are those “Stan” timings?

That is what I thought, yes; there must be some I/O going on there.

For RStan, I took the same approach as the blog post:

tic("rstan")
fit <- optimizing(model, data = stan_data, hessian = FALSE)
toc()

My hunch is that we might transform the data from numpy arrays to lists, which probably takes a long time with huge data.
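
A quick way to see that cost (a minimal sketch; the array size here is made up):

import time
import numpy as np

x = np.random.rand(10_000_000)  # ~80 MB of float64

t0 = time.time()
as_list = x.tolist()  # boxes every element into a Python float object
print(f"tolist: {time.time() - t0:.2f}s")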

Good to know we can (and need to) improve this.

1 Like

Agreed, I made an issue for it in the Stan repo (where the JSON parser lives): stan::io::json parser is 10x slower than python's ujson · Issue #2776 · stan-dev/stan · GitHub

[edit] You can also use the weird rdump format with CmdStan, and it takes half as much time as its JSON parser, though that’s still not great.
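
For reference, rdump is just a plain-text format: scalars as name <- value, vectors as name <- c(...), and arrays as structure(c(...), .Dim = c(...)) with values in column-major order. A minimal, hypothetical Python writer (a sketch, not an official tool):

import numpy as np

def write_rdump(data, path):
    # data: dict mapping variable names to scalars / numpy arrays
    with open(path, "w") as f:
        for name, value in data.items():
            arr = np.asarray(value)
            if arr.ndim == 0:
                f.write(f"{name} <- {arr.item()}\n")
            elif arr.ndim == 1:
                vals = ", ".join(str(v) for v in arr.tolist())
                f.write(f"{name} <- c({vals})\n")
            else:
                # rdump (like R) stores array values in column-major order
                vals = ", ".join(str(v) for v in arr.flatten(order="F").tolist())
                dims = ", ".join(str(d) for d in arr.shape)
                f.write(f"{name} <- structure(c({vals}), .Dim = c({dims}))\n")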

@rok_cesnovar did you try TF with XLA?

Thanks!

I tried doing that, but my R session crashed on the stan_rdump call. I have 16 GB of RAM, which should be enough. I will run it again to see whether it uses all the RAM or I have R configured wrong.

No, I ran this with what the blog post author used. I plan to try TF on the Titan XP GPU this afternoon, once I run out of steam for coding/writing :) I will also try XLA, once I figure out how to incorporate it.

Speaking of XLA, would there be some way we could let the user pass data (or information about the data) to the new Stan compiler? I could imagine training a model with a fixed amount of data. Letting Stan/Eigen know about those fixed sizes would probably give a lot of advantages.

The Math library isn’t general enough to deal with Eigen’s fixed-size matrices/vectors… we really need to do that rewrite to use either Eigen Refs or Eigen’s templated base classes (maybe EigenBase is the most general?).

But we could use that information very nicely for optimization in stanc3. You can currently do this by just defining these as constants in your Stan model, though it’s very frustrating you can’t define e.g. N=20 for use in the data block (perhaps an argument to collapse the transformed data and data blocks into a single block?). Otherwise we could have the compiler take that information in at compile time and generate a model that only works with those sizes and throw an exception if any other sizes are passed in. Sort of a weird UX though. If we ultimately ship an interpreter, that distinction totally blurs and we can do optimization at runtime based on all of the actual data.

1 Like

I’ve not made my way through the Stan GPU stuff yet, but I’ve run a short test (using the same data as before) with the GLM functions in both tensorflow-probability and Stan, and noted the results here:

TLDR:

TF (NVIDIA 1060 GPU, float32): about 2 seconds
Stan (i7-8750H CPU, 1 core, float64): about 11 seconds

IMO the Stan performance is really impressive, and a big improvement over my last try!

3 Likes

I also wasn’t able to get rstan::stan_rdump to work, but not because I ran out of RAM (32 GB):

> N <- stan_data$N
> P <- stan_data$P
> x <- stan_data$x
> y <- stan_data$y
> rstan::stan_rdump(names(stan_data), file="bernoulli-glm.data")
Error in paste(as.vector(vv), collapse = ", ") : 
  result would exceed 2^31-1 bytes

It looks like paste() is hitting R’s 2^31-1-byte limit on a single string while serializing the data. Will try with JSON tomorrow.

I’ve got this working on the GPU now. Adding some timings around the stan::services::optimize::lbfgs call, I get a runtime of about 1.2 seconds - very impressive!

What unfortunately wasn’t so impressive was the amount of time the whole thing took, which was about 7 minutes :-(

I wrote the data to JSON using python-rapidjson, which writes the JSON in about 140 seconds and reads it back in about 120 seconds. Perhaps you could consider using rapidjson in Stan?
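
For reference, a sketch of that kind of timing (the data shape here is illustrative, not the exact benchmark):

import time
import numpy as np
import rapidjson  # pip install python-rapidjson

N, P = 1_000_000, 100  # illustrative sizes
stan_data = {"N": N, "P": P,
             "x": np.random.rand(N, P).tolist(),
             "y": np.random.randint(0, 2, N).tolist()}

t0 = time.time()
with open("bernoulli-glm.json", "w") as f:
    rapidjson.dump(stan_data, f)
print(f"rapidjson write: {time.time() - t0:.0f}s")

t0 = time.time()
with open("bernoulli-glm.json") as f:
    data = rapidjson.load(f)
print(f"rapidjson read: {time.time() - t0:.0f}s")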

2 Likes

Have you tried ujson?

Just tried it now; I ran it a few times in disbelief, to be honest. It reads the data in about 25 seconds. I guess rapidjson isn’t so rapid anymore!
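
The swap is a one-line change from the sketch above:

import time
import ujson  # pip install ujson

t0 = time.time()
with open("bernoulli-glm.json") as f:
    data = ujson.load(f)
print(f"ujson read: {time.time() - t0:.0f}s")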

2 Likes

Thank you for going that deep and also confirming my tests on different hardware!

Trying out ujson, rapidjson, or whichever library turns out best in terms of speed, licensing, etc. is something we are exploring, and it will hopefully be available soon, at least on the develop branch on GitHub. It won’t be part of an official release before October, when 2.21 is due.

2 Likes

What are you trying to move the data in from? It takes a smallish C++ class to read directly from whatever you have.

I wrote the data to a file (bernoulli-glm.json) and ran the command ./bernoulli-glm optimize init=0 data file=bernoulli-glm.json. I’ve never used CmdStan before, so please forgive me if there is a better way to get data into it (or if I have misunderstood your question).

I got from the thread that you were converting from something to JSON to get it into CmdStan. I’m curious what the something is, because writing a shim for CmdStan to read from memory isn’t a big deal, so all this effort going into reading from text might be unnecessary.

Ah, OK, sorry, I did misunderstand your question.

I created the data in Python and have it in CPU/host memory as a numpy array. If there is a better way to transfer this data into CmdStan, I would love to try it!

As @seantalts points out, fixing that’s on our to-do list. It hasn’t been a priority in the past because optimization is the only place where I/O winds up being a measurable part of the run time.

I wrote out 125M double values in ASCII. It took 1.4 GB on disk and takes about 2 s to read back in on my Mac using an ifstream and >>.

#include <fstream>
#include <iostream>
#include <string>

int main() {
  long I = 125000000l;  // 125M values
  std::string path("/Users/carp/Desktop/foo.csv");

  // Write the values in ASCII, one per line; the separator is
  // required so that operator>> can parse them back out.
  std::ofstream f;
  f.open(path);
  for (long i = 0; i < I; ++i) {
    double a = i / 3.0;
    f << a << '\n';
  }
  f.close();

  // Read them back, accumulating so the reads aren't optimized away.
  std::ifstream g;
  g.open(path);
  double total = 0;
  for (long i = 0; i < I; ++i) {
    double a;
    g >> a;
    total += a;
  }
  g.close();

  std::cout << "total = " << total << std::endl;
}

I think @rok_cesnovar mentioned he was actually working on the JSON reader situation now :)

I think this is the kind of thing that silently bites us in benchmarks: if you just measure how long the process took to execute, you get a number that includes I/O, and if your incentive is to find settings that maximize your new software’s advantage over Stan, you’ll end up choosing large JSON data sets with low iteration counts to exaggerate that advantage. I bet this has already happened a bunch…
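
To make the distinction concrete, a sketch of a whole-process measurement (using the CmdStan invocation from earlier in the thread), which bundles I/O together with the inference:

import subprocess
import time

# Wall-clock time for the whole run: JSON parsing, inference, and
# CSV output all land in this single number.
t0 = time.time()
subprocess.run(["./bernoulli-glm", "optimize", "init=0",
                "data", "file=bernoulli-glm.json"])
print(f"whole process: {time.time() - t0:.1f}s")

Getting the inference-only number means instrumenting inside the binary, e.g. around the stan::services::optimize::lbfgs call as was done above.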

2 Likes

I have a PR up: https://github.com/stan-dev/stan/pull/2793
Also see the related issue.

It’s marked WIP, but I am mostly waiting for a green light that this approach is acceptable before I start cleaning up and optimizing. Tadej suggested trying simdjson; I’ll try it today just to see if it’s worth the hype.

EDIT: simdjson is not suitable; it assumes you have a C++17 compiler. So if anyone has time to look at the questions posted in the PR, that would be much appreciated.

For a 200 MB JSON file we go from 35 s to 2.5 s with RapidJSON, and for a 2 GB file from 350 s to 25 s on my system. For the latter file, ujson in Python takes 30 s.

6 Likes