I decided to dig a bit deeper into this. I ran the tests on my Ubuntu machine with an Intel Core i7-4790 @ 3.60GHz, a Radeon VII GPU and an Intel SSD.
I timed the execution of the following (the "with glm" variants use a single fused GLM likelihood call instead of a hand-written likelihood; see the sketch after this list):
- tensorflow on the CPU (I am currently using the Radeon GPU, which tf does not support; I will report tf GPU times when I switch to NVIDIA)
- rstan 2.19
- rstan 2.19 with glm
- cmdstan 2.20
- cmdstan 2.20 with glm
- cmdstan with the experimental Stan Math branch that has the bernoulli GPU GLM
- pystan 2.19
- pystan 2.19 with glm
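On the Stan side the glm variants use bernoulli_logit_glm; on the tf side this is, roughly, tfp.glm.fit. A minimal sketch of the tf glm variant with toy stand-in data (illustrative only, not the exact benchmark script):

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

# Toy stand-in data; the real benchmark uses the 6.5 GB x.csv / 25 MB y.csv.
rng = np.random.default_rng(0)
x = tf.constant(rng.normal(size=(1000, 10)).astype(np.float32))   # N x K design matrix
y = tf.constant(rng.integers(0, 2, size=1000).astype(np.float32)) # N binary outcomes

# One fused call fits the whole logistic regression via Fisher scoring.
coefs, linear_response, converged, num_iters = tfp.glm.fit(
    model_matrix=x,
    response=y,
    model=tfp.glm.Bernoulli())  # Bernoulli family with the logit link
```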
First, the really good part - the execution times without I/O:
- tf CPU - 15.86s
- tf GPU - 2.1s
- tf glm GPU - 0.5s
- cmdstan - 27s
- cmdstan glm - 8.6s
- cmdstan gpu glm - 0.6s (on both the Radeon VII and the Titan XP)
- rstan - 36s
- rstan glm - 15.7s
I am not so sure about these timings:
- pystan - 76s
- pystan glm - 73s
And now for a mixed bag of great and bad - the I/O:
The x.csv is 6.5 GB and the y.csv is 25 MB. For cmdstan I converted them to a 1.9 GB json file; a sketch of the conversion is below.
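The conversion was along these lines (a minimal sketch, assuming numpy, a plain json dump, and data block names N, K, x, y; note it holds the whole matrix in RAM at once):

```python
import json
import numpy as np

# Read both csv files fully into memory.
x = np.loadtxt("x.csv", delimiter=",")
y = np.loadtxt("y.csv", delimiter=",", dtype=int)

# cmdstan's json input is a single object holding the data block variables.
data = {
    "N": int(x.shape[0]),
    "K": int(x.shape[1]),
    "x": x.tolist(),  # serialized as a json array of arrays
    "y": y.tolist(),
}
with open("data.json", "w") as f:
    json.dump(data, f)
```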
For rstan it's actually great. I am reading the two csv files here, not the json file, and R reads them in about 20 seconds.
To get the optimize() execution times I added a chrono timer in C++ around the optimize call in command.cpp; those are the execution times shown above.
The I/O for cmdstan is actually 5.5 to 6 minutes, which seems excessive for reading a 2GB .json file.
I was not able to stan_rdump the data to test the .R format as I ran out of RAM generating it.
For pystan I used the tf-generated values as input (the same way as the blog post), so I did not time that I/O.
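For reference, the pystan runs were along these lines (a minimal sketch with toy stand-in data, assuming the same optimization target as the cmdstan runs; the model file name and data keys are illustrative, pystan 2.19 API):

```python
import numpy as np
import pystan

# Toy stand-in data; the benchmark feeds in the tf-generated values instead.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# One-time C++ compilation of the model.
sm = pystan.StanModel(file="bernoulli_glm.stan")

# Point estimation with L-BFGS, analogous to cmdstan's optimize method.
fit = sm.optimizing(data={"N": x.shape[0], "K": x.shape[1], "x": x, "y": y})
```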
I am going to replace the GPU with an NVIDIA one to also test that with tf and cmdstan. Not sure if I will manage that today, as switching from AMD to NVIDIA drivers on an Ubuntu machine is a bit stressful.
EDIT: fixed a missing "not" - pystan should not be that much slower compared to rstan.
EDIT2: updated pystan timings, added GLMs