I have been playing with some toy examples in Stan and TensorFlow and was wondering if there is a way I could speed up the Stan version I have posted here:
It is a simple logistic regression with 1 million observations and 250 predictors, which TensorFlow fits in about 7 seconds on my laptop and Stan in about 100.
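For reference, a minimal sketch of roughly that setup in RStan (not the exact posted model; the data-generating code and sizes below are made up for illustration and scaled well down from N = 1e6, K = 250):

```r
# Minimal sketch: logistic regression with N observations and K predictors,
# fit by L-BFGS via rstan::optimizing(). Sizes scaled down for illustration.
library(rstan)

model_code <- "
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  int<lower=0, upper=1> y[N];
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  y ~ bernoulli_logit(alpha + X * beta);
}
"

N <- 10000; K <- 25
X <- matrix(rnorm(N * K), N, K)
beta_true <- rnorm(K)
y <- rbinom(N, 1, plogis(0.5 + X %*% beta_true))
stan_data <- list(N = N, K = K, X = X, y = y)

sm  <- stan_model(model_code = model_code)
opt <- optimizing(sm, data = stan_data, algorithm = "LBFGS", hessian = FALSE)
```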
I have found the Stan BFGS optimizer good in certain situations, but for large models with many parameters I recently switched to 'ucminf' (an R optimizer) and cut the convergence time to roughly a tenth. A hacked-up, not-yet-robust stochastic gradient descent approach has reduced it still further.
Can you check how many function evaluations each optimization makes?
Since ucminf also uses BFGS (CRAN - Package ucminf), do you know where the difference comes from? A different line search or just a different stopping rule? Stan's BFGS does not use particularly good defaults for the various tolerances, so you may get faster results by changing those.
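(For example, something like this in RStan would change the L-BFGS tolerances and also print a per-iteration log that includes a running count of function evaluations. The argument names are the L-BFGS options documented for rstan::optimizing; the values below are only illustrative, and sm/stan_data are from the sketch above.)

```r
# Illustrative only: adjust the L-BFGS stopping tolerances and watch the
# per-iteration log, which reports the cumulative number of function
# evaluations for each iteration.
opt <- rstan::optimizing(
  sm, data = stan_data,
  algorithm    = "LBFGS",
  verbose      = TRUE,
  refresh      = 1,      # print a log line every iteration
  iter         = 2000,
  tol_obj      = 1e-8,   # absolute change in objective
  tol_rel_obj  = 1e4,    # relative change in objective (scaled)
  tol_grad     = 1e-8,   # gradient norm
  tol_rel_grad = 1e7,    # scaled gradient norm
  tol_param    = 1e-8,   # change in parameters
  history_size = 5       # L-BFGS memory
)
```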
I have experimented with the control parameters somewhat, though hardly as a robust comparison. It's not just a tolerance thing, though. ucminf combines BFGS and trust-region approaches, so it's not a 'regular' BFGS; it was much better (on the limited number of higher-dimensional problems I tried) than other R BFGS optimizers.
I'm now being picky about the wording: BFGS refers to the update rule for the inverse Hessian, so it seems it's regular BFGS with a trust-region type of monitoring for the line search. It helps if we can recognize where the performance differences come from: is it the update rule (there are others, but I'm still assuming ucminf uses regular BFGS) or is it the line search (ucminf has a better line search, and the trust-region approach is part of that, but there can be other modifications, too)? My guess is that the ucminf implementation works better especially in the initial part, when the quadratic approximation is bad, and that ucminf uses fewer function evaluations in the line search at that stage. In my experience Stan's BFGS spends most of its function evaluations on the first two line searches. It would be interesting to see a plot of function values vs. iterations comparing these optimization algorithms. Stan is also using the limited-memory version, which may affect convergence speed for high-dimensional problems, and you can also try increasing the memory limit, but I guess that would not be enough to explain a 10-fold difference in the number of function evaluations.
Fair point re terminology, I'm a bit sloppy there :) And yes, I already had the Stan BFGS memory limit up at 50 or 100; you're right that it helped. I was going to show the comparison you mentioned, but it seems that save_iterations=TRUE does not do anything, in RStan at least.
When I was doing this I noticed that the first iteration takes a lot longer; could this be due to copying all the data? If you discount that, Stan would be a lot faster than 100 seconds, but I'm not sure how to avoid it.
I can also see that (as you said) the arguments/defaults are not quite the same in the two implementations; compare stan and tensorflow.
Is there a way to expose the Stan objective function and autodiff gradients to R or Python? That way I could trial other optimizers like ucminf very easily. A quick Google search didn't reveal anything, but I think it would be a really useful thing to have.
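(For the record, a sketch of one route that seems possible in RStan, using the log_prob() and grad_log_prob() methods that rstan exposes on stanfit objects and handing them to ucminf. The 1-iteration sampling call is just a hack to obtain a stanfit object holding the data; I haven't benchmarked this.)

```r
# Sketch: expose Stan's log density and gradient to R and hand them to
# ucminf(). Assumes `sm` and `stan_data` from the sketch above.
library(ucminf)

# A stanfit object is needed to access log_prob()/grad_log_prob();
# a 1-iteration "fit" is a quick hack to get one without real sampling.
fit <- sampling(sm, data = stan_data, chains = 1, iter = 1, refresh = 0)

n_upars <- rstan::get_num_upars(fit)

# Negative log density and its gradient on the unconstrained scale.
# (All parameters here are unconstrained, so adjust_transform is moot.)
neg_lp   <- function(u) -rstan::log_prob(fit, u, adjust_transform = FALSE)
neg_grad <- function(u) -rstan::grad_log_prob(fit, u, adjust_transform = FALSE)

res <- ucminf::ucminf(par = rep(0, n_upars), fn = neg_lp, gr = neg_grad,
                      control = list(maxeval = 1000))
res$value  # optimum of the negative log density
```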
Thanks. It seems that TensorFlow is computing 3 times more evaluations, but much faster, which is no surprise as it's optimized for this kind of task. The difference could be due just to multithreading. There are pull requests for some speedups for Stan, too (in addition to GPU and multithreading).
These original numbers used TensorFlow's GPU-based L-BFGS method, correct?
When I run this TensorFlow model on my laptop, with no GPU, it fits in about 19 seconds (23 iterations).
It's my understanding that since the data were generated in TensorFlow, there's no extra copying of data. Stan, on the other hand, is copying data around for this exercise.
In an attempt to better measure the time it took Stan to fit the model, without copying data around, I used CmdStan and inserted checkpoints like this one
I also matched TensorFlow's objective-function tolerance settings, though not all of them, because the two implementations don't have all the same convergence criteria. I also matched history_size, but this had less of an effect.
Under these settings, Stan pretty consistently measures about 40 seconds (23 iterations).
Stan uses doubles and TensorFlow uses floats. Could this make up a 20-second difference?
Yes, I am using TensorFlow with a GPU (an NVIDIA 1060, not particularly fast).
I did generate the data in TensorFlow on the GPU, but it gets copied to the CPU as a NumPy array when I make the dictionary stan_data, and I am not timing this part:
I suspect that Stan is doing a further copy of this data at the start of the optimizing call, or at least doing something that takes a long time before the iterations start moving smoothly (based on watching the output with verbose=True and refresh=1). I'm not sure whether that is avoidable in PyStan (or RStan).
For such big arrays I would expect float to be a lot faster (twice?) than double (you can fit twice as many floats in a SIMD register and in the CPU cache), but I don't think it is possible to do that with Stan right now. I'm not sure whether there are any plans to allow different floating-point arithmetic in Stan, but in some cases it could be quite useful.
You can choose among various floating-point types in TensorFlow; float (tf.float32) is by far the fastest on my GPU, though it is also the TensorFlow default. If I have time I'll try with double and run on my CPU later.
Maybe in PyStan. Data transfer can be expensive compared to L-BFGS in problems of this size if done poorly. We found that just in reading data into RStan when evaluating optimization problems at this scale, the I/O was dwarfing the time to fit with L-BFGS.
I followed up on Jeff Pollock's original blog post. Running in RStan on my 2012 MacBook Pro, it takes about 60s. (I turned off the Hessian and sampling.)
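(Roughly, that measurement amounts to something like the following, reusing the sm/stan_data names from the earlier sketch; only the optimizing call is timed, with the Hessian off.)

```r
# Time only the L-BFGS fit, Hessian computation disabled.
elapsed <- system.time(
  opt <- rstan::optimizing(sm, data = stan_data, algorithm = "LBFGS",
                           hessian = FALSE, as_vector = FALSE)
)["elapsed"]
elapsed
```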
I didn't try running in CmdStan, as that will almost certainly be I/O-dominated just getting the data in (250 predictors for 1M items is roughly 2GB with double-precision arithmetic). At least that's what I found last time I tried CmdStan.
You can try out the GPU-optimized bernoulli_logit_glm_lpmf if you don't mind using CmdStan with an experimental Stan Math:
Logistic regression is actually one of the toy examples in the paper. I’m very interested in how this compares to TensorFlow on a single GPU. Experiments show that Stan GLMs on the GPU are about 50x faster than not using the GLM primitives and running on the CPU.
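For concreteness, a sketch of the same model rewritten with the GLM primitive (shown here as an rstan model_code string; the GPU speedup itself requires building against the experimental Stan Math branch via CmdStan, as noted above):

```r
# Same logistic regression, using the GLM primitive instead of
# bernoulli_logit(alpha + X * beta).
glm_model_code <- "
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  int<lower=0, upper=1> y[N];
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  y ~ bernoulli_logit_glm(X, alpha, beta);
}
"
```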
@rok_cesnovar, @stevebronder, what's the ETA on all this being released in Stan Math? Will the GPU stuff be available through rstan at some point in the near future?
add the var template to matrix_cl (I think Steve is close to finishing that branch, then I hope a week of iterating with reviews if we are efficient)
add caching to matrix_cl with the double template type (the above two items should make this step easier; the last caching branch was too confusing and difficult to review, so Steve split it into three PRs)
add GPU GLMs one by one (these are ready and waiting, but need caching merged)
2.21 should definitely have GPU GLMs; estimating when they get merged to develop is a bit harder to do. So 2.21 should have all the features we promised at the last StanCon, plus GLMs and cov_exp_quad.
RStan 2.19 already has GPU support for cholesky_decompose (that was added in Stan Math 2.19).