Any way to make Stan competitive with TensorFlow for maximum likelihood?

I have been playing with some toy examples in Stan and TensorFlow and was wondering if there is a way I could speed up the Stan version I have posted here:

It is a simple logistic regression with 1 million observations and 250 predictors; on my laptop, TensorFlow runs it in about 7 seconds and Stan takes about 100.

Any comments would be really appreciated!

What is the difference if you don't include the compilation of the Stan model?

Hi @rok_cesnovar. The timing already excludes compilation of the Stan model; the compilation is done in:

stan_model = pystan.StanModel(stan_file, extra_compile_args=["-O3", "-march=native"])

I just double checked by running htop and looking for the cc1plus process.

I am only timing the optimizing part:

import time as tm

start = tm.time()
stan_mle = stan_model.optimizing(stan_data, init="0")
end = tm.time()

I have found the Stan BFGS optimizer good in certain situations, but running large models with many parameters recently, I switched to ucminf (an R optimizer) and cut the convergence time to roughly a tenth. A hacked-up, not-yet-robust stochastic gradient descent approach has reduced it still further.


Can you check how many function evaluations each optimization makes?

Since ucminf also uses BFGS (CRAN - Package ucminf), do you know where the difference comes from? A different line search or just a different stopping rule? Stan's BFGS defaults for the various tolerances are not that good, so you may get faster results by changing them.

I have experimented with the control parameters somewhat, though hardly a robust comparison. It's not just a tolerance thing, though. ucminf combines BFGS and trust-region approaches, so it's not a 'regular' BFGS; it was much better (on the limited number of higher-dimensional problems I tried) than other R BFGS optimizers.

I'm now being picky about the wording: BFGS refers to the update rule for the inverse Hessian, so it seems ucminf is regular BFGS with a trust-region type of monitoring of the line search. It helps if we can recognize where the performance differences come from: is it the update rule (there are others, but I'm still assuming ucminf uses regular BFGS) or is it the line search (ucminf has a better line search, and the trust-region approach is part of that, but there can be other modifications, too)? My guess is that the ucminf implementation works better especially in the initial phase, when the quadratic approximation is bad, and that it uses fewer function evaluations in the line search at that stage. In my experience, Stan's BFGS spends most of its function evaluations on the first two line searches. It would be interesting to see a plot of function values vs. iterations comparing these optimization algorithms. Stan also uses the limited-memory version (L-BFGS), which may affect convergence speed for high-dimensional problems; you can also try increasing the memory limit, but I guess that would not be enough to explain a 10-fold difference in the number of function evaluations.

Fair point re terminology, I'm a bit sloppy there :) And yes, I already had the Stan BFGS memory limit up at 50 or 100; you're right that it helped. I was going to show the comparison you mentioned, but it seems that save_iterations=TRUE does not do anything, in RStan at least.

stano <<- optimizing(sm, data=standata, init=0, algorithm='LBFGS', verbose=TRUE, save_iterations=TRUE, tol_rel_grad=0, tol_grad=0, tol_param=0, history_size=50, sample_file='stano.txt')

Here is some information about the number of iterations and function evaluations:

>>> stan_mle = stan_model.optimizing(stan_data, init="0", verbose=True, refresh=1)
Initial log joint probability = -693147
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
       1       -153067       398.553       8135.91       0.001       0.001        2   
       2       -134589       2.75453       5320.96      0.3396      0.3396        3   
       3       -120848       5.50538       1094.32           1           1        4   
       4       -120278      0.922224       350.409      0.9502      0.9502        5   
       5       -120154      0.445196       310.868           1           1        6   
       6       -119800       1.65806       553.019           1           1        7   
       7       -118957       4.55923       1122.38           1           1        8   
       8       -116593       13.8279       2205.73           1           1        9   
       9       -110364       38.0724       4064.61           1           1       10   
      10      -96659.6       95.6122       8271.67           1           1       11   
      11      -72972.9       125.326       14710.6           1           1       12   
      12      -48860.9       78.3897       20054.9      0.3134           1       15   
      13      -34399.5       12.6584       8131.33      0.1609      0.2987       17   
      14      -29751.8       10.3422       4594.59      0.1606      0.1606       18   
      15      -29040.7         1.035       1773.42      0.3283           1       20   
      16      -28746.1       1.40649       1296.17      0.3311      0.3311       21   
      17      -28163.3       5.90758       472.261           1           1       22   
      18      -28136.4      0.614142       286.503      0.2512           1       24   
      19      -28125.8      0.373414       224.702      0.2534      0.2534       25   
      20      -28107.3       1.53199       129.288           1           1       26   
      21      -28106.4     0.0614243       11.4379      0.1723           1       28   
      22      -28106.4   0.000643128       9.62061       0.174       0.174       29   
      23      -28106.4    0.00281231       0.10144           1           1       30   
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance

When I was doing this I noticed that the first iteration takes a lot longer; could this be due to copying all the data? If you discount that, Stan would finish well under 100 seconds, but I'm not sure how to avoid it.

With the tensorflow implementation:

>>> mle.num_objective_evaluations
<tf.Tensor: id=25332, shape=(), dtype=int32, numpy=106>
>>> mle.num_iterations
<tf.Tensor: id=25331, shape=(), dtype=int32, numpy=27>

I can also see that (as you said) the arguments/defaults are not quite the same in the two implementations; compare the Stan and TensorFlow settings.

Is there a way to expose the Stan objective function and autodiff gradients to R or Python? That way I could try other optimizers like ucminf very easily. A quick Google search didn't reveal anything, but I think it would be a really useful thing to have.

Thanks. It seems that TensorFlow is computing about 3 times more evaluations, but much faster, which is no surprise as it's optimized for this kind of task. The difference could be just due to multithreading. There are pull requests for some speedups in Stan, too (in addition to GPU and multithreading).

But not more than 50s?

Yes. You can find an example for R in the rstan documentation for the log_prob and grad_log_prob functions (log_prob-methods), and one for Python at https://github.com/MichaelRiis/Python-ADVI/blob/master/Eightschool.ipynb
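To make that concrete, here is a minimal, untested sketch (my own, not code from the thread) of how these RStan functions could be handed to ucminf; sm and standata stand for the compiled model and data list used elsewhere in the thread, and all settings are illustrative:

library(rstan)
library(ucminf)

# A very short sampling run just to construct a stanfit object so that
# log_prob() and grad_log_prob() are exposed on the unconstrained space.
fit <- sampling(sm, data = standata, chains = 1, iter = 10, refresh = 0)
n_upars <- get_num_upars(fit)

# ucminf minimises, so negate the log density and its gradient;
# adjust_transform = FALSE matches what optimizing() maximises (no Jacobian).
neg_lp   <- function(u) -log_prob(fit, u, adjust_transform = FALSE)
neg_grad <- function(u) -grad_log_prob(fit, u, adjust_transform = FALSE)

opt <- ucminf(par = rep(0, n_upars), fn = neg_lp, gr = neg_grad,
              control = list(maxeval = 2000))

# Map the unconstrained optimum back to the constrained parameter space.
mle <- constrain_pars(fit, opt$par)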

@avehtari thanks for the links, I didn’t know about those rstan functions. I’ll try using those with ucminf when I have time and report back.

These original numbers used TensorFlow's GPU-based L-BFGS method, correct?

When I run this TensorFlow model on my laptop, with no GPU, it fits in about 19 seconds (23 iterations).

It's my understanding that since the data were generated in TensorFlow, there's no extra copying of data. Stan, on the other hand, is copying data around for this exercise.

In an attempt to better measure the time it took Stan to fit the model, without copying data around, I used CmdStan and inserted checkpoints like this one:

std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();

I also matched TensorFlow's objective-function tolerance settings, though not all of them, because the two implementations don't share all the same convergence criteria. I also matched history_size, but this had less of an effect.

Under this setting Stan measures pretty consistently about 40 seconds (23 iterations).

Stan uses doubles and TensorFlow uses floats. Could this make up a 20-second difference?


Hi @roualdes.

Yes, I am using TensorFlow with a GPU (an NVIDIA 1060, not particularly fast).

I did generate the data in TensorFlow on the GPU, but it gets copied to the CPU as a NumPy array when I make the dictionary stan_data, and I am not timing this part:

stan_data = {"N": N, "P": P, "x": x.numpy(), "y": y[:, 0].numpy()}

I suspect that Stan is doing a further copy of this data at the start of the optimizing call, or at least it is doing something that takes a long time before the iterations start moving smoothly (based on watching the output with verbose=True and refresh=1). I'm not sure if that is avoidable in PyStan (or RStan).

For such big arrays I would expect float to be a lot faster (twice?) than double, since you can fit twice as many floats in a SIMD register and in the CPU cache, but I don't think that is possible with Stan right now. I'm not sure if there are any plans to allow different floating-point arithmetic in Stan, but in some cases it could be quite useful.
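For a rough sense of scale (my own back-of-the-envelope arithmetic, not a measurement from the thread), the design matrix alone is about twice as large in double precision, which matters if the computation is memory-bandwidth bound:

N <- 1e6; P <- 250
N * P * 8 / 2^30   # ~1.86 GiB for the design matrix in double precision
N * P * 4 / 2^30   # ~0.93 GiB in single precision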

You can choose various floating-point types in TensorFlow; float (tf.float32) is by far the fastest on my GPU and is the TensorFlow default. If I have time I'll try with double and run on my CPU later.

Maybe in PyStan. Data transfer can be expensive compared to L-BFGS in problems of this size if done poorly. We found that when evaluating optimization problems at this scale, just reading the data into RStan (the I/O) was dwarfing the time to fit with L-BFGS.

I followed up on Jeff Pollock's original blog post. Running in RStan on my 2012 MacBook Pro, it takes about 60s. (I turned off the Hessian and sampling.)

I didn't try running in CmdStan, as that would almost certainly be I/O-dominated just getting the data in (250 predictors for 1M items is roughly 1GB with double-precision arithmetic). At least that's what I found last time I tried CmdStan.

Hi @Bob_Carpenter. Thanks for all the comments - really helpful.

I've updated the post and run the example using RStan; it finishes in about 30 seconds on my laptop!

Looking forward to running this again with bernoulli_logit_glm, map_rect, and the GPU stuff once it is available!
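For reference, here is a rough sketch of what such an RStan run could look like; this is my own reconstruction for illustration (the exact program, data generation, and settings are in the updated post), with the not-yet-released GLM primitive shown only as a comment:

library(rstan)

# Illustrative data simulation; the original post generates the data in
# TensorFlow with the same shapes (N = 1e6, P = 250), so this step is large.
N <- 1e6; P <- 250
x <- matrix(rnorm(N * P), N, P)
beta_true <- rnorm(P, 0, 0.1)
y <- rbinom(N, 1, plogis(as.vector(x %*% beta_true)))

model_code <- "
data {
  int<lower=0> N;
  int<lower=0> P;
  matrix[N, P] x;
  int<lower=0, upper=1> y[N];
}
parameters {
  real alpha;
  vector[P] beta;
}
model {
  y ~ bernoulli_logit(alpha + x * beta);
  // with the GLM primitive this becomes a single fused call:
  // y ~ bernoulli_logit_glm(x, alpha, beta);
}
"

sm <- stan_model(model_code = model_code)
standata <- list(N = N, P = P, x = x, y = y)
system.time(stan_mle <- optimizing(sm, data = standata, init = 0, algorithm = "LBFGS"))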


We talked about this in Stan, but it would be kind of a tedious rewrite.

You can try out the GPU-optimized bernoulli_logit_glm_lpmf if you don't mind using CmdStan with an experimental Stan Math:

Logistic regression is actually one of the toy examples in the paper. I’m very interested in how this compares to TensorFlow on a single GPU. Experiments show that Stan GLMs on the GPU are about 50x faster than not using the GLM primitives and running on the CPU.

@rok_cesnovar, @stevebronder, what's the ETA on all this being released in Stan Math? And will the GPU stuff at some point in the near future be available through RStan?

I have been historically bad at predicting the ETA for all the GPU features so I will instead list what is left to be merged:

  • add a double template to matrix_cl (see https://github.com/stan-dev/math/pull/1281, done, reviewed & approved, waiting due to Jenkins Windows issues)
  • add the var template to matrix_cl (I think Steve is close to finishing that branch; then hopefully a week of iterating on reviews if we are efficient)
  • add caching to matrix_cl with the double template type (the above two lines should make this step easier, the last caching branch was too confusing and difficult to review so Steve split this in three PRs)
  • add GPU GLMs one by one (these are ready and waiting, but need caching merged)

2.21 should definitely have GPU GLMs; estimating when they get merged to develop is harder. So 2.21 should have all the features we promised at the last StanCon, plus GLMs and cov_exp_quad.

RStan 2.19 already has GPU support for cholesky_decompose (that was added in Stan Math 2.19).


This is pretty much done, apart from some of the copy functions.
