Performance Graphs for New GLM Primitives

Hi all,

The new GLM primitives have been merged into the math library. To give you an idea of their performance, I’ve done some testing on my laptop (4GB memory, 2 cores). These graphs are showing the performance increase on the mean time of 30 gradient computations (in C++). In all cases, the new GLM primitives are compared to the current fastest way of writing a GLM using existing primitives. All inputs are given as parameters, except for the matrix of covariates, which is specified as data. (Although the primitives allow this to be a parameter as well, in which case the performance increase should be much larger still.)


Observe that performance is not hugely dependent on the size of the data set, but increases pretty rapidly if we have models with more parameters. (At the high end of the parameter sizes I tested, my laptop started to run out of memory.)


It would be awesome to have this as a benchmark applied to each release…

Super exciting. Thanks for redoing as line graphs and publishing here.

What you’d really want to do if we were going to try to publish this is replicate a bunch of data sets and plot averages to try to smooth things out.

Do you need help getting these into the Stan language? It should just be one simple pull request to stan-dev/stan.

P.S. Base 3 is an unusual choice—I’d have gone with 4 or 10 just so I could understand the data sizes on the X axis (not that I’m asking you to re-do these—the plots are great for what we need, which is a rough idea of speedup).

Hi Bob! Glad you’re happy with them. I did the graphs very quickly, just to get a rough idea. For a case study, I’d do something a bit nicer. I’m sure I can get them into the Stan language this week. If I get completely confused, I’ll ask. :-)

For those who are interested, @tadej , @Erik_Strumbelj , @seantalts and I have been working a bit more on the GLMs and have performed some end-to-end benchmarks, comparing the new GLMs (named lr_glm in the graphs; 2 slightly different versions for linear regression) to the old fastest Stan implementation without the GLM primitives (named lr_basic in the graphs).
We have fit the regression coefficient and intercept and, in case of neg binomial and linear regression, the scale parameters on dataset that was randomly generated from a corresponding GLM with a fixed value for those 2-3 parameters, using NUTS with a fixed number of iterations.
The computation time is plotted for 10, 100 and 500 parameters for a range of the size n of the dataset.
(Apologies for the ridiculous regression lines fit through the points. I am new to ggplot.)

Basically, the lesson seems to be that the new GLMs become more worthwhile if one has a larger number of parameters.
We hope that this could be complementary to the data-parallelism that we could exploit using a future GPU implementation.

linear 10 - 100 - 500 param.pdf (28.0 KB)
logistic - Rplots - 10 - 100 - 500 param.pdf (18.3 KB)
negative binomial - Rplots - 10 - 100 - 500 param.pdf (18.3 KB)
poisson - Rplots - 10 - 100 - 500 param.pdf (23.7 KB)


Thanks @Matthijs!

If you have too few observations (n) per feature (k), inference fails and algorithms look very fast (that’s why the first couple of measurements are weird). I’d suggest starting with n = 500.

A couple of suggestions regarding the plots:

  • Plotting the k = 10, 100, and 100 side by side would make plots easier to read. Also, marking which plot is which k would help.
  • No point in fitting a lasso smooth if you have only a few points (linear would do, or even without).
  • Don’t sacrifice half the plot area for the legend (move it to the bottom).

Thanks, @Erik_Strumbelj! You’re right of course about the plots. Will do a better job when we are producing more official plots later.

I’ve made some plots for varying k (both on log scale and linear scale) for fixed n=500. It looks like the speed up only really kicks in at k=50 there.

linear - Rplots - 500data 2.pdf (9.3 KB)
linear - Rplots - 500data.pdf (9.3 KB)
logistic - Rplots - 500 data 2.pdf (8.1 KB)
logistic - Rplots - 500 data.pdf (8.0 KB)
negative binomial - Rplots - 500data 2.pdf (8.1 KB)
negative binomial - Rplots - 500data.pdf (8.1 KB)
poisson - Rplots - 500data 2.pdf (8.1 KB)
poisson - Rplots - 500data.pdf (8.2 KB)

It looks like the speedups can get huge though for large k.

could you run also with n=10K and n=100K?

OK, @avehtari ! And what value for k? (number of param)
Also, do you want the scale to be a parameter or data?

Any k is fine, but it’s just that n=2K is not that much for Stan, so I was assuming these speed-up should also show with a larger n. Good posterior inference is needed when k is also high relative to n, but I think it would be useful to test with something like k=10, 100, 500 in the beginning.

@avehtari, I ran some experiments for n=10K and k=200,400,600,800,1000 and the results were as follows, for (linear, logistic, negative binomial, poisson) regressions respectively.

I will try to run experiments for n=100K, but I’m doing them on my laptop, so this is very time intensive. @Erik_Strumbelj is getting a dedicated machine to do this sort of testing. Then, we’ll be able to make a more systematic assessment for a whole grid of n,k values. This will be especially interesting once there is a GPU implementation of the GLMs.

linear regression

logistic regression

negative binomial regression

poisson regression

Hmm. I am now realising that the figures above are underreporting the speedup as the sampling process wasn’t the only thing that was timed. We’ll have to wait for a better benchmark that will take a longer time to run. Overall though, the speedups are looking great. If anything, they are better with a lot of data.

EDIT: here’s a graph for the linear regression on n = 100 000, which should give you some idea. It took a long time to run. Hopefully, I’ll manage to produce one for the other GLMs as well at some point.


Looks super great! Negbin has strange bump up for both appraoches for k=800, and n=1e5 has strange bump at k=200, but otherwise clear results. Maybe these bumps are due to different number of log-density evaluations in sampling= It seems that basic approach has nonlinear growth, which I guess is due to increasing number of cache misses. Could you also measure the memory usage?

@avehtari Currently, the R scripts call cmdstan and I’m not aware of any reliable way of measuring memory usage in this case. If anyone has an idea, I’d be happy to implement it.

The unusual jumps in the graphs might just be due to dataset variability (my guess is due to the time it takes to find valid starting values for the parameters; it doesn’t help that the models have default improper priors :) ). There seems to be a lot of variability between datasets with same n and k and not at all a lot of variability re-running on the exact same dataset. Next version of the end-to-end tests scripts will capture that between-datasets variability (the current results just repeat on the same dataset for each n-k pair).

So what the unreliable ways you are aware? Just asking to not waste time suggesting any of those?
For example, is Valgrind one of those unreliable ones?

Sure, that’s what I meant by “due to different number of log-density evaluations in sampling” as different datasets may lead to different adaptation and smaller step size will lead to bigger trees with more log density evaluations.

Probably would be better to use sensible weakly informative priors

Unreliable was maybe the wrong word to use. I am not aware of any more elegant ways than using an external memory profiler and running it on the Stan model executable. So, for memory profiling, the user must have something like Valgrind. Also, the end-to-end test needs to be run twice (timing results will not be valid with the profiler on).

@avehtari, have you used Valgrind before? That is, would you recommend it?

Everything should be conditioned on the algorithms working.

Are you including proper priors? With priors, what I found is that the slow cases were ones where the number of data points is roughly equal to the number of predictors

If you save .png or .jpg, you can just paste them into the post.

Log scale on the y axis is really helpful because it’s easy to read off the speedups.

I think “unreliable” is OK—“wrong” would be better. For instance, you can’t just grab the system profiler on the Mac to get a tight idea of resources needed by a process.

The only memory worth considering is the autodiff memory and that can be measured internally in Stan if you’re in C++ using some functions exposed in stan/memory/stack_alloc.hpp — it measures the amount of memory we use for autodiff. We’ll actually allocate more than we use because we use an exponentially increasing block scheme. It would also be easy to add a new function to measure overall memory usage in the arena.

We ratchet up that memory and never free it until after sampling, so the actual memory pressure is from filling and clearing our arena as well as all the little objects like std::vector and Eigen::Matrix we use on the function call stack (because of the RAII pattern, there are internal mallocs even though the C++ client doesn’t see them).