Potential slowness in operands_and_partials

Do you guys want to post the current status on investigating the GLM slowness with operands_and_partials?

I have some related stuff - I have an old-ish branch where I started to refactor operands_and_partials to inherit from vari to avoid copies on build() (which seems to be called a lot). I also created an autodiff-arena-based std::allocator for std::vectors etc., but I am currently blanking on where that was going to be useful (I wanted to write it out anyway in case it spurs someone else’s memory). Here’s the half-finished branch: https://github.com/stan-dev/math/compare/feature/operands_partials_less_copies#diff-0d3582fe44ec7614fb0e6930ee26cad5R67

cc @Matthijs @rok_cesnovar

In the linear GLM (normal_id_glm_lpdf), around half of the running time is spent in assigning derivatives (of type Eigen::Matrix) to operands_and_partials.edgeX_.partials_. It is not the build() method that is slow, but the assignments.

Actually it is spent in constructing operands_and_partials struct.


This is a great insight! Sounds like it might be time to start using std::move (if only there were some way to do this automatically).

Can you write a little about how you figured this out? We could use a little more profiling knowledge. Thanks!

Most of my profiling knowledge consists of using the VS profiler, which does not work here because the VS compiler is not supported by stan::math.

So I just added a few timing outputs to the code to see which part takes too much time.

Also, I have reimplemented normal_id_glm_lpdf using precomputed_gradients() instead of operands_and_partials to return results. This resulted in 2x faster code. However, the code became quite a bit more complex this way, as I templated the input arguments. Of course this change becomes irrelevant if operands_and_partials is optimized.

In the resulting code, the check whether x is finite takes around half of the running time. So I replaced it with a check on one of the intermediate variables that depend on x. Since I left the checks on the other variables, this should be equivalent to the check on x. It results in another 2x speedup.

If you are interested in my code it is here: https://github.com/bstatcomp/math/blob/gpu_glm/stan/math/prim/mat/prob/normal_id_glm_lpdf.hpp

How did you measure time during assignment to partials_? The way Eigen works, many of these expressions (e.g. value_of(x).transpose() * mu_derivative) result in an expression template that does no work until it is assigned to a matrix. So if you did something like

auto result = value_of(x).transpose() * mu_derivative;  // lazy Eigen expression
auto start = std::chrono::steady_clock::now();
ops_partials.edge4_.partials_ = result;  // the deferred work happens here
std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
std::cout << elapsed.count() << " s" << std::endl;

Then that would delay all of the execution of the computation of result until the assignment, and it would seem like the assignment was slow.

Also did anyone try using std::move?

No, I did the calculations and assignments separately - as if I replaced auto result with Matrix<...> result in your code. However, I did not try std::move. I will try it now.

First, I just figured out I made a mistake in determining the bottleneck earlier. The actual bottleneck is constructing the operands_and_partials struct. Assignments are fine.

I also tried std::move on the arguments to the constructor, but it does not help.

I think the only heavy part of construction is allocating and zeroing the storage for the partials… Not sure there’s a way around that, though I can see how hand-allocating partials would be faster (especially if there are going to be straight-up assignments and not just partials_ +=ing). Food for thought.

Does anyone have a model and data I can use for benchmarking operands and partials? Is the normal GLM the best one? I think we could easily switch to Eigen maps (though I think there are probably 2 or 3 other opportunities for optimization and modernization).

On the theory that operands and partials construction is the main bottleneck, and that it is slow mostly because it allocates Eigen vectors and matrices on the heap for partials, I switched them to use Eigen::Maps on pointers allocated from the autodiff memory and std::vectors with a new std::allocator based on the autodiff arena, all in this branch: https://github.com/stan-dev/math/compare/perf/operands_and_partials_deux

Then I ran the cmdstan performance tests on the branch here and got what looks like a half-percent speedup at most, though I saw it go up to 1.5%. This is nothing close to the apparent speed gap, but I’m also not sure these models are representative of whatever was being measured above by @tadej and @Matthijs.

I’m just posting here so we can keep collective track of the issue as new evidence comes up and new designs and hypotheses are tried to fix the issue :)

Isn’t the main difference between ops and partials vs precomputed_gradients that the ops and partials approach forces the program to duplicate the partials in memory? With ops and partials we fill in the partials and then call build at the end. The build method can (from a design perspective) be called many times on the same ops and partials - we just never do that. Thus, the design enforces that the partials have to be copied, since the instance of ops and partials remains valid. In contrast, the precomputed gradients approach puts things directly on the AD stack.

Maybe we should change the ops and partials build method to a finalize method which moves the partials to the AD stack and at the same time makes the ops and partials instance unusable afterwards? The trick will probably be to make it clear to the compiler that after calling finalize, the ops and partials is invalidated… maybe a move operation for ops and partials will do that (however that is done).

Not sure if this post is useful, I hope it moves us further along. A 2x speedup in ops and partials would have a massive impact on Stan’s overall performance.

@seantalts: Did you manage to see the 2x speedup as well which was claimed earlier?

So the last optimization I tried was just changing how the internal Eigen and vector types were allocating their heap memory, from the standard allocator to autodiff memory arena memory (e.g. here). I think another option is to do what you’re suggesting and change it to a finalize method that returns a var based on its own internal memory (and tries to avoid a copy). There is a little bit of code headed in that direction in the first post, but I’m not sure it’s the right way forward, actually (so someone else should, as always, feel free to take a shot).

I’m not sure what specific code they were testing for the 2x speedup, so I haven’t reproduced it locally yet (though I’d like to).

If you want to reproduce the speedup you can checkout my branch and run the benchmark I have written https://github.com/bstatcomp/math/blob/gpu_glm/clGlm/glm_minitest.cpp
It runs several of my implementations. You are interested in the one labeled “library implementation”. Compare it between stan::math from my branch and develop.


Thanks! I will take a look at this and see if my branch changes things at all. I’d also be curious about benchmarks across a variety of data sizes - these 4x3 matrices are probably not indicative of the types of data where we have performance issues [edit: oops, I see there are some with 20000 x 1000 as well]

I profiled with Instruments and saw that most of the stuff showing up in the inverse call tree is related to isFinite() [profiler screenshot omitted].

So I specialized our implementation for Eigen types and got a 38% speedup on the benchmark (though of course it’d be even better to avoid doing some of that work if possible as @tadej suggested).

Made a separate PR just for this - looks like it might improve performance a lot in places where we can’t eliminate these checks.

Next up in my profiling was indeed the operands and partials constructor! Looks like it’s copying a matrix from another matrix. Maybe that’s the zeroing that’s happening…


Good catch & fix.

I think these GLMs are a place where we can eliminate ’em. But this’ll be good elsewhere.

I can’t for the life of me figure out where Eigen::Matrix<double, -1, -1, 0, -1, -1>::Matrix(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&) is getting called during operands_and_partials construction. Here is (I believe) the specialization in question:

template <int R, int C>
class ops_partials_edge<double, Eigen::Matrix<var, R, C> > {
  typedef Eigen::Matrix<var, R, C> Op;
  typedef Eigen::Matrix<double, R, C> partials_t;
  partials_t partials_;                       // For univariate use-cases
  broadcast_array<partials_t> partials_vec_;  // For multivariate
  explicit ops_partials_edge(const Op& ops)
      : partials_(partials_t::Zero(ops.rows(), ops.cols())),
        operands_(ops) {}
  // ... (operands_ member and the rest of the class elided)

I even replaced that partials_t with an Eigen::Map pointing to autodiff memory and I still get the above call. Anyone have ideas?

At this point it might be worth noting that this part of the call is taking 6% of the total time for this benchmark (while 60% goes to check_finite, 10% to an Eigen matrix operation, and 11% to allocating the random matrices in the first place). So 6% is pretty major in that context, I guess.

@tadej Are you planning to put some of your stuff into a PR? Especially interested in sagacious removal of unnecessary checks.

I think Zero() creates a new matrix which is then used to construct partials_. The call you are looking at is the construction of partials_.

@seantalts By the way which optimization level did you use for profiling? I used -O3.

I can make a PR. What should I put in it? Should I remove just the most performance intensive checks? Or just all of them?


OMG I found it. So embarrassing. This must be leftover from a previous design that never made it in:

--- a/stan/math/prim/mat/meta/operands_and_partials.hpp
+++ b/stan/math/prim/mat/meta/operands_and_partials.hpp
@@ -22,7 +22,7 @@ class ops_partials_edge<ViewElt, Eigen::Matrix<Op, R, C>> {
   partials_t partials_;
   empty_broadcast_array<partials_t, Eigen::Matrix<Op, R, C>> partials_vec_;
   ops_partials_edge() {}
-  explicit ops_partials_edge(const Eigen::Matrix<Op, R, C> ops) {}
+  explicit ops_partials_edge(const Eigen::Matrix<Op, R, C>& ops) {}
   template <typename, typename, typename, typename, typename, typename>
@@ -40,7 +40,7 @@ class ops_partials_edge<ViewElt, std::vector<Eigen::Matrix<Op, R, C>>> {
   typedef empty_broadcast_array<ViewElt, Eigen::Matrix<Op, R, C>> partials_t;
   empty_broadcast_array<partials_t, Eigen::Matrix<Op, R, C>> partials_vec_;
   ops_partials_edge() {}
-  explicit ops_partials_edge(const std::vector<Eigen::Matrix<Op, R, C>> ops) {}
+  explicit ops_partials_edge(const std::vector<Eigen::Matrix<Op, R, C>>& ops) {}
   template <typename, typename, typename, typename, typename, typename>
@@ -60,7 +60,7 @@ class ops_partials_edge<ViewElt, std::vector<std::vector<Op>>> {
   partials_t partials_;
   empty_broadcast_array<partials_t, std::vector<std::vector<Op>>> partials_vec_;
   ops_partials_edge() {}
-  explicit ops_partials_edge(const std::vector<std::vector<Op>> ops) {}
+  explicit ops_partials_edge(const std::vector<std::vector<Op>>& ops) {}

This change removes operands_and_partials constructor from my list, at least. I’m also using -O3, though I was also experimenting with -g -fomit-frame-pointer at one point.

@tadej It does create a new matrix, but the compiler essentially creates that in partials_, see https://isocpp.org/wiki/faq/ctors#init-lists

@bbbales2 Which checks can we remove from the GLMs? I wasn’t totally following.

[edit - with this change I now see @tadej’s normal GLM benchmark running in about 56% of the time it takes on `develop` - close to recovering that 2x, though it seems like maybe there is another 2x in there?]