Dot_product vs vectorization

Hi all,

It’s my first post here, but I hope you will find it somewhat valuable. I’ve recently been trying to optimize a model that performs a lot of dot_product operations:

transformed parameters {
  ...
  for (n in 1:N)
    beta_x[n] = dot_product(beta[conversion_type[n], vertical[n]], x[n]);
}

model {
  y ~ student_t(nu, beta_x, sigma);
}

and noticed how expensive the dot_product operation is in Stan. Profiling it with perf and CmdStan, I got the following results:

+   22.33%  R        file26bf7e2571d5fa.so  [.] stan::math::dot_product<stan::math::var, -1, 1, double, -1, 1> 
+   11.76%  R        libjemalloc.so.2       [.] malloc 
+   11.43%  R        file26bf7e2571d5fa.so  [.] stan::math::check_range 

Being quite surprised that dot_product calls malloc, I dug into the code and found that what’s really multiplied is an Eigen::Matrix<stan::math::var, -1, 1> times an Eigen::VectorXd, and that the former is copied into the latter type for every dot_product call.
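To illustrate, here’s a minimal sketch of what each call effectively does (my own names, not the actual Stan Math source; the header path depends on the Stan Math version):

#include <stan/math/rev/mat.hpp>

// Hypothetical sketch of the per-call copy: extract the var vector's
// values into a freshly allocated double vector, then run Eigen's
// double-only kernel.
double dot_values(const Eigen::Matrix<stan::math::var, -1, 1>& v,
                  const Eigen::VectorXd& x) {
  Eigen::VectorXd vals(v.size());   // heap allocation on every call
  for (int i = 0; i < v.size(); ++i)
    vals(i) = v(i).val();           // pointer chase through each vari
  return vals.dot(x);               // fast double * double dot product
}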

It seems that a huge performance improvement could be achieved by storing a vector of stan::math::var as Eigen::VectorXd’s of values and derivatives (especially given that TLS access is slower in dynamically linked binaries, so rewriting vectors in RStan is even slower than in CmdStan).
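Concretely, the layout I have in mind would be something like this (hypothetical type, nothing that exists in Stan Math today):

#include <Eigen/Dense>

// Hypothetical struct-of-arrays representation: values and adjoints live
// in contiguous double storage, so whole-vector ops need no per-call copy.
struct soa_var_vector {
  Eigen::VectorXd val;  // values
  Eigen::VectorXd adj;  // adjoints (derivatives)
};

// dot_product against data reduces to Eigen's double kernel directly:
double dot(const soa_var_vector& v, const Eigen::VectorXd& x) {
  return v.val.dot(x);  // no copy, no malloc
}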

Has anyone ever looked into it? I’ll be happy to look into it more deeply, but if you already know that it’s impossible to implement, I won’t need to waste my time trying.


Welcome to the Stan forums!

Rule one is to post enough of your model so we don’t have to guess what the types of things are. Presumably beta is something like a two-dimensional array of vectors and x is either a matrix or something else?

There’s a chapter in the manual on representations to reduce copying. Making x an array of vectors, for example, is much more efficient than making it a matrix.
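For example, with these two declarations the array element x[n] is already a vector, whereas the matrix version has to copy row n out into a new vector on every access (a sketch, using your variable names):

data {
  vector[C] x[N];    // array of vectors: x[n] is handed over as a unit
  // matrix[N, C] x; // matrix: x[n] copies row n into a new vector
}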

Alternatively, rather than doing a sequence of N dot products, you can combine it all into a single `rows_dot_product` operation by constructing the appropriate matrices first.
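Something along these lines, guessing at your types and dimensions (a sketch, untested):

transformed parameters {
  matrix[N, C] B;
  matrix[N, C] X;
  vector[N] beta_x;
  for (n in 1:N) {
    B[n] = beta[conversion_type[n], vertical[n]]';  // coefficients for row n
    X[n] = x[n]';
  }
  beta_x = rows_dot_product(B, X);  // all N dot products in one call
}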

We know it’s very challenging. As far as I know, nobody’s figured this out for any autodiff system. The problem is how to also implement the indexing operation. It’s fine if you only ever access vectors or matrices holistically.
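To see why, consider a struct-of-arrays layout like the one sketched above: element access has to return something whose adjoint still aliases the container’s storage, and then every downstream operation has to accept those views as well as ordinary vars (hypothetical types again):

#include <Eigen/Dense>

// Hypothetical view type illustrating the indexing problem.
struct var_view {
  double* val;  // points into the container's value storage
  double* adj;  // points into the container's adjoint storage
};

struct soa_var_vector {
  Eigen::VectorXd val, adj;
  // Whole-vector ops stay cheap, but v[i] can no longer be a plain var:
  // a copy would disconnect its adjoint from adj(i).
  var_view operator[](int i) { return {&val(i), &adj(i)}; }
};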

This is very misleading. If you instrument directly, you’ll find those range checks don’t consume nearly the amount of time they appear to here. Profiling has the problem of interrupting highly optimized operations and adding timing artifacts.

Rule one is to post enough of your model so we don’t have to guess what the types of things are.

Sorry, my bad. I think this is the minimal full Stan program that illustrates the problem:

data {
  int<lower=0> N;
  int<lower=0> T;
  int<lower=0> V;
  int<lower=0> C;
  vector[N] y;
  simplex[C] x[N];
  int<lower=1, upper=T> c[N];
  int<lower=1, upper=V> v[N];
}

parameters {
  vector<lower=0, upper=1>[C] beta_t[T];
  vector<lower=0, upper=1>[C] beta_v[V];
  real<lower=0> nu;
  real<lower=0.01> sigma; 
}
transformed parameters {
  vector[C] beta[T, V];
  vector[N] beta_x;

  for (t in 1:T)
    for (v in 1:V)
      beta[t, v] = beta_t[t] + beta_v[v];

  for (n in 1:N)
    beta_x[n] = dot_product(beta[c[n], v[n]], x[n]);
}

model {
  y ~ student_t(nu, beta_x, sigma);
}

For context: C is much smaller than N, which also plays a role here.

There’s a chapter in the manual on representations to reduce copying. Making x an array of vectors, for example, is much more efficient than making it a matrix.

The copying happens because Stan vectors are represented as Eigen::Matrix<stan::math::var, -1, 1> in C++, and in order to perform dot_product the Stan Math library copies the values out into an Eigen::VectorXd and then calls Eigen’s .dot() function.

Alternatively, rather than doing a sequence of N dot products, you can combine it all into a single `rows_dot_product` operation by constructing the appropriate matrices first.

I tried this approach too, but Stan’s rows_dot_product compiles to:

  // rows_dot_product still allocates one dot_product_vari per row:
  for (size_type j = 0; j < v1.rows(); ++j) {
    ret(j) = var(new dot_product_vari<T1, T2>(v1.row(j), v2.row(j)));
  }

so there can be no performance gain from using it.

We know it’s very challenging. As far as I know, nobody’s figured this out for any autodiff system.

This was really the core of my question. If storing vector values as vectors of doubles can’t be done successfully, it might still be possible to speed up dot_product quite significantly by pre-allocating a buffer for those copies: currently, dot_product spends most of its time waiting on malloc.
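For instance (a sketch, not a patch against the actual Stan Math source):

#include <stan/math/rev/mat.hpp>

// Hypothetical buffer reuse: one thread-local scratch vector that only
// grows, so repeated dot_product calls stop paying for malloc.
double dot_with_scratch(const Eigen::Matrix<stan::math::var, -1, 1>& v,
                        const Eigen::VectorXd& x) {
  static thread_local Eigen::VectorXd scratch;
  if (scratch.size() < v.size())
    scratch.resize(v.size());            // allocates only when growing
  for (int i = 0; i < v.size(); ++i)
    scratch(i) = v(i).val();
  return scratch.head(v.size()).dot(x);  // Eigen kernel over the buffer
}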

This is very misleading. If you instrument directly, you’ll find those range checks don’t consume nearly the amount of time they appear to here. Profiling has the problem of interrupting highly optimized operations and adding timing artifacts.

No, I don’t think that’s how perf works; these measurements are very accurate. The range checks do indeed consume a lot of time (removing them manually from the C++ code gives the expected 1.1x speedup), for the reason I already mentioned: there’s a lot of random memory access happening there.

Yup. We have to do that to pack and unpack autodiff variables into Eigen.

One thing we could do is use our own memory pool rather than calling out to malloc. We’re gradually rewriting some of these algorithms to be better behaved in that way. But it’s slow going given other priorities.

Thanks for clarifying—I didn’t realize it was lots of random memory access. I would’ve expected bounds checks to be memory local, at least with matrix or vector types.

We’ve found a lot of hard-to-identify artifacts when profiling the performance of Stan code, especially around bounds checking.

Is there anything I could do to help get it done sooner? Where can I find an example of an algorithm using that pool?

Minimal working examples help out a lot! If you can post data along with this, it would help someone validate things like the range checks consuming a lot of time, and then we can do something about it. That’s straightforward enough for someone to tackle a bit sooner.

There’s the description in the arXiv paper on Stan’s autodiff. That’s the best thing to read about how things are done now.

The pool is in the memory directory and the simplest use is in stan/math/rev/core/precomputed_gradients.hpp. It gets collected all at once when the reverse pass is done to compute gradients (basically, that’s just a constant pointer reset—it’s done this way to avoid having to call any destructors—so don’t put anything using malloc into a vari implementation).

You can then use Eigen::Map on top of raw data, which is pretty efficient. Or you can just access the memory directly if you don’t need matrix ops.
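Putting those two together, usage looks roughly like this; the exact accessor for the allocator has moved around between Stan Math versions, so treat the names as approximate:

#include <stan/math/rev/mat.hpp>

// Sketch: allocate scratch doubles from the arena (the pool in the memory
// directory), then wrap them in an Eigen::Map. The arena is reclaimed
// wholesale by recover_memory() with no destructor calls, so store only
// plain data here.
void arena_dot_example(const Eigen::VectorXd& x) {
  const Eigen::Index n = x.size();
  double* scratch
      = stan::math::ChainableStack::memalloc_.alloc_array<double>(n);
  Eigen::Map<Eigen::VectorXd> vals(scratch, n);
  vals.setOnes();               // fill with real values in practice
  double result = vals.dot(x);  // Eigen op straight over arena memory
  (void)result;                 // silence unused-variable warnings
}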