Dot_product vs vectorization

Hi all,

It’s my first post here, but I hope you will find it somewhat valuable. I’ve recently been trying to optimize a model that performs a lot of dot_product operations:

transformed parameters {
  ...
  for (n in 1:N)
    beta_x[n] = dot_product(beta[conversion_type[n], vertical[n]], x[n]);
}

model {
  y ~ student_t(nu, beta_x, sigma);
}

and noticed how expensive the dot_product operation is in Stan. Profiling it with perf and CmdStan, I got the following results:

+   22.33%  R        file26bf7e2571d5fa.so  [.] stan::math::dot_product<stan::math::var, -1, 1, double, -1, 1> 
+   11.76%  R        libjemalloc.so.2       [.] malloc 
+   11.43%  R        file26bf7e2571d5fa.so  [.] stan::math::check_range 

Being quite surprised that dot_product calls malloc, I dug into the code and found that what’s really multiplied is an Eigen::Matrix<stan::math::var, -1, 1> times an Eigen::VectorXd, and that the former is copied into the latter type for every dot_product call.
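To illustrate, here’s a minimal sketch of what each call effectively does (my own names, not the actual Stan Math source; the header path depends on the Stan Math version):

#include <stan/math/rev/mat.hpp>

// Hypothetical sketch of the per-call copy: extract the var vector's
// values into a freshly allocated double vector, then run Eigen's
// double-only kernel.
double dot_values(const Eigen::Matrix<stan::math::var, -1, 1>& v,
                  const Eigen::VectorXd& x) {
  Eigen::VectorXd vals(v.size());   // heap allocation on every call
  for (int i = 0; i < v.size(); ++i)
    vals(i) = v(i).val();           // pointer chase through each vari
  return vals.dot(x);               // fast double * double dot product
}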

It seems that a huge performance improvement could be achieved by storing a vector of stan::math::var as Eigen::VectorXd’s of values and derivatives (especially given that TLS access is slower in dynamically linked binaries, so rewriting vectors in RStan is even slower than in CmdStan).
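Concretely, the layout I have in mind would be something like this (hypothetical type, nothing that exists in Stan Math today):

#include <Eigen/Dense>

// Hypothetical struct-of-arrays representation: values and adjoints live
// in contiguous double storage, so whole-vector ops need no per-call copy.
struct soa_var_vector {
  Eigen::VectorXd val;  // values
  Eigen::VectorXd adj;  // adjoints (derivatives)
};

// dot_product against data reduces to Eigen's double kernel directly:
double dot(const soa_var_vector& v, const Eigen::VectorXd& x) {
  return v.val.dot(x);  // no copy, no malloc
}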

Has anyone ever looked into it? I’ll be happy to look into it more deeply, but if you already know that it’s impossible to implement, I won’t need to waste my time trying.


Welcome to the Stan forums!

Rule one is to post enough of your model so we don’t have to guess what the types of things are. Presumably beta is something like a two-dimensional array of vectors and x is either a matrix or something else?

There’s a chapter in the manual on representations to reduce copying. Making x an array of vectors, for example, is much more efficient than making it a matrix.
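For example, with these two declarations the array element x[n] is already a vector, whereas the matrix version has to copy row n out into a new vector on every access (a sketch, using your variable names):

data {
  vector[C] x[N];    // array of vectors: x[n] is handed over as a unit
  // matrix[N, C] x; // matrix: x[n] copies row n into a new vector
}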

Alternatively, rather than doing a sequence of N dot products, you can combine it all into a single `rows_dot_product` operation by constructing the appropriate matrices first.
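Something along these lines, guessing at your types and dimensions (a sketch, untested):

transformed parameters {
  matrix[N, C] B;
  matrix[N, C] X;
  vector[N] beta_x;
  for (n in 1:N) {
    B[n] = beta[conversion_type[n], vertical[n]]';  // coefficients for row n
    X[n] = x[n]';
  }
  beta_x = rows_dot_product(B, X);  // all N dot products in one call
}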

We know it’s very challenging. As far as I know, nobody’s figured this out for any autodiff system. The problem is how to also implement the indexing operation. It’s fine if you only ever access vectors or matrices holistically.
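To see why, consider a struct-of-arrays layout like the one sketched above: element access has to return something whose adjoint still aliases the container’s storage, and then every downstream operation has to accept those views as well as ordinary vars (hypothetical types again):

#include <Eigen/Dense>

// Hypothetical view type illustrating the indexing problem.
struct var_view {
  double* val;  // points into the container's value storage
  double* adj;  // points into the container's adjoint storage
};

struct soa_var_vector {
  Eigen::VectorXd val, adj;
  // Whole-vector ops stay cheap, but v[i] can no longer be a plain var:
  // a copy would disconnect its adjoint from adj(i).
  var_view operator[](int i) { return {&val(i), &adj(i)}; }
};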

This is very misleading. If you instrument directly, you’ll find those range checks don’t consume nearly the amount of time they appear to here. Profiling has the problem of interrupting highly optimized operations and adding timing artifacts.

Rule one is to post enough of your model so we don’t have to guess what the types of things are.

Sorry, my bad. I think this is the minimal full Stan program that illustrates the problem:

data {
  int<lower=0> N;
  int<lower=0> T;
  int<lower=0> V;
  int<lower=0> C;
  vector[N] y;
  simplex[C] x[N];
  int<lower=1, upper=T> c[N];
  int<lower=1, upper=V> v[N];
}

parameters {
  vector<lower=0, upper=1>[C] beta_t[T];
  vector<lower=0, upper=1>[C] beta_v[V];
  real<lower=0> nu;
  real<lower=0.01> sigma; 
}
transformed parameters {
  vector[C] beta[T, V];
  vector[N] beta_x;

  for (t in 1:T)
    for (v in 1:V)
      beta[t, v] = beta_t[t] + beta_v[v];

  for (n in 1:N)
    beta_x[n] = dot_product(beta[c[n], v[n]], x[n]);
}

model {
  y ~ student_t(nu, beta_x, sigma);
}

For context: C is much smaller than N, which also plays a role here.

There’s a chapter in the manual on representations to reduce copying. Making x an array of vectors, for example, is much more efficient than making it a matrix.

The copying happens because Stan vectors are represented as Eigen::Matrix<stan::math::var, -1, 1> in C++, and in order to perform dot_product the Stan Math library copies the values out into an Eigen::VectorXd and then calls Eigen’s .dot() function.

Alternatively, rather than doing a sequence of N dot products, you can combine it all into a single `rows_dot_product` operation by constructing the appropriate matrices first.

I tried this approach too, but Stan’s rows_dot_product compiles to:

  // rows_dot_product still allocates one dot_product_vari per row:
  for (size_type j = 0; j < v1.rows(); ++j) {
    ret(j) = var(new dot_product_vari<T1, T2>(v1.row(j), v2.row(j)));
  }

so there can be no performance gain from using it.

We know it’s very challenging. As far as I know, nobody’s figured this out for any autodiff system.

This was really the core of my question. If storing vector values as vectors of doubles can’t be done successfully, it might still be possible to speed up dot_product quite significantly by pre-allocating a buffer for those copies: currently, dot_product spends most of its time waiting on malloc.
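For instance (a sketch, not a patch against the actual Stan Math source):

#include <stan/math/rev/mat.hpp>

// Hypothetical buffer reuse: one thread-local scratch vector that only
// grows, so repeated dot_product calls stop paying for malloc.
double dot_with_scratch(const Eigen::Matrix<stan::math::var, -1, 1>& v,
                        const Eigen::VectorXd& x) {
  static thread_local Eigen::VectorXd scratch;
  if (scratch.size() < v.size())
    scratch.resize(v.size());            // allocates only when growing
  for (int i = 0; i < v.size(); ++i)
    scratch(i) = v(i).val();
  return scratch.head(v.size()).dot(x);  // Eigen kernel over the buffer
}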

This is very misleading. If you instrument directly, you’ll find those range checks don’t consume nearly the amount of time they appear to here. Profiling has the problem of interrupting highly optimized operations and adding timing artifacts.

No, I don’t think that’s how perf works; these measurements are very accurate. The range checks do indeed consume a lot of time (removing them manually from the C++ code gives the expected 1.1x speedup), for the reason I already mentioned: there’s a lot of random memory access happening there.

Yup. We have to do that to pack and unpack autodiff variables into Eigen.

One thing we could do is use our own memory pool rather than calling out to malloc. We’re gradually rewriting some of these algorithms to be better behaved in that way. But it’s slow going given other priorities.

Thanks for clarifying—I didn’t realize it was lots of random memory access. I would’ve expected bounds checks to be memory local, at least with matrix or vector types.

We’ve found a lot of hard-to-identify artifacts when profiling the performance of Stan code, especially around bounds checking.

Is there anything I could do to help get it done sooner? Where can I find an example of an algorithm using that pool?

Minimal working examples help out a lot! If you can post data along with this, it would help someone validate things like the range checks consuming a lot of time, and then we can do something about it. That’s straightforward enough for someone to tackle a bit sooner.

There’s the description in the arXiv paper on Stan’s autodiff. That’s the best thing to read about how things are done now.

The pool is in the memory directory and the simplest use is in stan/math/rev/core/precomputed_gradients.hpp. It gets collected all at once when the reverse pass is done to compute gradients (basically, that’s just a constant pointer reset—it’s done this way to avoid having to call any destructors—so don’t put anything using malloc into a vari implementation).

You can then use Eigen::Map on top of raw data, which is pretty efficient. Or you can just access the memory directly if you don’t need matrix ops.
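Putting those two together, usage looks roughly like this; the exact accessor for the allocator has moved around between Stan Math versions, so treat the names as approximate:

#include <stan/math/rev/mat.hpp>

// Sketch: allocate scratch doubles from the arena (the pool in the memory
// directory), then wrap them in an Eigen::Map. The arena is reclaimed
// wholesale by recover_memory() with no destructor calls, so store only
// plain data here.
void arena_dot_example(const Eigen::VectorXd& x) {
  const Eigen::Index n = x.size();
  double* scratch
      = stan::math::ChainableStack::memalloc_.alloc_array<double>(n);
  Eigen::Map<Eigen::VectorXd> vals(scratch, n);
  vals.setOnes();               // fill with real values in practice
  double result = vals.dot(x);  // Eigen op straight over arena memory
  (void)result;                 // silence unused-variable warnings
}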