Integrating GPU support

alex2awesome · October 3, 2016, 8:47pm

Hi all,

I am really curious to explore whether GPU integration might have an effect on the run-time of any of Stan’s algorithms.

Specifically, I was curious in exploring:

Which algorithm would be a good place to start?
Where are the major computational bottlenecks in this algorithm?
What parts of this algorithm might lend themselves most to parallelism?

I emailed Bob, Alp, and Dan about this question and here is a summary of their responses. This thread is an attempt to bring that conversation into the community forum.

Bob:

I think the easiest thing to do might be to work on the
matrix algebra—Eigen has some GPU support for some of its
expensive operations. I don’t know anything about GPUs!

First step would be to get a no-holds-barred prototype working
then think about integrating into Stan.

Dan:

If you need a particular example, write a logistic regression using the Stan language, use CmdStan to generate the C++, then hack the C++ to use GPUs. If you can get that going, we can see what we can do to get it integrated all the way through. (Was that specific enough? It’d be easy enough to provide a Stan program if it isn’t.)

We’d want GPU for computation of the log joint probability distribution that’s defined by the Stan program. You won’t touch any of the algorithms. We spend very little time in the algorithm code. We spend almost all of the time in the computation of gradients of the log joint probability distribution.

Alex

alex2awesome · October 3, 2016, 8:52pm

Hey Dan,

I generated .cpp code for Logistic regression (as outlined in the dose example on page 76 of Gelman’s BDA3) and for the 8 Schools example.

In logistic_model.cpp, I’m seeing two log_prob functions, one on line 171

                 std::ostream* pstream__ = 0) ```

and one on line 234: 

```T_ log_prob(Eigen::Matrix<T_,Eigen::Dynamic,1>& params_r,
               std::ostream* pstream = 0)```

The second is a wrapper for the first, no? And am I right in saying that most of the math for calculating the log_probability is handled by 

        stan::math::accumulator<T__> lp_accum__;

?
Thanks so much, 

Alex

syclik · October 4, 2016, 5:53pm

Hi Alex,

The second is a wrapper for the first.

Can you post the exact form of logistic regression (as a Stan program) that you’re using? Then we can talk through details. I don’t know too much about the details of GPU programming, so it’ll be me proposing things that I think could work and hopefully you can turn that into something that might be of use.

Daniel

alex2awesome · October 4, 2016, 6:36pm

Sure — it is based off the model proposed for measuring how dose effects (BDA3, page 76) and here’s the stan code I used to generate the .cpp for it:

data {
      int<lower=0> N;
      int n[N];
      vector[N] x;
      int y[N];
   }
parameters {
      real alpha;
      real beta;
   }
model {
      for (i in 1:N)
          y[i] ~ binomial_logit(n[i], alpha + beta * x[i]);
   }

GPU computation has grown more advanced — OpenCL has a variety of built-in mathematical functions like the sine function and even the Gamma function. (Full set of built-in functions listed on the second page of this reference card: https://www.khronos.org/files/opencl-1-2-quick-reference-card.pdf).

So, the kernel computations can get pretty extensive and intricate; thus, I’d say we should start by brainstorming any computational step that can be parallelized, not just limiting ourselves to thinking about simple, parallelizable computational steps. My instinct would be to start looking at the stan::math::accumulator class, but you probably have better instincts than me on this!

Bob_Carpenter · October 4, 2016, 7:29pm

The accumulator class can be asynchronous—it just builds up
a collection (potentially with repeated entries). The order doesn’t
matter, but add requests do need to be carried out exactly once.

But the accumulation is trivial compared to operations like matrix
multiply, so I’m not sure what you’re hoping to gain by this.

Bob

syclik · October 4, 2016, 9:19pm

Thanks. I’m glad I checked with you. I don’t think this would have been a good use-case for GPUs. (Maybe it is?)

Well… perhaps you can start with telling me what GPUs would be good for. Something like:

splitting out the for loop?
a really big matrix multiplication?
something else?

Perhaps I can tell you where most of the computation within a single Stan program run takes place. It is spent in automatic differentiation. See: [1509.07164] The Stan Math Library: Reverse-Mode Automatic Differentiation in C++

The cost to build up the expression graph isn’t as expensive as the backward pass to compute the gradients. But almost all our time is spent in computing gradients (and rightly so!). The natural place to improve speed is to compute the gradients faster. We can’t parallelize MCMC because it is a Markov chain and it depends on the previous state. We can’t parallelize within an iteration in HMC algorithms because each step depends on the previous. So we can try to parallelize the computation within the Stan program.

Perhaps try a large matrix multiplication instead of a for loop? That might be more natural to parallelize.

alex2awesome · October 4, 2016, 10:54pm

Hey Dan,

Ah I understand. You mean the for-loop around the binomial-logit? Gotcha, I wasn’t sure what the computation was like under the hood but it makes sense to focus on a Stanmodel without a for. Do you think the schools model on the website is a good candidate?

data {
    int<lower=0> J; // number of schools
    real y[J]; // estimated treatment effects
    real<lower=0> sigma[J]; // s.e. of effect estimates
}
parameters {
    real mu;
    real<lower=0> tau;
    real eta[J];
}
transformed parameters {
    real theta[J];
    for (j in 1:J)
        theta[j] <- mu + tau * eta[j];
}
model {
    eta ~ normal(0, 1);
    y ~ normal(theta, sigma);
}

Or do you think the for in the transformed parameters block is also a problem?

It’ll take me a bit to digest that paper and go through some of the code. Would you happen to have any time tomorrow by phone or Thursday morning to give me some pointers? (I can summarize what we talk about on this thread!) Or I can take a couple of days to go through this and check back in?

To answer your question, I’m going to hedge and say that the speed-up GPUs deliver varies case-by-case and it’s worthwhile trying to test any code-segment in CUDA on a GPU that you’d be able to parallelize by other means on a CPU. In addition to basic matrix computation, I’d be interested in maybe exploring whether there are bit-wise vector operations that could be looked at?

Alex

syclik · October 5, 2016, 3:02am

Let’s talk tomorrow afternoon. Google hangout? (What time zone are you in?)

avehtari · October 5, 2016, 8:02am

Hi, a few years ago I experimented with GPUs for GPstuff. I was able to easily get significant speed-ups by using GPUs for computing Choleskys and products for big matrices. I would guess this would be easy way to speed up GPs and big GLMs in Stan, too.

Aki

syclik · October 5, 2016, 11:38am

Hi Aki,

That’s good to know. That was my first thought too: large matrix operations.

Daniel

Bob_Carpenter · October 5, 2016, 5:55pm

You want to focus on GPUs for our matrix operations, which
are the major bottlenecks in bigger models and have a clear path
for GPU support through Eigen.

No point in parallelizing eight schools with GPUs—it runs in a fraction
of a second on an old notebook in one thread.

Bob

alex2awesome · October 6, 2016, 4:45am

Hey all,

Had a productive meeting with Bob and Dan today — thank you so much for taking the time to meet with me. We talked about several paths to explore:

Matrix multiplications: This seems like the most promising high-impact path to look at first, as basic matrix operations occur throughout the code. A slight speed up would thus have a large downstream effect.

1a. Looking solely at Eigen, can we start to profile the speedups they achieve on GPU vs. CPU. This would start by taking Eigen out of the context of Stan.

1b. We can also start to look at integrating Eigen 3.0 (the release of Eigen with GPU support) into Stan and seeing how this alone increases speed.

1c. Further exploration can be done with the people involved in the Eigen project to see if they need help implementing some matrix operations that will prove crucial to Stan but that they have not yet implemented.
ODEs: We looked at a paper that applied [GPU speedups to differential equations involved in chemical reactions], and talked about the how CVODES library used for DE calculations might be a candidate to start looking at. (http://www.hds.bme.hu/~fhegedus/2013%20Accelerating%20moderately%20stiff%20chemical%20kinetics%20in%20reactive-flow%20simulation%20using%20GPUs.pdf)

Anyway, I’ll start poking around at these this week/weekend. Probably will start with 1a and maybe the part of 1c that involves reaching out to Eigen.

Thanks again Dan and Bob for your time,

Alex

syclik · October 6, 2016, 5:21am

Thanks for coming up. Always happy to meet and discuss Stan.

A few things to note: matrix operations will have a wider impact than the ode solvers, but both are important. The Eigen version we are trying to target is 3.3 (not 3.0).

Do you still need a Stan program to test with? If so, look at the gaussian process examples in the example-models repo.

avehtari · October 6, 2016, 9:33am

If looking GP examples, then the latest updates by Rob for covariance function computation and Cholesky should be used.

syclik · October 6, 2016, 11:09am

+1.

Where are they?

seantalts · January 27, 2017, 10:34pm

Are these the GP models from Rob? https://github.com/stan-dev/stancon_talks/tree/master/2017/Contributed-Talks/08_trangucci

Also, have you guys seen this paper? thoughts? Fast Hamiltonian Monte Carlo Using GPU Computing

betanalpha · January 27, 2017, 10:52pm

This paper is a mess. They confound gradients with cost (with autodiff gradients cost the same order as the function evaluation itself – there is a higher cost per transition, but that’s because more evaluations are used for better overall performance) and use static HMC and terrible heuristics to try to get around some of the problems with it. But really they fit a single, simple model that reduces to matrix algebra amendable to GPUs. That’s it. Nothing special. Also they made some weird statement about Stan not being able to compute their model which is just completely wrong.

Bob_Carpenter · January 27, 2017, 11:34pm

What we very much would like to be able to do is
turn on GPU support for our matrix operations so that when
we have models with those matrix operations, they’ll go
faster. I believe that’s supported through Eigen as of
3.3, but I have no idea how to turn it on or use it.

Bob

seantalts · January 28, 2017, 2:20pm

A tricky thing here is that as far as I understand, there’s a fair bit of latency involved in copying data to the GPUs and back so we don’t necessarily want to ship all matrix operations over to GPUs (especially if there’s a lot of back and forth in a model). But we can do some empirical tests and see if we can figure out some criteria to use for deciding when to ship it over or if perhaps it ends up just being worth it all the time. I’d love to take a look at the Eigen configuration required for this at some point. I seem to recall a discussion about when to switch to Eigen 3.3 but can’t find it now - is that waiting for C++11 in April as well?

bgoodri · January 28, 2017, 3:59pm

No. Stan should work with Eigen 3.3.x now but Stan Math does not bundle 3.3.x yet because we are waiting for RcppEigen to move to 3.3.x

github.com/RcppCore/RcppEigen

Status of Eigen 3.3.2

opened 08:25PM - 26 Jan 17 UTC

closed 08:32PM - 01 May 17 UTC

yixuan

I just want to report the current status of updating `RcppEigen` to Eigen 3.3.2.… In short, there are a few incompatibilities between the dependent packages of `RcppEigen` and Eigen 3.3. Below is a list of packages that have problems so far: ### BigVAR [Fixed](https://github.com/wbnicholson/BigVAR/commit/927b0a47cf038d60d9a37739cd250339424664f7) on GitHub ### CorReg File `rechercheZ.cpp`, line 90-92: `imaxiter` is a numeric vector, so `imaxiter[0]` is of type "double". The constructor of Eigen vectors require integer argument, so `imaxiter[0]` should be changed to `static_cast<int>(imaxiter[0])`. File `rechercheZ_rejet.cpp`, line 60-62: Same as above. File `rechercheZ_relax.cpp`, line 58-60: Same as above. File `rechercheZ_sparse_relax.cpp`, line 66-68: Same as above. ### oem [Fixed](https://github.com/jaredhuling/oem/commit/6bc3bdb75cfedcb913f38e999adbb066babdb133) on GitHub **Update 03/22**: Already fixed in recent CRAN version. ### OpenMx Waiting [bug](http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1380) to be fixed. **Update 03/22**: The code is now able to compile, but has a new runtime error. ### TDA In Eigen 3.3, when one wants to define a new scalar type, one needs to define a new `typedef` named `Literal`. `src/CGAL/Sqrt_extension/Eigen_NumTraits.h`, line 31: Add one line `typedef Real Literal;`. `src/CGAL/Quotient.h`, line 860: Add one line `typedef CGAL::Quotient<NT> Literal;`. `src/CGAL/Mpzf.h`, line 1136: Add one line `typedef CGAL::Mpzf Literal;`. `src/CGAL/MP_Float.h`, line 889: Add one line `typedef CGAL::MP_Float Literal;`. `src/CGAL/Lazy_exact_nt.h`, line 1429: Add one line `typedef CGAL::Lazy_exact_nt<ET> Literal;`. `src/CGAL/Interval_nt.h`, line 1269: Add one line `typedef CGAL::Interval_nt<b> Literal;`. `src/CGAL/Gmpz.h`, line 216: Add one line `typedef CGAL::Gmpz Literal;`. `src/CGAL/Gmpq.h`, line 142: Add one line `typedef CGAL::Gmpq Literal;`. ### TMB A bit more complicated. The package depends on another library that has compatibility issues with Eigen 3.3, and I'm trying to find a way to fix it. I think we need to wait for all bugs being fixed before we can submit the updated `RcppEigen` to CRAN. **Update 02/18**: Issues of TMB have been [fixed](https://github.com/kaskr/adcomp/blob/master/TMB/NEWS#L14) by the author.

Topic		Replies	Views
Stan on the GPU Project Proposals	16	8523	August 10, 2018
Stan on GPU: looking for model+dataset examples for empirical evaluation of speedups General	36	3449	March 5, 2018
GPU Update: what's up and where we are going Developers features , math	29	2601	November 12, 2018
ViennaCL with stan: Cholesky Benchmark Project Proposals stan-math	56	6265	July 23, 2018
Any way to make Stan competitive with Tensorflow for maximum likelihood? Algorithms	43	6267	July 31, 2019

Integrating GPU support

Related topics