I am really curious to explore whether GPU integration might have an effect on the run-time of any of Stan’s algorithms.

Specifically, I was curious in exploring:

Which algorithm would be a good place to start?

Where are the major computational bottlenecks in this algorithm?

What parts of this algorithm might lend themselves most to parallelism?

I emailed Bob, Alp, and Dan about this question and here is a summary of their responses. This thread is an attempt to bring that conversation into the community forum.

Bob:

I think the easiest thing to do might be to work on the
matrix algebra—Eigen has some GPU support for some of its
expensive operations. I don’t know anything about GPUs!

First step would be to get a no-holds-barred prototype working
then think about integrating into Stan.

Dan:

If you need a particular example, write a logistic regression using the Stan language, use CmdStan to generate the C++, then hack the C++ to use GPUs. If you can get that going, we can see what we can do to get it integrated all the way through. (Was that specific enough? It’d be easy enough to provide a Stan program if it isn’t.)

We’d want GPU for computation of the log joint probability distribution that’s defined by the Stan program. You won’t touch any of the algorithms. We spend very little time in the algorithm code. We spend almost all of the time in the computation of gradients of the log joint probability distribution.

I generated .cpp code for Logistic regression (as outlined in the dose example on page 76 of Gelman’s BDA3) and for the 8 Schools example.

In logistic_model.cpp, I’m seeing two log_prob functions, one on line 171

std::ostream* pstream__ = 0) ```
and one on line 234:
```T_ log_prob(Eigen::Matrix<T_,Eigen::Dynamic,1>& params_r,
std::ostream* pstream = 0)```
The second is a wrapper for the first, no? And am I right in saying that most of the math for calculating the log_probability is handled by
stan::math::accumulator<T__> lp_accum__;
?
Thanks so much,
Alex

Can you post the exact form of logistic regression (as a Stan program) that you’re using? Then we can talk through details. I don’t know too much about the details of GPU programming, so it’ll be me proposing things that I think could work and hopefully you can turn that into something that might be of use.

Sure — it is based off the model proposed for measuring how dose effects (BDA3, page 76) and here’s the stan code I used to generate the .cpp for it:

data {
int<lower=0> N;
int n[N];
vector[N] x;
int y[N];
}
parameters {
real alpha;
real beta;
}
model {
for (i in 1:N)
y[i] ~ binomial_logit(n[i], alpha + beta * x[i]);
}

GPU computation has grown more advanced — OpenCL has a variety of built-in mathematical functions like the sine function and even the Gamma function. (Full set of built-in functions listed on the second page of this reference card: https://www.khronos.org/files/opencl-1-2-quick-reference-card.pdf).

So, the kernel computations can get pretty extensive and intricate; thus, I’d say we should start by brainstorming any computational step that can be parallelized, not just limiting ourselves to thinking about simple, parallelizable computational steps. My instinct would be to start looking at the stan::math::accumulator class, but you probably have better instincts than me on this!

The accumulator class can be asynchronous—it just builds up
a collection (potentially with repeated entries). The order doesn’t
matter, but add requests do need to be carried out exactly once.

But the accumulation is trivial compared to operations like matrix
multiply, so I’m not sure what you’re hoping to gain by this.

The cost to build up the expression graph isn’t as expensive as the backward pass to compute the gradients. But almost all our time is spent in computing gradients (and rightly so!). The natural place to improve speed is to compute the gradients faster. We can’t parallelize MCMC because it is a Markov chain and it depends on the previous state. We can’t parallelize within an iteration in HMC algorithms because each step depends on the previous. So we can try to parallelize the computation within the Stan program.

Perhaps try a large matrix multiplication instead of a for loop? That might be more natural to parallelize.

Ah I understand. You mean the for-loop around the binomial-logit? Gotcha, I wasn’t sure what the computation was like under the hood but it makes sense to focus on a Stanmodel without a for. Do you think the schools model on the website is a good candidate?

data {
int<lower=0> J; // number of schools
real y[J]; // estimated treatment effects
real<lower=0> sigma[J]; // s.e. of effect estimates
}
parameters {
real mu;
real<lower=0> tau;
real eta[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] <- mu + tau * eta[j];
}
model {
eta ~ normal(0, 1);
y ~ normal(theta, sigma);
}

Or do you think the for in the transformed parameters block is also a problem?

It’ll take me a bit to digest that paper and go through some of the code. Would you happen to have any time tomorrow by phone or Thursday morning to give me some pointers? (I can summarize what we talk about on this thread!) Or I can take a couple of days to go through this and check back in?

To answer your question, I’m going to hedge and say that the speed-up GPUs deliver varies case-by-case and it’s worthwhile trying to test any code-segment in CUDA on a GPU that you’d be able to parallelize by other means on a CPU. In addition to basic matrix computation, I’d be interested in maybe exploring whether there are bit-wise vector operations that could be looked at?

Hi, a few years ago I experimented with GPUs for GPstuff. I was able to easily get significant speed-ups by using GPUs for computing Choleskys and products for big matrices. I would guess this would be easy way to speed up GPs and big GLMs in Stan, too.

You want to focus on GPUs for our matrix operations, which
are the major bottlenecks in bigger models and have a clear path
for GPU support through Eigen.

No point in parallelizing eight schools with GPUs—it runs in a fraction
of a second on an old notebook in one thread.

Had a productive meeting with Bob and Dan today — thank you so much for taking the time to meet with me. We talked about several paths to explore:

Matrix multiplications: This seems like the most promising high-impact path to look at first, as basic matrix operations occur throughout the code. A slight speed up would thus have a large downstream effect.

1a. Looking solely at Eigen, can we start to profile the speedups they achieve on GPU vs. CPU. This would start by taking Eigen out of the context of Stan.

1b. We can also start to look at integrating Eigen 3.0 (the release of Eigen with GPU support) into Stan and seeing how this alone increases speed.

1c. Further exploration can be done with the people involved in the Eigen project to see if they need help implementing some matrix operations that will prove crucial to Stan but that they have not yet implemented.

Thanks for coming up. Always happy to meet and discuss Stan.

A few things to note: matrix operations will have a wider impact than the ode solvers, but both are important. The Eigen version we are trying to target is 3.3 (not 3.0).

Do you still need a Stan program to test with? If so, look at the gaussian process examples in the example-models repo.

This paper is a mess. They confound gradients with cost (with autodiff gradients cost the same order as the function evaluation itself – there is a higher cost per transition, but that’s because more evaluations are used for better overall performance) and use static HMC and terrible heuristics to try to get around some of the problems with it. But really they fit a single, simple model that reduces to matrix algebra amendable to GPUs. That’s it. Nothing special. Also they made some weird statement about Stan not being able to compute their model which is just completely wrong.

What we very much would like to be able to do is
turn on GPU support for our matrix operations so that when
we have models with those matrix operations, they’ll go
faster. I believe that’s supported through Eigen as of
3.3, but I have no idea how to turn it on or use it.

A tricky thing here is that as far as I understand, there’s a fair bit of latency involved in copying data to the GPUs and back so we don’t necessarily want to ship all matrix operations over to GPUs (especially if there’s a lot of back and forth in a model). But we can do some empirical tests and see if we can figure out some criteria to use for deciding when to ship it over or if perhaps it ends up just being worth it all the time. I’d love to take a look at the Eigen configuration required for this at some point. I seem to recall a discussion about when to switch to Eigen 3.3 but can’t find it now - is that waiting for C++11 in April as well?