Making big kernels work


Hi - I’m told the birthday problem can work with 7200x7200 kernel. I’m not sure how the memory is managed in GPstuff, but in Stan I’m almost crashing my computer. (can’t run top because it stops working, ). I think everything is loaded on one core, so it would help to parallelize some of the bigger matrix computations.

There are parallel cholesky algorithms, but I think Eigen’s already thought this out… looks like there’s an openMP compiler flag that will automatically parallelize some of these computations. If I’ve enabled the MPI enough to at least run the tests, are these computations already enabled or need I do more? going to give it a shot right now but throwing it out there


Would it be possible to paste the model and the data or send it to me via e-mail? I am interested to see what we would get with GPU help.

The stancon2018 branch on our fork ( runs the cholesky decompose, some matrix multiplications, mdivide_left_tri and some parts of the multi_normal_lpdf on the GPU.


yeah, I’ll upload the model/data in a sec, so you can just run it down. Let me clean it up so we’re running the exact same thing. I’m going to run 8 cores on MPI and I want to compare the speed


With these two files after model is compiled you can just run ./birthday_demo2 sample data I think 2.18 or the develop branch has what you need -GPU stuff.

birthday_demo2.stan (8.6 KB) (1.3 MB)



I just have to snyc the fork as our math is missing things like gp_periodic_cov that were added recently and then we are good to go.


yeah pretty much whatever I’m doing, with NUTS I’m having memory leaks, at least with this stan model.

Give me a sec and I’m going to send by another model that takes a kernel out of scope. This is poor memory management on my part. I’m just going to omit the predictive posteriors for now.

Do GPU’s generally have a larger memory capacity?

I’m going to have to look deeper into Stan’s memory arena then…

If you can’t get this to go I can send by some other GP model, but I think in the demo you guys already tested up to N=1000.

@Bob_Carpenter sorry to get hopes up but not sure if this is possible… there’s still hope though


stanc does not seem to find the gp_periodic_cov. I can see that it is defined in stan math, but not in stan? Did you add something on your part or am I missing something?


That really depends on the GPU and its price. Scientifc GPUs have up to 12GB of global memory, mid and high-end gaming GPUs have 4-8GB of global memory, older GPUs usually have something in the range of 1-4GB.


yeah, forgot about that. I’m copying some code that will enable you to use gp_periodic_cov, and also the dot product kernel. In cmdstan, the directory is stan/src/stan/lang/function_signatures.h. If this doesn’t work I made a cmdstan 2.18 that I can upload, that has the MPI stuff, too.
temp.txt (1.5 KB)

But more important: Don’t run this. The memory requirements are too much for the GPU you’re using for this problem. The simplest thing I got to run related to this problem was sum of two kernels (periodic*squared_exp + dot_product) for an N=365 problem, and then subsets of the data, albeit I wasn’t very organized about it. The sum of kernels makes the memory requirement too large for this problem (and all this time I was worried about speed…). I can’t find a reference for this but Aki digged it up somewhere… “Autodiff memory will be about 40 bytes per subexpression evaluated in the expression graph”.

So… we have N * N * #of parameters we need something like 54 GB for this model. I guess I’ll write code that’s realistic with less kernels, but it will be more of a toy model for testing GPU/MPI speed, and when I do that, I’ll ping you. I mean you can run it if you want, but I wouldn’t if you value your time.


It hasn’t been added to the language yet.


@drezap Please ping me, this seems like a really interesting case for GPU speedups.

In the meantime, I prepared a quick GPU speedup test for the stan::math functions that are used inside your model. This way we can see what can be achieved now with what is already implemented and awaiting pull requests in our fork. The below results are for functions in /prim. If anyone is interested, I can prepare the same for the derivatives in /rev.

I tested this with 2 GPUs:

  • NVIDIA Titan Xp ( a high end GPU, it does lack a bit of double precision performance so it produces comparable results to the 400-500$ “gaming” GPUs )
  • Tesla V100 ( currently the best GPU from NVIDIA, costs about 2$/hour on the google cloud/ AWS)

Matrix multiplication of 2 NxN matrices

The X-axis represents N. A maximum speedup of 20 for the TitanXp and 128 for the Tesla V100. Copying data to and from the GPU represents 70-80% of the execution time.



The X-axis represents the dimension of the input square matrix. A maximum speedup of 31 for the TitanXp and 125 for the Tesla V100.


cholesky decomposition of a NxN matrix

The X-axis represents N. A maximum speedup of 10 for the TitanXp and 18for the Tesla V100.


For the cov_exp_quad the speedup is ~5-6 for both GPUs as its not really a complex function. I would expect about the same speedup for the gp_periodic_cov and gp_dot_prod_cov.

The bottlenecks for the multi_normal_cholesky_lpdf is mdivide_left_tri and matrix multiplications, while cholesky decomposition is the bottleneck for the multi_normal_rng. So expected speedups can be based on the speedups of those functions.


Great work! Thanks for the visualizations, too! I really appreciate this!

yeah, this is as expected. Did you calculate ratio of total run time of all of these functions with the total program? I’m curious usually cholesky decomp is taking up a lot of time but it looks like in our Stan programs a lot more time is being taken up elsewhere, i.e. matrix multiplication and some of the RNG’s.

In the meantime, I’m attaching an ARD model for the exponential quadratic kernel, as in this paper: Model selection for Gaussian processes utilizing sensitivity of posterior predictive distribution.

I think they use a reduced size dataset for some of the experiments. Some times with with ARD the sampling is faster because the model is more flexible (kind of the same effect on sampling as adding more flexibility with hierarchical priors), so I’d be interested in seeing total execution time. Also, the ARD kernel will probably be slower because it’s not specialized for reverse more autodiff, so id be interested in comparing w a non ARD model.

I’m attaching…

  1. the ARD model, concrete.stan that works for all of these…
  2. 4 datasets: concrete, boston housing, automobile, crime, and then some …
  3. prim code for the ARD (that isn’t merged into dev yet and is cued up), and then the…
  4. function signatures code. I can format the data for you, if you really want, just ping me.

Any reason this hasn’t been added? I know you asked for a computation of a distance matrix that calculates the lower triangular (because ditsance) is being called over and over, but the PR was up for a long time and I was just trying to push it through. Is this not up to standards? Honestly, as I’m writing this, probably best to implement a distance function because it’s easier to edit the kernels now…

concrete.stan (447 Bytes)
Concrete_Readme.txt (3.7 KB)
concrete.hpp (16.6 KB)
format_data.R (610 Bytes) (48.5 KB)
Concrete_Data.csv (56.9 KB) (40.7 KB)
housing_readme.txt (2.0 KB)
imports-85_readme.txt (4.6 KB)
imports-85.txt (25.3 KB)
housing.txt (47.9 KB)
cov_exp_quad.hpp (8.1 KB)
temp.txt (2.0 KB)

I think for two of these datasets it’s formatted…

and then next I’ll send you the largest sum with N=7200 I can get to work after I instrument the code…

Is there anyway you can parsimoniously share some of the code/how you’re testing run time?

Another interesting case would be a Student-t likelihood (sampling from it), a model of which would look something like this:

model {
parameters {
  real<lower=0> magnitude;
  real<lower=0> length_scale;
  real a;
  vector[N_tot] eta;
transformed parameters {
  vector[N_tot] f;
  matrix[N_tot, N_tot] L_K;
    matrix[N_tot, N_tot] L_K;
    matrix[N_tot, N_tot] K;   
    K = cov_exp_quad(x_tot, magnitude, length_scale);
    for (n in 1:N_tot)
      K[n, n] = K[n, n] + 1e-12;

    L_K = cholesky_decompose(K);
    f = L_K * eta;
model {
  magnitude ~ normal(0, 1);
  length_scale ~ inv_gamma(5, 5);
  eta ~ normal(0, 1);

  y ~ student_t(3, f, 2);

Not sure if this is accurate or will compile, but I can put something together.


Thanks. I am going to try and run these models. Just to double check if I understand correctly what I have to do:

  1. add the cov_exp_quad.hpp code in prim
  2. add the function signatures in function_signatures.h
  3. use and with concrete.stan

Should I just run this with the default parameters?

The code is all here with the instructions on how to use is posted on the wiki

The execution times of prim functions are measured here

The cholesky_decompose has a separate cholesky_decompose_gpu function that uses the GPU, for the others we simply put in #ifdef statements in regular functions like this

In order to get the speedups you have to run the speedups_test twice, once with STAN_OPENCL defined and once without it. Dont forget to run make clean in between.

This is one thing we are still investigating how to do it more efficiently. Currently we measure the time for the whole transition in stan/services/util/generate_transitions.hpp and then measure times in all the relevant functions in /rev and /prim and stop once we see that the execution times of /rev and /prim functions add up to the transition execution time or close to it. Its pretty time consuming but we havent found a better way yet. We offcourse then confirm everything by running the whole model a few times on the CPU and GPU to see if we get consistent speedup numbers.

Your birthday_demo2 model stops due to bad_alloc so we havent tested it fully to give you a ratio unfortunately.


No, just replace the cov_exp_quad.hpp entirely. But yeah, I think that’s all the steps. There’s two more datasets, I just didn’t put them in the format yet.

Yeah, should be fine. I’m not totally sure, though. I can ask the authors what priors they used for the papers, or have them send the code by. I was searching around for pointers on hyper parameter priors for GP models and apparently it’s still an open problem. Some suggestions: center the length scale larger than the smallest interval in your data, and a half cauchy for length scale isn’t recommended. You can check out “Prior Formulation for Gaussian Processes Hyper Parameters” Trangucci et al, and that has some suggestions and study. There’s also suggestions in the Stan manual, and that’s been suggested to me a few times. If you find anything else, please do share. For BDAY the priors are centered on, and for optimization initialized at the known values. (24.2 KB) (161.5 KB)

Yes, I’m aware, I mentioned this earlier, attached is a subset of the data that shouldn’t have those issues (but not realistic, we will not see the periodic trends as much as the full model). If it still does, let me know and I’ll reduce the dataset size. I’m currenlty messing with valgrind right now trying to quantify how much memory these models are using with different sized kernels. I believe the sum of kernels is causing the memory issue. When I find the largest possible sum of kernels, I can send another case.

EDIT: the trangucci paper isn’t on arxiv, and I can’t find it easily with google searches, as I had to do some digging, but feel free to email me if you want it. I’ve told @rtrangucci to upload it but it seems he hasn’t yet :)


Part 3 of @betanalpha’s GP case study:

edit: It’s for GPs with squared exponential kernels but I assume it’s similar to what you’d wanna do for other ones. I’m not really familiar with them though.



At one point I had std::bad_alloc w/ 20 years of data and was peaking out at 17.6 of memory, but now it’s working fine for the whole dataset, and my workflow is horrendous and I’m on very little sleep, so I can’t tell why. Now I’m peaking out at 7.4 GB, which is totally acceptable, and the thing is running. Anyway… this is the full model. If I get bad_alloc with this I think I’ll throw my computer at the wall.

Please throw -pg in the compiler flags in cmdstan’s makefile and run both valgrind --tool=massif ./birthday_demo2 sample data and valgrind --tool=massif ./birthday_demo2 optimize data, and we’ll see what happens. Please share the massif.out files as well.

Also if you experience bad_alloc please run the reduced size up to 10 years, which is

I’m going to run optimize and sample for defaults on 6 cores with MPI, and we can (hopefully) compare speed.

I’ve also updated the pred_rng's to not take in the whole kernel and take in parameters because the kernel is a memory hog (but I don’t think this was the issue, as it was out of scope during some tests).

birthday_demo2.stan (13.8 KB) (687.4 KB) (1.3 MB)


Thanks, I will try this out.

Just curios, what is initally reported as the time to calculate the gradient on your system?


Calculate the gradient? Do you mean evaluate initial joint log probability? For 1,2,3 years the run time was under 15 minutes whereas for N=3000 I remember it taking 40 minutes. This iteration I hadn’t run yet


I’ll get it running morning and let you know. I didn’t get it running, I had to leave the office


I mean the gradient evalution that is printed in the begining “Gradient evaluation took XYZ seconds”