Making big kernels work


I had only been using optimization w/ valgrind to make sure there was not memory issues today. I left office after I posted that code. I can check now


started it up now with out MPI


I ran it locally and my computer’s frozen, I may have to reboot and run it from server and let you know tomorrow; but I’ll let this run for another hour or so before I kil it


Yeah no problem Andre. Take your time.

I did some preliminary time measurements on the gradients with N=3350 and can see that with cholesky, mdivide_left_tri running on the GPU, gp_periodic_cov and gp_dot_prod_cov are responsible for 80%+ of the execution time. I will move this to the GPU tomorrow and see what we can get.


I just ran w/ the above specs and got the message “Killed”, which I think means my kernel generated the message:, and not stan.

Ok… that’s good to know. gp_dot_prod_cov isn’t rev'ed so I can put that together. I have test cases already written somewhere.

And I can speed up gp_periodic_cov by pre-computing a distance matrix that we can then input into the kernels. That should help… I didn’t profile that and I was just trying to merge a PR (pressure to “publish” producing poor work at it’s finest…). So I’ll go back and try to optimize it further, and actually profile the thing.

I’ll get it running with MPI tomorrow. I don’t know about the full 20 year kernel with NUTS would be good but may be 10 years and optimization and we can have ‘results.’ For BDA3 version I think laplace approximation was used anyway, so L-BFGS should be cool. I’m going to spend less time on this problem, I’ve been here for a while and accomplished nothing… need to produce something that people can see, so I’m not looking incompetent.

I can also share some of the heap profiling tomorrow.


You account for (via Github) about 50% of my incoming mail. The contents seem valuable enough – certainly compared to other things I read on the internet. There’s lots of stuff in Stan that’s a long way from perfect anyway. That’s why we’re still working on it.

Anyway, I’m a 7th year Phd. So phooey on your being here a while and accomplishing nothing :P. I’m the expert.


I ran it locally and computer is frozen… I may have to reboot and run it from a server and tell you tomorrow, but I’ll let it run another hour or so before I kill it



Hey - I’m realizing with valgrind sometimes it won’t catch the bad_alloc… it bugs at line 202 which when we’re allocating memory for the kernel (valgrind does some internal memory management?). So for this I need to just write a c++ program that allocates for the kernel and memory for the gradients of that, and take a dive into what’s happening (as opposed to naively writing a Stan program and crossing my fingers). I’m going to mark the full birthday problem as experimental (N=7300, whereas the 10 year dataset works fine for optimization, running now w/ no issues, and from experiments yesterday this is the largest I could get to run without bad_alloc).

So at least for a use case for the GPU, let’s use 10 year dataset (, and I’ll run both L-BFGS and NUTS, without MPI (I ran mpirun -np 8 ./birthday foo bar and it just runs 8 of the same processes, I don’t think I’ve set it up properly, I think it’s suppose to parallelize some likelihood computations or sth).

And then we can time and compare speed. Don’t run with valgrind. And I’m attaching some of the massif outputs…

Bob sorry for responding to you and blowing up your inbox I was replying via email…

But use this birthday_demo2.stan, where I’ve omitted the covariance for predictive distribution and I’ll only be using predictive mean (the predictive covariance isn’t stable, for me, at least).

birthday_demo2.stan (14.0 KB)

and with that, I’m going to move onto some other problem…


The last graph I saw had a much higher speedup for Cholesky, but I think that was not assuming the data needed to be transported. We actually have a bunch of different use cases. For Cholesky, we are usually decomposing something like a GP covariance function. For example, squared exponential takes a fixed N x K matrix of predictors x, a K-vector of length scales and an overall scale parameter. And then we need the Cholesky of the derived matrix.

The case we probably use the most for matrix multiplication, because it’s the core of regression, is where x is an N x K data matrix for N >> K and beta is a K x 1 coefficient vector. We care about the N x K Jacobian of d.(x * beta) / d.beta. But even there, we use it in a (G)LM setting. For example, for linear regression, we have an N x 1 vector of observation y and we care about lp = -0.5 * dot_self(y - x * beta)) / sigma^2 along with d.lp / d.beta and d.lp / d.sigma.


Did you mean Ben, aka @bbbales2?


The speedups here are for the cholesky_decompose, I think that the last graph I posted way back was for the gradient. The speedups there are much larger, I will post them once I make some proper performance unit tests that I can include in the repository and people can easily repeat.

The execution time of the gradient is also much bigger than the cholesky, so the effect of the speedup is much larger.

Yeah, Erik mentioned to me that already, that this graph might not be representable for the use-cases in Stan. The number of multiplications in the type of matrix multiplication you mention is much smaller than in the square matrix case I posted, so the speedups will not be as big, at least not until I figure out how to avoid transfering the data matrices from the CPU to the GPU in each iteration.


Nice. We care about both for Stan, since we use the double-based calculations in places and need gradients in others.


I’m a little confused - seems like the Out Of Memory (OOM) killer has killed your process, so you’re somehow running out of RAM (which is consistent with your computer grinding to a halt - try turning off or decreasing swap so it dies more quickly rather than needing a hard reset). GPUs typically have much less RAM than the computers they are installed in, so wouldn’t that also run out of memory? Or you are thinking you could harness both memories together?


I’ve been running with CPU, GPU ran out of memory as well. I just told Rok above not to run the 20 year example of the birthday problem.

This had crossed my mind. Not GPU/CPU but any combination of the two (at first it would be good to just get CPU running). What’s eating up most of the memory is the size of the autodiff stack for the sum of kernels. (We have N(N-1)/2 * number of expressions) to evaluate in autodiff). Would it be possible to divide the kernel’s autodiff stack between CPU’s? If so, this would be possible (esp on GPU) and most of the computation time would be due to copying over to the different cores.

Looks like this is a possibility. For a higher level description, may be I could use something like boost’s shared memory object when initializing the kernel, and autodiff stack for it. This would require a special function for gaussian process kernels, though.

Actually, I think it’s a good idea to come up with a prototype for this. Here are some initial thoughts, after skimming the boost page above. Said GP kernel would require shared_memory_object, and a file_mapping_object to created a shared_memory object, and require that I specify the address and size of what I want to allocate. Just throwing stuff out there… This could possibly interfere with other MPI or GPU stuff, and also only for one chain.

@seantalts, thanks for the conversation gives me a direction to head in.

good suggestion.


I’m hijacking this thread but:

@rok_cesnovar regarding the priors for the “Model Selection for Gaussian Processes utilizing…”, after some discussion w/ @jpiironen and Topi Paananan here is some more information.

The code from the paper is here:
I think it was a student-t for the marginal variance (which I call the magnitude) and then inverse gamma centered around 1 for the length scale). I can’t find it from the code easily.

Here’s an intuitive description of these parameters (at least for RBF/Squared exponential kernel), just throwing information out there:

The length scale determines how much two points are related along one axis (somewhere is GPML). If we choose a prior with a very long thin tail and a peak (like a student-t with a high degree of freedom), the length scale will have a hard time going to larger values, and this can be bad if we think two axis may be related.

The magnitude (or marginal variance) determines how much we allow the signal to vary vertically. If the magnitude is 0, then we essentially have a flat signal, whereas if it’s far from one the signal will be allowed to vary more.

I hope this helps, and if anyone has better knowledge please chime in… or this worth another thread.