I’m summarizing an email exchange we’ve been having over the past days, in case anyone wants to contribute ideas. We have a specific application in mind, but I’m not sure I can share it, so for the moment I’ll focus on the general problem.

@andrewgelman writes:

Hi [all]. […] we’re all interested in scalable Bayesian multilevel modeling, and we now have a specific and live example.

Just as a refresher, here are some ideas we were talking about to speed such computations.

- Brute force approaches (i.e., still doing full Bayes, just computing more efficiently):
  - Parallel computing
  - GPUs (I don’t know if they apply here or if that’s just for GPs)
  - Precomputing X^tX or something similar
  - Faster adaptation and improved mass matrix
- Approximate computation:
  - Laplace
  - GMO
  - ADVI
  - EP

Maybe there are other ideas I’m forgetting?

I’m thinking we’d start with the normal model, then once we’ve made good progress on that we can go to logistic.

Another interesting dimension here is how to improve the model, adding interactions or whatever to go beyond the basic IRT framework.

Next, we have a few exchanges, notably about real-world examples and also some simulated data. Hopefully I can share these in a future post. Then @Daniel_Simpson comments:

There’s a pile of stuff we can do with the normal model with sparse matrices, but if it takes 2 hours in lmer I can’t imagine getting better than that.

One possibility we should look at is extending Tamara’s recent stuff on replacing big datasets with smaller weighted datasets:

https://arxiv.org/abs/1710.05053

Automating that would be useful.

There’s a Python library:

https://github.com/trevorcampbell/bayesian-coresets

Fake data would be awesome (or code to make fake data)

Rok is building out a branch right now with the GPU GLM implementation. This could be a nice test benchmark.
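As an aside, here is a toy sketch of the “smaller weighted dataset” idea Dan is referring to. To be clear, this is not the coreset construction from the paper or the bayesian-coresets library; it just subsamples uniformly and reweights by N/m, which is the naive baseline a real coreset improves on, and the data, the log_lik function, and the subsample size m are made up for illustration.

```python
import numpy as np

# Naive sketch: approximate the full-data log-likelihood with a small,
# uniformly subsampled dataset whose points carry weight N/m. A coreset
# would pick the points and weights far more cleverly.
rng = np.random.default_rng(0)

N = 300_000
y = rng.normal(loc=1.0, scale=2.0, size=N)     # fake data

def log_lik(theta, y, w=None):
    """Weighted normal(theta, 1) log-likelihood, dropping constants."""
    w = np.ones_like(y) if w is None else w
    return np.sum(w * (-0.5 * (y - theta) ** 2))

m = 1_000                                      # size of the small weighted dataset
idx = rng.choice(N, size=m, replace=False)
weights = np.full(m, N / m)                    # each kept point stands in for N/m points

theta = 0.5
print(log_lik(theta, y))                       # full-data log-likelihood
print(log_lik(theta, y[idx], weights))         # weighted-subset approximation
```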

@Erik_Strumbelj replies:

I second Bob’s question: knowing how long a single iteration/gradient evaluation takes would help us gauge whether GPU computation would solve the problem.

We’re experimenting with the GPU GLM stuff right now. From what I see, I’d say that for this type of model and this much data, we can expect a speedup of 20-30x.
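Since Erik and Bob are asking how long a single gradient evaluation takes, here is a rough, CPU-only way to get an order-of-magnitude baseline for the dense part of the model. The 300,000 x 40 size matches the fixed-effects matrix Aki describes below; the data and parameter values are random placeholders, and the sparse random-effects part is not included.

```python
import time
import numpy as np

# Time the dominant dense operations in one gradient evaluation for a normal
# model: residual = y - X @ beta and gradient wrt beta = X^T residual.
rng = np.random.default_rng(1)
n, p = 300_000, 40
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = np.zeros(p)

t0 = time.perf_counter()
resid = y - X @ beta
grad = X.T @ resid
t1 = time.perf_counter()
print(f"one dense gradient pass: {1000 * (t1 - t0):.1f} ms")
```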

Next, @avehtari steps in:

Ok, I’ll focus on computational issues for [the application at the top of the post] as the normal model.

There is an excellent lmer document describing the computational details:

https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf

lmer forms two design matrices: 1) a dense matrix X for “fixed” effects, which is 300,000 x 40, and 2) a sparse matrix Z, which is 300,000 x (34 + 100,000 + 150 + 119 + 5*34). The matrix Z would be huge as a dense matrix, but as a sparse matrix it is very sparse. lmer uses penalized least squares (PLS; see the above-mentioned document for the details). PLS is implemented in compiled C++ code using the Eigen templated C++ package for numerical linear algebra, and sparse matrices are handled with the CHOLMOD library of C functions. lmer’s PLS is doing marginalization, so it’s more complex than what can be done in lm (OLS). I think lmer is doing everything very well, and the main speed-up could come only from parallelization, e.g. with GPUs. For a similar data size and an SVM, the 2017 GPU speed-up was 20x, i.e. it would reduce the lmer time to 6 minutes.

rstanarm is using the same dense + sparse design matrix approach, but Stan lacks proper sparse matrix support, which probably adds additional overhead.

- Brute force approaches (i.e., still doing full Bayes, just computing more efficiently):
  - Parallel computing
  - GPUs (I don’t know if they apply here or if that’s just for GPs): yes for GPUs, as mentioned above.
  - Precomputing X^tX or something similar: I think lmer is doing all the clever tricks already.
  - Faster adaptation and improved mass matrix: MCMC would be slower than PLS even with these.
- Approximate computation:
  - Laplace: this could maybe be close to lmer speed, given we would have all the tricks they have (e.g. efficient sparse matrices).
  - GMO and ADVI: these would be slower and so far have been quite unstable.
  - EP: if this refers to EP parallelization, that would provide a speedup, with accuracy decreasing as the number of parallel nodes increases. The model would fit the model families described in the EP-life paper, and if dividing by state, the dimensionality of the shared parameters seems feasible.

On Dan’s suggestion of extending Tamara’s recent work on replacing big datasets with smaller weighted datasets (https://arxiv.org/abs/1710.05053): in that paper they are not considering hierarchical models, so hierarchical models would require a bit more thinking. For example, if we think of the data as divided into cells, then each cell has a small dataset, and it depends a lot on the model / prior structure which data we can drop without affecting too much the quantities we care about.

On Steve’s point that Rok’s GPU GLM branch could be a nice test benchmark: for this model and data, I think it would need to support a sparse design matrix, as the dense design matrix has more than 30e9 elements.
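To get a sense of the scale Aki is describing: Z is 300,000 x (34 + 100,000 + 150 + 119 + 5*34) = 300,000 x 100,473, so storing it densely is hopeless, while the sparse representation only keeps a handful of nonzeros per row. The sketch below builds the indicator block for the 100,000-level grouping factor alone, with made-up group assignments; the one-nonzero-per-row structure is my assumption about how that block looks.

```python
import numpy as np
from scipy import sparse

n = 300_000
z_cols = 34 + 100_000 + 150 + 119 + 5 * 34      # = 100,473 columns of Z
print(f"dense Z: {n * z_cols:.2e} float64 entries "
      f"(~{n * z_cols * 8 / 1e9:.0f} GB)")

# Indicator block for the 100,000-level grouping factor, one nonzero per row.
rng = np.random.default_rng(2)
group = rng.integers(0, 100_000, size=n)
Z_block = sparse.csr_matrix(
    (np.ones(n), (np.arange(n), group)), shape=(n, 100_000)
)
mb = (Z_block.data.nbytes + Z_block.indices.nbytes + Z_block.indptr.nbytes) / 1e6
print(f"sparse block: {Z_block.nnz:,} nonzeros (~{mb:.0f} MB)")
```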

Next I write:

Precomputing X^tX or something similar

My impression is that the more general idea here is to use sufficient statistics for the log likelihood and precompute the sufficient stats or parts thereof in the data block.
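To spell that out for the normal linear model: the log-likelihood of y ~ normal(X * beta, sigma) depends on the data only through n, y'y, X'y, and X'X, so those can be computed once up front and reused at every evaluation. Here is a small check of that identity; the sizes and data are made up.

```python
import numpy as np

# Log-likelihood of the normal linear model two ways:
#   raw:       -n/2 * log(2*pi*sigma^2) - ||y - X beta||^2 / (2 sigma^2)
#   suff stat: same, using ||y - X beta||^2 = y'y - 2 beta'X'y + beta'X'X beta
rng = np.random.default_rng(3)
n, p = 10_000, 40
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# precomputed sufficient statistics
XtX, Xty, yty = X.T @ X, X.T @ y, y @ y

def loglik_raw(beta, sigma):
    r = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)

def loglik_suff(beta, sigma):
    quad = yty - 2 * beta @ Xty + beta @ XtX @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - quad / (2 * sigma**2)

beta, sigma = rng.normal(size=p), 1.3
print(np.isclose(loglik_raw(beta, sigma), loglik_suff(beta, sigma)))
```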

@Bob_Carpenter replies to my remark:

It’s also covered in the efficiency chapter of the user’s guide.

As an example, I suggested replacing

y ~ bernoulli(theta);

with the sufficient stat version

sum(y) ~ binomial(size(y), theta);

that’s the same up to an additive constant. And I also suggested going further and exploiting conjugacy, so that if you have a likelihood and conjugate prior

theta ~ beta(alpha, beta); // prior
y ~ binomial(N, theta); // likelihood

you can replace it directly with the posterior

theta ~ beta(alpha + y, beta + N - y); // posterior

Again the expressions are the same function of theta up to an additive constant.
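Bob’s conjugacy replacement is easy to check numerically: the log of prior times likelihood and the log of the collapsed beta posterior should differ by the same constant at every value of theta. A quick check with made-up alpha, beta, N, and y:

```python
import numpy as np
from scipy import stats

# beta(alpha, beta) prior * binomial(N, theta) likelihood vs the conjugate
# posterior beta(alpha + y, beta + N - y): equal up to a constant in theta.
alpha, beta_, N, y = 2.0, 3.0, 50, 17

thetas = np.linspace(0.05, 0.95, 7)
lp_prior_lik = (stats.beta.logpdf(thetas, alpha, beta_)
                + stats.binom.logpmf(y, N, thetas))
lp_posterior = stats.beta.logpdf(thetas, alpha + y, beta_ + N - y)

# The difference is the same for every theta (the log normalizing constant).
print(np.round(lp_prior_lik - lp_posterior, 10))
```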

@avehtari then tells me in person that lmer already implements all sorts of tricks with sufficient statistics and recommends reading their paper.

Hi, just to clarify: I don’t know that we need to be faster than lmer. I mean, if we are faster than lmer, that’s great. But if we can be as fast as lmer and do something more Bayesian (allowing informative priors, incorporating uncertainty better, allowing more flexible and complex models), that would be a plus.

Finally completing our list of usual suspects, @jonah writes:

On Aki’s point that rstanarm uses the same dense + sparse design matrix approach but Stan lacks proper sparse matrix support: yeah, I think it can be quite a bit of additional overhead for large Z matrices. I don’t think there’s anything else we can do in rstanarm about that unless we someday have sparse matrix support. And this isn’t just about speed vs lmer but also about memory usage vs lmer.

I’m curious about Tamara’s work that Dan mentioned but haven’t read it yet. Will look into it.