I’m summarizing an email exchange we’ve been having over the past days, in case anyone wants to contribute ideas. We have a specific application in mind, but I’m not sure I can share it, so for the moment I’ll focus on the general problem.

@andrewgelman writes:

Hi [all]. […] we’re all interested in scalable Bayesian multilevel modeling, and we now have a specific and live example.

Just as a refresher, here are some ideas we were talking about to speed such computations.

- Brute force approaches (i.e., still doing full Bayes, just computing more efficiently):
  - Parallel computing
  - GPUs (I don’t know if they apply here or if that’s just for GPs)
  - Precomputing X^tX or something similar
  - Faster adaptation and improved mass matrix
- Approximate computation:
  - Laplace
  - GMO
  - ADVI
  - EP

Maybe there are other ideas I’m forgetting?

I’m thinking we’d start with the normal model, then once we’ve made good progress on that we can go to logistic.

Another interesting dimension here is how to improve the model, adding interactions or whatever to go beyond the basic IRT framework.

Next, we have a few exchanges, notably about real-world examples and also some simulated data. Hopefully I can share these in a future post. Then @Daniel_Simpson comments:

There’s a pile of stuff we can do with the normal model with sparse matrices, but if it takes 2 hours in lmer I can’t imagine getting better than that.

One possibility we should look at is extending Tamara’s recent stuff on replacing big datasets with smaller weighted datasets:

https://arxiv.org/abs/1710.05053

Automating that would be useful.

There’s a Python library:

https://github.com/trevorcampbell/bayesian-coresets

Fake data would be awesome (or code to make fake data)

Rok is building out a branch right now with the GPU GLM implementation. This could be a nice test benchmark.
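As an aside, here is a toy sketch of the “smaller weighted dataset” idea Dan is referring to. To be clear, this is not the coreset construction from the paper or the bayesian-coresets library; it just subsamples uniformly and reweights by N/m, which is the naive baseline a real coreset improves on, and the data, the log_lik function, and the subsample size m are made up for illustration.

```python
import numpy as np

# Naive sketch: approximate the full-data log-likelihood with a small,
# uniformly subsampled dataset whose points carry weight N/m. A coreset
# would pick the points and weights far more cleverly.
rng = np.random.default_rng(0)

N = 300_000
y = rng.normal(loc=1.0, scale=2.0, size=N)     # fake data

def log_lik(theta, y, w=None):
    """Weighted normal(theta, 1) log-likelihood, dropping constants."""
    w = np.ones_like(y) if w is None else w
    return np.sum(w * (-0.5 * (y - theta) ** 2))

m = 1_000                                      # size of the small weighted dataset
idx = rng.choice(N, size=m, replace=False)
weights = np.full(m, N / m)                    # each kept point stands in for N/m points

theta = 0.5
print(log_lik(theta, y))                       # full-data log-likelihood
print(log_lik(theta, y[idx], weights))         # weighted-subset approximation
```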

@Erik_Strumbelj replies:

I second Bob’s question: knowing how long a single iteration/gradient evaluation takes would help us gauge whether GPU computation would solve the problem.

We’re experimenting with the GPU GLM stuff right now. From what I see, I’d say that for this type of model and this much data, we can expect a speedup of 20-30x.
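Since Erik and Bob are asking how long a single gradient evaluation takes, here is a rough, CPU-only way to get an order-of-magnitude baseline for the dense part of the model. The 300,000 x 40 size matches the fixed-effects matrix Aki describes below; the data and parameter values are random placeholders, and the sparse random-effects part is not included.

```python
import time
import numpy as np

# Time the dominant dense operations in one gradient evaluation for a normal
# model: residual = y - X @ beta and gradient wrt beta = X^T residual.
rng = np.random.default_rng(1)
n, p = 300_000, 40
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = np.zeros(p)

t0 = time.perf_counter()
resid = y - X @ beta
grad = X.T @ resid
t1 = time.perf_counter()
print(f"one dense gradient pass: {1000 * (t1 - t0):.1f} ms")
```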

Next, @avehtari steps in:

Ok, I’ll focus on computational issues for [the application at the top of the post] as the normal model.

There is an excellent lmer document describing the computational details:

https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf

lmer forms two design matrices: 1) a dense matrix X for “fixed” effects, which is 300,000 x 40, and 2) a sparse matrix Z, which is 300,000 x (34 + 100,000 + 150 + 119 + 5*34). The matrix Z would be huge as a dense matrix, but as a sparse matrix it is very sparse. lmer uses penalized least squares (PLS; see the above-mentioned document for the details). PLS is implemented in compiled C++ code using the Eigen templated C++ package for numerical linear algebra, and sparse matrices are handled with the CHOLMOD library of C functions. lmer’s PLS is doing marginalization, so it’s more complex than what can be done in lm (OLS). I think lmer is doing everything very well, and the main speed-up could come only from parallelization, e.g. with GPUs. For a similar data size and an SVM, the 2017 GPU speed-up was 20x, i.e. it would reduce the lmer time to 6 minutes.

rstanarm is using the same dense + sparse design matrix approach, but Stan lacks proper sparse matrix support, which probably adds additional overhead.

- Brute force approaches (i.e., still doing full Bayes, just computing more efficiently):
  - Parallel computing
  - GPUs (I don’t know if they apply here or if that’s just for GPs): yes for GPUs, as mentioned above.
  - Precomputing X^tX or something similar: I think lmer is doing all the clever tricks already.
  - Faster adaptation and improved mass matrix: MCMC would be slower than PLS even with these.
- Approximate computation:
  - Laplace: this could maybe be close to lmer speed, given we would have all the tricks they have (e.g. efficient sparse matrices).
  - GMO and ADVI: these would be slower and so far have been quite unstable.
  - EP: if this refers to EP parallelization, that would provide a speedup, with accuracy decreasing as the number of parallel nodes increases. The model would fit the model families described in the EP-life paper, and if dividing by state, the dimensionality of the shared parameters seems feasible.

On Dan’s suggestion of extending Tamara’s recent work on replacing big datasets with smaller weighted datasets (https://arxiv.org/abs/1710.05053): in that paper they are not considering hierarchical models, so hierarchical models would require a bit more thinking. For example, if we think of the data as divided into cells, then each cell has a small dataset, and it depends a lot on the model / prior structure which data we can drop without affecting too much the quantities we care about.

On Steve’s point that Rok’s GPU GLM branch could be a nice test benchmark: for this model and data, I think it would need to support a sparse design matrix, as the dense design matrix has more than 30e9 elements.
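To get a sense of the scale Aki is describing: Z is 300,000 x (34 + 100,000 + 150 + 119 + 5*34) = 300,000 x 100,473, so storing it densely is hopeless, while the sparse representation only keeps a handful of nonzeros per row. The sketch below builds the indicator block for the 100,000-level grouping factor alone, with made-up group assignments; the one-nonzero-per-row structure is my assumption about how that block looks.

```python
import numpy as np
from scipy import sparse

n = 300_000
z_cols = 34 + 100_000 + 150 + 119 + 5 * 34      # = 100,473 columns of Z
print(f"dense Z: {n * z_cols:.2e} float64 entries "
      f"(~{n * z_cols * 8 / 1e9:.0f} GB)")

# Indicator block for the 100,000-level grouping factor, one nonzero per row.
rng = np.random.default_rng(2)
group = rng.integers(0, 100_000, size=n)
Z_block = sparse.csr_matrix(
    (np.ones(n), (np.arange(n), group)), shape=(n, 100_000)
)
mb = (Z_block.data.nbytes + Z_block.indices.nbytes + Z_block.indptr.nbytes) / 1e6
print(f"sparse block: {Z_block.nnz:,} nonzeros (~{mb:.0f} MB)")
```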

Next I write:

Precomputing X^tX or something similar

My impression is that the more general idea here is to use sufficient statistics for the log likelihood and precompute the sufficient stats or parts thereof in the data block.
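To spell that out for the normal linear model: the log-likelihood of y ~ normal(X * beta, sigma) depends on the data only through n, y'y, X'y, and X'X, so those can be computed once up front and reused at every evaluation. Here is a small check of that identity; the sizes and data are made up.

```python
import numpy as np

# Log-likelihood of the normal linear model two ways:
#   raw:       -n/2 * log(2*pi*sigma^2) - ||y - X beta||^2 / (2 sigma^2)
#   suff stat: same, using ||y - X beta||^2 = y'y - 2 beta'X'y + beta'X'X beta
rng = np.random.default_rng(3)
n, p = 10_000, 40
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# precomputed sufficient statistics
XtX, Xty, yty = X.T @ X, X.T @ y, y @ y

def loglik_raw(beta, sigma):
    r = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)

def loglik_suff(beta, sigma):
    quad = yty - 2 * beta @ Xty + beta @ XtX @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - quad / (2 * sigma**2)

beta, sigma = rng.normal(size=p), 1.3
print(np.isclose(loglik_raw(beta, sigma), loglik_suff(beta, sigma)))
```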

@Bob_Carpenter replies to my remark:

It’s also covered in the efficiency chapter of the user’s guide.

As an example, I suggested replacing

y ~ bernoulli(theta);

with the sufficient stat version

sum(y) ~ binomial(size(y), theta);

that’s the same up to an additive constant. And I also suggested going further and exploiting conjugacy, so that if you have a likelihood and conjugate prior

theta ~ beta(alpha, beta); // prior
y ~ binomial(N, theta); // likelihood

you can replace it directly with the posterior

theta ~ beta(alpha + y, beta + N - y); // posterior

Again the expressions are the same function of theta up to an additive constant.
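Bob’s conjugacy replacement is easy to check numerically: the log of prior times likelihood and the log of the collapsed beta posterior should differ by the same constant at every value of theta. A quick check with made-up alpha, beta, N, and y:

```python
import numpy as np
from scipy import stats

# beta(alpha, beta) prior * binomial(N, theta) likelihood vs the conjugate
# posterior beta(alpha + y, beta + N - y): equal up to a constant in theta.
alpha, beta_, N, y = 2.0, 3.0, 50, 17

thetas = np.linspace(0.05, 0.95, 7)
lp_prior_lik = (stats.beta.logpdf(thetas, alpha, beta_)
                + stats.binom.logpmf(y, N, thetas))
lp_posterior = stats.beta.logpdf(thetas, alpha + y, beta_ + N - y)

# The difference is the same for every theta (the log normalizing constant).
print(np.round(lp_prior_lik - lp_posterior, 10))
```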

@avehtari then tells me in person that lmer already implements all sorts of tricks with sufficient statistics and recommends reading their paper.

Hi, just to clarify: I don’t know that we need to be faster than lmer. I mean, if we are faster than lmer, that’s great. But if we can be as fast as lmer and do something more Bayesian (allowing informative priors, incorporating uncertainty better, allowing more flexible and complex models), that would be a plus.

Finally completing our list of usual suspects, @jonah writes:

On Aki’s point that rstanarm uses the same dense + sparse design matrix approach but Stan lacks proper sparse matrix support: yeah, I think it can be quite a bit of additional overhead for large Z matrices. I don’t think there’s anything else we can do in rstanarm about that unless we someday have sparse matrix support. And this isn’t just about speed vs lmer but also about memory usage vs lmer.

I’m curious about Tamara’s work that Dan mentioned but haven’t read it yet. Will look into it.