Scalable Bayesian multilevel modeling

I’m summarizing an email exchange we’ve been having over the past days, in case anyone wants to contribute ideas. We have a specific application in mind, but I’m not clear that I can share it, so for the moment I’ll focus on the general problem.

@andrewgelman writes:

Hi [all]. […] we’re all interested in scalable Bayesian multilevel modeling, and we now have a specific and live example.

Just as a refresher, here are some ideas we were talking about to speed such computations.

  1. Brute force approaches (i.e., still doing full Bayes, just computing more efficiently):
  • Parallel computing

  • GPU’s (I don’t know if they apply here or if that’s just for GP’s)

  • Precomputing X^tX or something similar

  • Faster adaptation and improved mass matrix

  1. Approximate computation:
  • Laplace

  • GMO

  • ADVI

  • EP

Maybe there are other ideas I’m forgetting?

I’m thinking we’d start with the normal model, then once we’ve made good progress on that we can go to logistic.

Another interesting dimension here is how to improve the model, adding interactions or whatever to go beyond the basic IRT framework.

Next, we have a few exchanges, notably about real world examples and also some simulated data. Hopefully, I can share in a future post. Then @Daniel_Simpson comments:

There’s a pile of stuff we can do with the normal model with sparse matrices, but if it takes 2 hours in lmer I can’t imagine getting better than that.

One possible that we should look at is extending Tamara’s recent stuff on replacing big datasets with smaller weighted datasets
https://arxiv.org/abs/1710.05053

Automating that would be useful.

There a python library
https://github.com/trevorcampbell/bayesian-coresets

@Stevo15025:

Fake data would be awesome (or code to make fake data)

Rok is building out a branch right now with the GPU glm implementation. This could be a nice test benchmark

@Erik_Strumbelj replies:

I second Bob’s question - knowing how much a single iteration/gradient evaluation takes would help us gauge if GPU computation would solve the problem.

We’re experimenting with the GPU GLM stuff right now. From what I see, I’d say that for this type of model and this much data, we can expect a speedup of 20-30.

Next, @avehtari steps in:

Ok, I’ll focus on computational issues for [the application at the top of the post] as the normal model.

There is an excellent lmer document describing the computational details
https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf
lmer forms two design matrices 1) dense matrix X for “fixed” effects which is
300000 x 40 and 2) sparse matrix Z which is 300,000 x (34+100000+150+119+5*34).
The matrix Z would be huge as dense matrix, but as a sparse matrix is very
sparse. lmer uses penalized least squares (see the above mentioned document for
the details). PLS is implemented in compiled C++ code using the Eigen
templated C++ package for numerical linear algebra and sparse matrices are
handled with the CHOLMOD library of C functions. lmer PLS is doing
marginalization so it’s more complex than what can be done in lm (OLS). I think
lmer is doing everything very well, and the main speed-up could come only from
parallelization, e.g. with GPUs. For a similar data size and SVM, 2017 GPU
speed-up was 20x, ie would reduce lmer time to 6 minutes.

rstanarm is using the same dense + sparse design matrix approach, but Stan
lacks proper sparse matrix support which probably adds additional overhead.

  1. Brute force approaches (i.e., still doing full Bayes, just computing more
    efficiently):
  • Parallel computing
  • GPU’s (I don’t know if they apply here or if that’s just for GP’s)

Yes for GPUs as mentioned above,

  • Precomputing X^tX or something similar

I think lmer is doing all clever tricks already.

  • Faster adaptation and improved mass matrix

MCMC would be slower than PLS even with these.

  1. Approximate computation:
  • Laplace

This could maybe be close to lmer speed given we would have all the tricks they
have (e.g. efficient sparse matrices).

  • GMO
  • ADVI

These would be slower and so far have been quite unstable.

  • EP

If this refers to EP parallelization, that would provide speedup with accuracy
decreasing with increasing number of parallel nodes. The model would fit the
model families described in EP-life paper and if dividing by state, the
dimensionality of shared parameters seems to be feasible.

On 06/06/2019 05.45, Daniel Simpson wrote:

There’s a pile of stuff we can do with the normal model with sparse matrices,
but if it takes 2 hours in lmer I can’t imagine getting better than that.

One possible that we should look at is extending Tamara’s recent stuff on
replacing big datasets with smaller weighted datasets
https://arxiv.org/abs/1710.05053

In that paper they are not considering hierarchical models, so for hierarchical
models would require a bit more thinking. For example if we think that data is
divided to cells, then each cell has small dataset and it depends a lot on the
model / prior structure what data we can drop without affecting too much the
quantities we care.

On 06/06/2019 05.51, Steve Bronder wrote:

Rok is building out a branch right now with the GPU glm implementation. This
could be a nice test benchmark

For this model and data, I think it would need to support sparse design matrix
as the dense design matrix has more than 30e9 elements.

Next I write:

Precomputing X^tX or something similar

My impression is that the more general idea here is to use sufficient statistics for the log likelihood and precompute the sufficient stats or parts thereof in the data block.

and @Bob_Carpenter replies to my remark:

It’s also covered in the efficiency chapter of the user’s guide.
As an example, I suggested replacing

y ~ bernoulli(theta):

with the sufficient stat version

sum(y) ~ binomial(size(y), theta);

that’s the same up to an additive constant. And I also suggested
going further and exploiting conjugacy, so that if you a
likelihood and conjugate prior

theta ~ beta(alpha, beta); // prior
y ~ binomial(N, theta); // likelihood

you can replace it directly with the posterior

theta ~ beta(alpha + y, beta + N - y) // posterior

Again the expressions are the same function of theta up to
an additive constant.

@avehtari then tells me in person that lmer already implements all sorts of tricks with sufficient statistics and recommends reading their paper.

@andrewgelman:

Hi, just to clarify: I don’t know that we need to be faster than lmer. I mean, if we are faster than lmer, that’s great. But if we can be as fast as lmer and do something more Bayesian (allowing informative priors, incorporating uncertainty better, allowing more flexible and complex models), that would be a plus.

Finally completing our list of usual suspects, @jonah writes:

rstanarm is using the same dense + sparse design matrix approach, but Stan
lacks proper sparse matrix support which probably adds additional overhead.

Yeah I think it can be quite a bit of additional overhead for large Z matrices. I don’t think there’s anything else we can do in rstanarm about that unless we do someday have sparse matrix support. And this isn’t just about speed vs lmer but also about memory usage vs lmer.

I’m curious about Tamara’s work that Dan mentioned but haven’t read it yet. Will look into it.

11 Likes

Couple questions on lme4 and random effects models,

It looks like lme4 is doing this thing where it integrates out random effects. They end up with this iterative back and forth least squares thing. Why integrate out the random effects on the fly like this? Why not just compute the MAP of the whole posterior, do a normal approximation to the posterior around that, and then integrate all the random effects out of that in closed form? The first seems more difficult. Is the second thing a really bad approximation? Or is it flawed in some way I’m not considering?

From the lme4 paper, “Z is the n×q model matrix for the q-dimensional vector-valued random effects variable”. Is Z estimated as part of the problem? Or is this the design matrix attached to the random effects?

Is the solution from lmer not useful?

I am just curious what the intent of this is… just making it work in this instance in Stan…or do we expect to improve things by doing it in Stan?

(I don’t want to derail this thread, so feel free to ignore)

It’s not defined. When you have a vector \alpha with prior \alpha \sim \textrm{normal}(\mu, \sigma), there’s no bound on the density as \alpha_i \rightarrow \mu and \sigma \rightarrow 0. One solution is to introduce zero-avoiding priors like lognormal, but that changes the model.

It’s the design matrix.

Another question, the regression was originally included this term:

(age, educ, income, religiosity, potus2016 | state_code)

Is that the same as:

(age | state_code) + (educ | state_code) + (income | state_code) + (religiosity | state_code) + (potus2016 | state_code)

?

Anyway, I assume this was true and put together some generated data to figure out how fast the gradient evals would be.

The brms regression I used for the fake data looks like:

responses ~ (1 | state) + (1 | respondent) + (1 | dem) + (1 | rep) +
    educ * nonwhite + inc * nonwhite + relig * nonwhite + 
    V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + 
    V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 + 
    V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31 + 
    V32 + V33 + V34 + V35 + V36 + V37 + V38 + V39 + V40 +
    (age - 1 | state) +
    (educ - 1 | state) +
    (inc - 1 | state) +
    (relig - 1 | state) +
    (potus2016 - 1 | state)

The Vs are the respondent, state, and candidate covariates.

The script to generate fake data and whatnot is here:
fake_election_data.R (7.3 KB)

It looks like with 300k responses and 100k people, the gradient evals were taking 250 milliseconds on my computer.

@charlesm93 this is the kind of model where we can do the Laplace approximation thingy to integrate out the 100k individual level parameters, right? What’s the impediment to doing that?

If I understand this correctly, other than those individual effects, there are like 500 other parameters in this model, which seems super do-able.

Edit: Updated data generating script. I uploaded an old one

2 Likes

The Gaussian case is doable once we have sparse matrices. The non-Gaussian case is doable once we also have Laplace approximations.

1 Like

But since the individual effects are so simple, can’t we just do this specific model without the fancy infrastructure? Same as with what’s happening here: Algebraic sovler problems (Laplace approximation in Stan) ?

Looking at eq. 24 here: https://arxiv.org/pdf/1206.5754.pdf

The things we need to consider are the prior covariance (which is diagonal) and then the Hessian of the likelihood with respect to the parameters we’re Laplacing out. This Hessian is still diagonal, right? So it’s just math with diagonal matrices?

1 Like

Nope.

If you do the maths and y \mid \sim N(Au, \sigma^2I) and u \sim N(u_0, \tau^2I), then you need \log(\det(\tau^{-2}I + \sigma^{-2}A^TA)). Here A is going to be sparse but A^TA is not diagonal.

Also in a lot of multilevel models there is correlation between tthe groups so those matrices aren’t diagonal, and so u \sim N(0,Q_u^{-1}) and you need \log(\det(Q_u + \sigma^{-2}A^TA)), where Q_u is block diagonal and depends on parameters.

Cool that makes sense, but isn’t the A for the individual effects in this specific model going to lead to a diagonal A^TA? It makes sense that this’d not be true for more complicated things though.

Like, if we have three observations across two people, then for this model we can say:

u \sim N(0, \tau^2 I) \\ y_1 \sim N(\alpha_1 + u_1, \sigma^2) \\ y_2 \sim N(\alpha_2 + u_2, \sigma^2) \\ y_3 \sim N(\alpha_3 + u_2, \sigma^2)

where all the non-individual effects and whatnot are crammed into \alpha (we’ll sample those parameters). So in this case

A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}, A^T A = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}

The entries of A dependent on the covariate values. Au has to be the conditional mean.

If you take the simplest model
y_{ij}\sim N(a_j + b_jx_{ij}, \sigma^2)
a_j\sim N(a, \tau_a^2)
b_j\sim N(b, \tau_b^2)
a,b \sim N(0,1)

Write u=(a,a_j,b,b_j) and \theta = (\sigma,\tau_a, \tau_b).

Then find the precision matrix for u\mid \theta,y and from that you can use the identity
p(\theta\mid y) \propto\frac{p(y,u,\theta)}{p(u\mid\theta,y)}
evaluated at u=0 to see the relevant determinants.

2 Likes

I guess I just decided to ignore the conditioning, whoops.

Which sparse matrix ops do we need for this? Is it just the structured log determinants, \log \det \left(\tau^{-2}\, \textrm{I} + \sigma^{-2} \, A^{\top} A \right) and \log \det \left(Q_u + \sigma^{-2} \, A^{\top} A \right)?

And a sparse linear solve for generated quantities.

Also some plumbing stuff to build block sparse matrices, but that’s not particularly hard

Thanks. I think it’s time to start building these things. We’ve talked about two approaches:

  1. Waiting until we get a proper sparse matrix data structure; that’s a big undertaking if we need a roughly complete set of matrix operations.

  2. Following the approach we took to sparse matrix-vector multiply and just coding CSR or other data structures directly into the arguments.

With (2), we can get going on this right away. With (1), there’s a lot ot be done if we want a roughly complete set of matrix operations with sparse matrices and want them to interact with dense matrices nicely (in things like multiplication). So I’m inclined to take approach (2) to the key functions we need. We can always use that implementation and deprecate the function when we move to proper sparse matrices.

  1. is possible for a Gaussian multilevel model (incl spline components, basis function GPs, ICAR models) with a lot of work on the interface side to coerce all the matrices into the the right form.

I think that for a sparse precision multinormal to be useful, we would need 1) because even the simplest model requires that we add two sparse matrices with different sparsity structures.

But that’s exactly what’s hard with built-in sparse matrices—declaring the sparsity structure. If we just need those log determinants, then we’d pass in \tau, \sigma, A in the first example and Q_u, \sigma, A in the second example. The signatures could be

real log_determinant(real tau, real sigma, int[] iA, int[] jA, real[] valJ);

and

real log_determinant(int[] iQ, int[] jQ, real[] valQ, real sigma, int[] iA, int[] jA, real[] valA);

I don’t have time for this today (sorry) but that’s not quite what you want (or else we end up with a million billion signatures). This is easier through the actual multilevel model spec. Because then it’s dets of double matrices and the partial inverse of double matrices that need to be implemented and that’s an easier task.

There are a few reasons we want to fit this in Stan:

  • lmer doesn’t allow informative priors, hence it can give unstable estimates. This becomes even more of an issue with varying-intercept, varying-slope models where you need to estimate a group-level covariance matrix
  • Relatedly, lmer sometimes crashes or doesn’t run if you start throwing in more batches of varying coefficients.
  • Once model is in Stan, in addition to being able to add priors, we can extend the model in various ways beyond what’s in lmer.
  • Posterior uncertainty is better than noisy point estimate of hyperparameters.
  • Ultimately if we can program it in Stan we might be able to run it faster. For now, I’d be happy to be able to do it in the same time, or not much slower than, lmer, as we’d get the benefits listed above. But speed gains could come too.
2 Likes