Speeding up hierarchical models

It’s the determinant of a sparse matrix, so it’s relatively cheap. The dimension is the number of parameters being marginalized rather than the amount of data. Reordering algorithms are automatic. It’s usually worth doing. It fixes all the geometry problems, so it’s decent even in low dimensions.
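
For concreteness, here’s a minimal sketch of the kind of computation involved (plain Eigen, not the actual implementation; the matrix Q is just an illustrative sparse precision matrix). The log determinant falls out of a sparse factorization whose dimension is the number of marginalized parameters, and the fill-reducing (AMD) reordering is applied automatically by the solver:

```cpp
#include <Eigen/Sparse>
#include <iostream>
#include <vector>

int main() {
  // Illustrative sparse SPD precision matrix Q (e.g. the precision of the
  // parameters being marginalized); dimension = number of parameters.
  int n = 5;
  std::vector<Eigen::Triplet<double>> trips;
  for (int i = 0; i < n; ++i) {
    trips.emplace_back(i, i, 2.0);
    if (i + 1 < n) {
      trips.emplace_back(i, i + 1, -0.5);
      trips.emplace_back(i + 1, i, -0.5);
    }
  }
  Eigen::SparseMatrix<double> Q(n, n);
  Q.setFromTriplets(trips.begin(), trips.end());

  // Sparse LDL^T factorization; Eigen applies a fill-reducing (AMD)
  // ordering internally, so the reordering step is automatic.
  Eigen::SimplicialLDLT<Eigen::SparseMatrix<double>> ldlt(Q);

  // log|Q| = sum(log D_ii): cheap once the sparse factorization is done.
  double log_det = ldlt.vectorD().array().log().sum();
  std::cout << "log det Q = " << log_det << "\n";
}
```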

We can have some zoom meetings on implementation options if you want.

A different implementation is the one done by Simon Wood in mgcv, which doesn’t use sparsity. See this and its follow-ups: https://people.bath.ac.uk/man54/SAMBa/ITTs/ITT2/EDF/REMLWood2009.pdf

It’s possible that might be more convenient for brms, but it doesn’t scale as well. And it might be better aligned with GPUs? I’m genuinely not sure about sparse stuff on GPUs.

Anyway. If we’re really heading this way, there will probably end up being multiple routes (like the autodiff for the Cholesky).


Ah, thanks!

However, in instances where the matrix is treated as a whole object such that underlying subsets of the matrix are never assigned to, we can store one autodiff object representing the entire matrix.

I’ll go comment on this later. I like this feature.

I think that is better than doing it automatically.
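
To make the quoted whole-matrix idea concrete, here’s a rough, self-contained sketch (illustrative types only, not Stan Math’s actual classes) of the difference between elementwise autodiff storage and a single autodiff object for the entire matrix:

```cpp
#include <Eigen/Dense>
#include <iostream>

// Elementwise autodiff: every entry is its own value/adjoint pair, so an
// N x M matrix costs N * M separate nodes plus per-node bookkeeping.
struct ad_scalar {
  double value;
  double adjoint;
};

// Whole-matrix autodiff: a single node holding one value matrix and one
// adjoint matrix. This representation is only valid if the matrix is used
// as a whole object, i.e. no sub-block of it is ever assigned to later.
struct ad_matrix {
  Eigen::MatrixXd value;
  Eigen::MatrixXd adjoint;
};

int main() {
  ad_matrix X;
  X.value = Eigen::MatrixXd::Random(3, 3);
  X.adjoint = Eigen::MatrixXd::Zero(3, 3);  // adjoints accumulate here
  std::cout << "one node for a " << X.value.rows() << " x "
            << X.value.cols() << " matrix\n";
}
```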

The issue with stacks is that in each model evaluation you can get a different number of function calls in a different order. Same thing comes up with the optimizer and will happen with the Laplace approximation stuff (we want to use the solution from the previous leapfrog step to initialize this one). I don’t know a solution.

Oh whoops, I’m behind the times. That’s good news lol. That implementation looks nice!

Is the reordering always discussed in terms of the matrix factorization? I was hoping for one in terms of the hierarchical groupings.

I skimmed but I’ll have to spend more time to see how they’re building their matrices.

Yeah I’m pretty convinced we also need to do a sparse matrix GLM. That will be straightforward once the types are there.
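
By a sparse matrix GLM I just mean the usual GLM signatures with a sparse design matrix feeding the linear predictor. A toy sketch of the core computation (plain Eigen, nothing Stan-specific):

```cpp
#include <Eigen/Sparse>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // Toy sparse design matrix Z (e.g. group-membership indicators).
  std::vector<Eigen::Triplet<double>> trips = {
      {0, 0, 1.0}, {1, 0, 1.0}, {2, 1, 1.0}, {3, 2, 1.0}};
  Eigen::SparseMatrix<double> Z(4, 3);
  Z.setFromTriplets(trips.begin(), trips.end());

  Eigen::VectorXd gamma(3);
  gamma << 0.5, -1.0, 2.0;
  Eigen::VectorXd y(4);
  y << 1, 0, 0, 1;

  // Linear predictor: sparse matrix-vector product, O(nnz) not O(N * K).
  Eigen::VectorXd eta = Z * gamma;

  // Bernoulli-logit log likelihood of the linear predictor.
  double lp = 0.0;
  for (int n = 0; n < y.size(); ++n)
    lp += y(n) * eta(n) - std::log1p(std::exp(eta(n)));
  std::cout << "log likelihood = " << lp << "\n";
}
```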

I still like having the separate syntax here because it lets these models be written in Stan without leaning on design matrix libraries in R/Python (and dealing with all the packing and unpacking of parameters).

No. It’s for fill reduction in the algorithm. There’s no reason to expect that groups would be kept together. But it’s deterministic (as a function of the connectivity graph of the matrix).

Same way. They need the (negative twice) log prior to be \beta^T S \beta, which usually puts S as block diagonal. But the algorithm doesn’t really care about that (it deals with the posterior precision X^T W X + S, which is usually not structured).
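
Spelled out (just a sketch in the notation above, with W the usual diagonal weight matrix from the GLM working likelihood and \hat\beta the conditional mode):

-2 \log p(\beta) = \beta^T S \beta + \text{const}

-2 \log p(\beta \mid y) \approx (\beta - \hat\beta)^T (X^T W X + S) (\beta - \hat\beta) + \text{const}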

Then you should be looking at the types of models supported in brms, because they’re quite a bit more complex than this (especially with their links between the prior and the posterior). The code should be the same for basically any type of random effect (structured or unstructured).


I thought some more about this. Gonna rewrite some of this, but I just wanted to leave a couple notes here before this goes stale.

Having X * beta in a Stan model isn’t bad in terms of Stan code, but it usually means we’ve had to write some painful code around Stan.

To generate that X, you’re usually using some sort of external design-matrix library, which means that library decides your variable naming and ordering, and you’ve got to mold all your input/output around that.

But for any of the non-hierarchical groups, we could think of them as hierarchical parameters with one group, so the syntax:

hierarchy(x0, beta0)

Makes sense (it would just be x0 * beta0, or x_0 \beta_0). I don’t think we lose much efficiency by not representing these things as matrix-vector multiplies (we would lose some), and they can still be recognized as such behind the scenes.

And similarly the call:

hierarchy({ is }, alpha, 1)

Represents the term alpha[is] or \alpha_{i[n]}, which is useful regardless of whether or not \alpha is actually modeled hierarchically (this commonly wouldn’t be hierarchical for two-valued groups).

So hierarchy is the wrong name for this, because it is really a shorthand for writing the linear term of a model, and it really isn’t looking at the prior at all.

So these are two directions. One is to make it easier to write efficient hierarchical models in Stan with some sort of weird syntax where we can get around using design matrices, and the other is to have GLMs that support separate dense and sparse design matrices.

This is bringing back nightmares of when Matt and I first started and we tried to enumerate the ways we could code up hierarchical models.

But what you’re suggesting is similar to what the TensorFlow Probability folks are doing with their multiple log density evals in parallel for different chains. To get around waiting on U-turns, I think they run the chains asynchronously and only keep the log density calcs synched.

Sparse matrix support for big GLMs is part of the motivation for sparse matrices.

But if we do know that they’re static data variables, we should be able to just store a reference. And by “static data” here I mean pointers to transformed data or data variables. We couldn’t quite use expressions without evaluating and storing the result, since otherwise the reference wouldn’t exist.
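
A rough sketch of what I mean (illustrative types only, not the actual implementation): an autodiff node for X * beta can hold a pointer into static data rather than a copy, whereas an arbitrary expression would have to be evaluated into its own storage first so there is something stable to point at.

```cpp
#include <Eigen/Dense>
#include <iostream>

// Illustrative autodiff node for y = X * beta where X is static data:
// it keeps a pointer to the data matrix instead of copying it, because
// data / transformed data outlives every model evaluation.
struct data_matvec_node {
  const Eigen::MatrixXd* X;   // reference to static data, no copy
  Eigen::VectorXd beta_value;
  Eigen::VectorXd beta_adjoint;

  Eigen::VectorXd forward() const { return (*X) * beta_value; }
  void backward(const Eigen::VectorXd& result_adjoint) {
    beta_adjoint += X->transpose() * result_adjoint;
  }
};

int main() {
  // Static data: lives for the whole run, so a stored pointer stays valid.
  Eigen::MatrixXd X = Eigen::MatrixXd::Random(4, 3);

  data_matvec_node node{&X, Eigen::VectorXd::Ones(3), Eigen::VectorXd::Zero(3)};
  Eigen::VectorXd y = node.forward();
  node.backward(Eigen::VectorXd::Ones(4));

  // An expression like (X + X) has no stable address; it would have to be
  // evaluated into its own storage before a reference could be kept.
  std::cout << "y(0) = " << y(0) << "\n";
}
```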

Yeah.

The thing that gets me about sparse matrices is that they also line up really nicely with all the math in the lme4 paper and let us integrate out latent variables if the output is normal (sketched below), but actually writing:

y ~ normal(X * beta + Z * gamma, sigma)

or whatever is just not great in plain Stan. You’re depending on non-Stan software to build your design matrices, which means you also need additional software to interpret your results.
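
For reference, the integrate-out-the-latents step is just the standard normal-normal marginalization (in the notation of the snippet above, assuming \gamma \sim N(0, \Sigma_\gamma)):

y = X\beta + Z\gamma + \epsilon, \quad \gamma \sim N(0, \Sigma_\gamma), \quad \epsilon \sim N(0, \sigma^2 I)

\implies \quad y \mid \beta, \sigma, \Sigma_\gamma \sim N(X\beta, \; Z \Sigma_\gamma Z^T + \sigma^2 I)

so \gamma never has to be sampled when the outcome is normal.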

Mmm. It seems less than ideal to do so much software engineering to sync chains just to make matrix-vector operations faster, but 1 core vs. 6 cores being so competitive is something.

If that’s the bottleneck, that’s where you have to go. How much engineering it’s worth depends on whether you’re doing it as a research project or something for users. If you’re doing it for users, it then depends on how much time you’ll save them vs. how much time it takes to code. And of course how fast your software engineers are :-)