Efficient static_matrix type?

autodiff
matrix

#1

There’s been ongoing demand for more efficient matrix structures for reverse-mode autodiff in Stan. I’ve been thinking about this for years and @rtrangucci and @Daniel_Simpson have renewed the plea.

Recap of where we’re at

To recap, the current matrix type in reverse-mode is Matrix<var, -1, -1>. This is an Eigen structure that holds an array of M x N values with in column-major order along with the row size and column size. This makes it easy to get and set entries in matrices. It also makes it very hard to grab the underlying double matrix corresponding to the values of the variables and the underlying double matrix corresponding to the adjoints, which we need for efficient matrix autodiff. As is, we wind up copying to and from those structures with every operation. That’s terribly inefficient and we’d like to do better.

The matrix structure we need for autodiff

The fundamental obstacle is that we want to have a structure like this:

struct matrix_vari {
  double* val_;  // use Map<MatrixXd>(val_) to use as matrix
  double* adj_;
};

and then our usual pointer-to-implementation

struct matrix_var {
  matrix_vari* vi_;
};

and we’d manage all the memory in our arena like other autodiff variables. Easy, right? Yes, but how do we get and set values? The getting is actually pretty easy. Setting is pretty much impossible with this structure.

Setting and getting in reverse-mode autodiff

The design of reverse-mode autodiff needs to ensure that whatever is pointed to at one point doesn’t change values. If you want some new type, you create a new node in the expression graph. So even when you see:

var x = 5;
x = x^2;

what is going on is that the first statement declares and creates a a vari for the constant 5 (i.e., a new “independent variable”). At this point, the var instance x points to this vari. Then when we execute the second statement, we create a brand new vari for the square of the original vari. This new vari is then the value of the pointer in x. But the original vari never changes—it’s always the constant 5.

Getting

I think the getting is doable with a little extra memory overhead that essentially creates a new vari for every get operation with a unit partial linked back to the original matrix. That maintains integrity of the vari everywhere with a bit of overhead.

Setting with the new matrix structure

We’d like to be able to do this:

matrix_var x(2, 3);
...
x[1, 1] = 5;  // ???

But if we want to maintain the integrity of our original matrix, then we need to create an entire copy. That’s a ridiculous amount of overhead to do things like create an ad-hoc matrix (like for GPs).

A possible solution

How about a static matrix type that simply doesn’t allow setting? We can have a function to create one from an ordinary matrix or array:

matrix[4, 5] a;
for (i in 1:4) for (j in 1:5) a[i, j] = ...;
fixed_matrix[4, 5] a_st = a;

Then there’s a single copy and all subsequent matrix operations with a_st are going to be efficient and local. We could declare parameters or data to be static_matrix. And we could have things like our covariance functions for GPs or our Cholesky factorizations return static matrices.

For cases like this, it’d probably be a little less efficient than what we have now which does the fixed matrix construction once internally and wouldn’t require any autodiff redirection. The fixed_matrix is going to need to store pointers to all of the original vari out of which it was constructed and then push autodiff along the links with unit partials.


A proposal for sparse matrices and GPs in Stan
#2

This is neat, when Justin and I were talking about the autodiff we thought the original approach to avoiding copying was pretty clever. I think I understand but let me check: Does the efficiency problem you’re trying to avoid come from not having the doubles from the ‘vari’ objects inside the matrix all next to each other in memory?


#3

But if we want to maintain the integrity of our original matrix, then we need to create an entire copy

Oh dang, that is a problem. Wouldn’t have guessed.

Couldn’t we just call this a const keyword?

matrix[4, 5] a;
for (i in 1:4) for (j in 1:5) a[i, j] = ...;
fixed_matrix[4, 5] a_st = a;

I take it this means we’re gonna be able to declare variables after code now without having to do weird blocks in blocks in blocks :D.


#4

My (naive) thought was that stan_matrix would be a templated class, where the specialisation would be a static matrix and the var version would be a standard matrix.

Having some sort of fixed_matrix class is great, but what’s really needed is specialisations of the Stan functions to specify which parts of hte matrix vary and which don’t. Simple example is a covariance matrix of a GP with a fixed length scale, which is only a var matrix because it’s a fixed matrix multipled by a var. To my “I haven’t checked” knowledge, the standard way of specifyting this is with a multi_normal_chol, which is faster than having to compute the chol every time, but is still O(n^2) vars instead of 1…


#5

For sparse matrices, you can do a little more. For example if a matrix is specified as (innerIndex,outerIndex,values) [Bob just loves camelCase, I’m sure], then if the matrix is symmetric positive defintite and needs a cholesky, you can carry around both the reordering and the “symbolic factorisation” (basically, the precomputing for the cholesky). This is not a bad idea. But it would need a sparse_spd type, which might be a bit challenging to deal with (especially if you want to strictly enforce SPD at every step, rather than just “wait for the cholesky to fail”).


#6

I think what you are referring to is the need for a covariance matrix parent class more than what Bob is describing for a fixed_matrix type. For the reasons you mentioned and others, I considered this back in the early days of Stan but gave up. It was difficult and a lot of the savings one could imagine getting were not going to be realized because Stan destroys its containers (except those defined in data and transformed data) after every leapfrog step.


#7

Just to double check (I have never looked at the inference engine code). Does this mean that at each leapfrog step, that requires a type container_type foo; the Stan code constructs a brand new container_type foo(constructor_args);?


#8

For things in parameters, transformed parameters, and model. One might think the containers persist until the sampling is finished and it is just their values that get updated for each leapfrog step, but alas, no. So, that made it less attractive to think about a container that would, for example, have a constant SPD matrix and an unknown positive scalar var, so that you could Cholesky factor the constant part once and the PDFs could be specialized to factor the unknown scalar out and minimize the number of nodes in the AD expression graph. Maybe it is more possible to refactor things along these lines today than it was in 2011, but I have not heard Bob imply anything either way.


#9

yeah - that makes sense. I’m in favour of abstracting away things that can be abstracted away. But if this is best dealt with by updating the “efficient Stan” section, then definitely that’s a good idea. Maybe @betanalpha can squeeze it into his GP case studies.


#10

To be clear, there are a few reasons for the lack of persistence.

One is abstraction – the only thing that is maintained from iteration to iteration is a point in the unconstrained space with no structure whatsoever as the algorithms don’t need to know anything about that structure. The containers are recreated because the model has to take this unstructured vector of values and format it into the relevant containers. Conceivably the model concept could be modified to persist these containers, but there would still be copying involved when computing log prob.

The other issue is that loops in the language allow temporary containers to have variable size and hence the size of the expression graph is not constant from iteration to iteration. Trying to persistent substructure of an expression graph is, from my understanding, a very hard problem with only very heuristic-driven solutions.

Consequently we have left the optimization of these issues to the user, with the expressiveness of the language helping out. For example, having lpdfs that are heavily templated to be able to recognize and utilize certain structure from how they are called.

Another consideration is how specialized the lpdfs become. If you create a temporary matrix that depends on a var then you need all N^{2} vars because the language doesn’t know where and how the matrix elements will be used. If they will only be used in an lpdf then I think that we should develop an lpdf that enforces this guarantee by taking in the raw values and computing the gradient analytically to avoid a large intermediate matrix of vars (and also never making it available to the user). This was the motivation for the policy for adding new GP functions, https://github.com/stan-dev/stan/wiki/Adding-a-Gaussian-Process-Covariance-Function.

Other examples include ODE output (to avoid computing the full Jacobian if we end up using the outputs in only a quadratic form) and potentially the algebraic solver output. These cases can all be specialized to special lpdfs (nominally Gaussian specializations) to achieve the possible speed up.

Following this logic we could also consider specialized types as @Bob_Carpenter suggests. For example, we could have matrix types that say don’t allow access of individual elements but then require specializing all of the matrix operations.


#11

Creating new containers and copying is probably not a measureable bottleneck.

There’s no way to predict exactly which containers of which sizes will be needed each iteration other than for the parameters. But the parameters need to be calculated each iteration from a basic unconstrained parameter vector, which dominates the cost of actually constructing a new container. This is essentially what Michael is saying here:

Next up,…

Nothing important has changed. We are looking at ways to push data off once, but not at a way to maintain containers of parameters.


#12

This abstraction seems wrong. As soon as you have structured data types (e.g. A matrix) you need container info (in that type dimensions) to evaluate the log density and it’s derivatives. Current structure hacks around this by putting that info in data, but that’s awkward for more complex containers.


#13

All the algorithms need is a state and a way to evaluate the log density and its gradient. They don’t need to know anything about how that state maps into structured containers, and having a single monolithic vector makes implementing the algorithms all the more simpler.


#14

Yes. At an efficiency cost.


#15

An example here might help. Here’s one where the “helper” information needed depends on the context in which an object is used.

If you have a sparse matrix and you want to do a Cholesky decomposition, you need to reorder the rows and columns. Because Stan needs them to be in the same order every time, this must be computed once and preserved.

But until you ask for a Cholesky, there’s no way for the program to know that it needs a reordering.

This can be worked around by either requiring users to know a lot about sparse linear algebra (not optimal) or by having another type in the Stan language and not letting general Stan language sparse matrices be used as arguments to anything that calls a Cholesky (messy).

Edit to say maybe the third way isn’t as messy as I’d thought. But it would be messy if using the Stan math library for something else (in which case you’d have to keep track of the reorderings manually, which is different from every other package I know of that does sparse linear algebra). But this is not the main use for the library, so a bit of inefficiency there to make it play nicer with the samplers is maybe not horrible.


#16

How expensive is the row and column reordering that you think it needs to be stored? Not just in absolute terms, but compared to the full Cholesky decomposition and resulting autodiff?

The know linear algebra bit would be in making the right declaration so it’s only declared once or sorting the parameters the right way so it doesn’t need reordering?

I’m not sure how we could store things iteration to iteration. It will break the clean immutability of the model class, which I’d rather not do unless absolutely necessary.


#17

Not expensive. Storing the symbolic factorisation is more of a saving. If you know to do the reordering (for example if you can only cholesky a special type of sparse matrix that is square and has been reordered) then everything is fine. On output, you’d probably want to revert the matrix to the original order, so whatever dumps a state to a file (or stream) should know about that.

If the reordering algorithm is deterministic (and honestly, I don’t know the answer to this. It would be implementation dependent probably) AND we’re very careful not to change the sparsity structure from iteration to iteration (this can happen automatically if the value of a particular entry happens to be 0.0 only in one iteration), then we could probably put this inside the Cholesky decomposition. But I’m worried that that would be more prone to subtle bugs than just computing the reordering once.


#18

If the sparsity can change, that’s a good motivation for computing just once.

But my bigger concern is that if zeros get dropped, then autodiff won’t work because the terms matter even if the values are zero.


#19

I wrote 3 ways that should keep the information statically without having to change anything deep in the Stan architecture in the sparse matrix spec. Of those, 1 and 3 are better than 2 (which one’s the best depends on how you balance user-friendliness vs flexibility. I prefer option 3). So you can (if you’re being careful) get around all these issues without needing persistent containers.