I’m trying to figure out what the use cases are for sparse
matrices (not ragged data structures, not structured matrices
like triangular or tri-diagonal or anything else) before trying
to implement them.

So far, the only use case I can think of is to have a sparse
data matrix X, a dense parameter matrix beta, and compute
X * beta. That’s what we already have hacked in with CSR.

What else are sparse matrices good for?

The problem I’m running into is that for parameters, transformed
parameters, and generated quantities, we need to fix the sparsity
pattern once and for all because Stan requires a fixed set of
parameters for I/O. Are there cases where local variables
or transformed data have varying sparsity over their scope?
Function arguments should be no problem, because those don’t take sizes.

I’m imagining by analogy to matrices, e.g.,

matrix[4, 3] x;

I’d implement a sparse matrix the same way:

sparse_matrix[4, 3] x;

But that won’t work for parameters because it doesn’t fix the
sparsity pattern. Instead, I could take an array of non-zero
indexes

int nz[J, 2];

that indicates J non-zero positions, then declare a sparse
matrix like:

sparse_matrix[M, N, nz];

to say that it’s M x N and the non-zero entries are as given in nz.
Then we’d throw exceptions if anything is assigned to the
zero positions or if any of the nz[j] are outside M x N.
But that’d prohibit doing things like adding or multiplying sparse
matrices unless we could compute the sparsity pattern ahead of time
for the declaration.

I doubt there are cases (that are amenable to HMC) where the sparseness pattern is not fixed. But there are cases where the sparsity pattern is not obvious, such as the matrix product of two sparse matrices. For the data block and parameters blocks, it seems like we have to specify the sparsity pattern, but for the other blocks it would be great if we could just let Eigen figure it out.

Dan Simpson would be a perfect person to tell about beauty of sparse matrices for Markov Random Fields etc., but it seems that he has no useful internet connection at Mongolia, so we have to wait him to get back.

Let’s resurrect this thread (because apparently I think about this every eight months).

Use cases and where we’re at
Now that @mitzimorris’s got the basic ICAR models working the first use case would be to re-implement the sparse inner product as a function native in Stan to reduce the size of the expression tree.

The next step would be to implement a multi_normal_prec that takes a sparse precision matrix. This matrix would, in general, depend on a parameter, so it would need a sparse specialisation of the appropriate reverse mode functions.

My job for this week is to write something similar (actually the helper function for forward mode) so that we can lose an INLA dependency in the ICAR case study. So I’m happy to have a shot at these.

This would allow @imadmali to implement the Leroux model (not La Roux as I typed before I caught myself). It will also let Wakefield do a bunch of mmodels that he’s currently hand-rolling MCMC for.

The precision matrices in these cases look like Q = \sum_{i=1}^k \phi_i Q_i for fixed sparse matrices Q_i and potentially unknown parameters \phi_i.

Some interesting challenges

(I’m going to assume through all of this that, while it will be exposed, the initial users will be rstanarm, as you’ll see below that these might be a beast

From previous conversations, and now Mitzi’s experience, we can implement this without a specialised type by passing some vectors that define the sparse matrix. It’s not pretty, but it might be the easiest place to start. This would be similar to replacing the matrix type with 1 array and 2 (fixed) dimension parameters

The only difference between dense and sparse matrices is that typically you reorder sparse matrices rather than use them in their given order. This is because it reduces the computational cost of the algorithms massively. This means that the cholesky decomposition Q = LL’ no longer has lower-triangular matrix L. Instead L is a permutation of a lower triangular matrix. Obviously we need to know this permutation to do solves etc. Thankfully this can be fixed throughout the problem, so maybe a function in the transformed data block to compute the permutation from the sparsity pattern would would be enough to start with?

If we make a proper sparse matrix class, we can do a few optimisation (as well as making a lot more pleasing syntactically). In particular, we can do the reordering in place and hold onto a symbolic factorisation, which is the first phase of a sparse cholesky that’s common to any cholesky factorisation with the same sparsity structure. Given that we’ll be doing a few thousand of these suckers, the saving will be real.

If you dig into Eigen’s SimplicialCholeskyBase you can see that they’re storing the pattern and the elimination tree (ie the symbolic factorisation). As these things are not parameter dependent, it seems to me that they should be kept. Does this fit with the logic of Stan? (ie you instantiate a sparseMatrix and it computes these two things and keeps them until the object is destroyed?)

If you want to talk further, let me know. I’m also always happy to go south of the border to have a chat…

Are you suggesting implementing a different multi_normal_prec for each form of parametric sparse precision matrix? Otherwise, wouldn’t we just autodiff through the parametric specification?

That sounds super cool. Ben was urging us in the past to keep the factorization of a matrix cached, but we never figured out how to do that. The entire function/object design is stateless. The problem is that if you let users construct that matrix, how do you update the cached solutions as the values change? You’d need something stateful that kept track of whether the factorization is up to date. Not impossible. Especially if the sparse matrix class has a fixed sparsity pattern known at compile time. We’d need something like that for parameters anyway, but not in general.

Have you looked at how Eigen implements its SparseMatrix class? (Or possibly its SimplicicalLDLT class). It keeps an indicator of whether or not the sparse matrix has a reordering, symbolic factorisation, and numerical factorisation as well as the appropriate things.

The operations for the normal lpdf are the sparse specialisation of the inner product x’Qx and the log determinant sum_i(log(L[i,i])). So, unless I’m missing my understanding, we need a reverse mode autodiff of the sparse cholesky and the sparse inner product for this to work.

I think that’s right. But that only autodiffs the normal, not through any parametric construction of the precision matrix. I think we’re on the same page.

Yes I think so. The big challenge here is the implementation of the autodiff for the cholesky. It’s easy to do inefficiently (actually you can just use a variant of the old Smith/Giles code Rob implemented). It’s more delicate to actually do it properly. The engineering challenge is that it’s random access to L[i,j] is very expensive, so you’ve got to iterate through carefully. This is already done for computing the Cholesky in the SimplicalLDLT_impl class, so all the bits needed to do it are there, it’s just taking the time to understand them and put them together.

We can do y = Qx for sparse Q and treat y as dense. At that point you can do x'y to get the value and you have 2y as the derivatives with respect to x. So, I am not sure about the need for a sparse vector product, but yeah we would have to do some sparse factorization of Q to get its log determinant.

So @imadmali tried this out last week. It was slow. @rtrangucci suggested that it was a combination of:

The copy that occurs when the transpose happens

The lack of specialised autodiff for the matrix-vector product.

The second point can be fixed (I mean, it’s pretty easy!), but the first is always there, i.e. dot_product(x,y) is always faster than x' * y. So there’s a solid case for some quadratic form code.

Ok. There’s one more thing that I didn’t think of about this. The basic problem is that at compile time, we don’t have easy access to the number of non-zero elements in the Cholesky factorisation and hence we know that we need a vector of vars but we don’t know how long that vector will be.

I don’t understand all of this enough to know if it will be a problem. Will we be saved by the fact that this vector of vars represents one node on the expression tree, or do we still need to know how many there are? Any thoughts @Bob_Carpenter?

If the latter is true, it’s easy enough to compute the number of vars needed using a separate program. Would this be ok?

Yes. That will be much faster (and is basically what I want to push through the language). Mainly because sparse matrix-vector product, and sparse quadratic form should be easy enough for even me to do the whole thing on relatively quickly.

Then we’re just a sparse cholesky (and an internal type) away from world domination.

@anon75146577 Yeah. It went something like pairwise difference was faster than the Cholesky decomposition version dot_self(csr(L*phi)) - e * dot_self(phi) which was moderately faster than dot_product(phi, csr(Q,phi)).

What function do you want specifically? A sparse version of quad_form(y, Sigma) = y' * Sigma * y?

Yes, this will be a problem. Every Stan variable other than function arguments is declared with a static size. This is one of the serious issues with sparse arithmetic and the current Stan design.

Then if it’s so critical, we should work on speeding it up. Is it slow in the way that all sparse matrix arithmetic is slow compared to dense arithmetic, or is it written in such a way that we could optimize somehow?

If there’s one in Eigen and you can figure out how to pass it arguments using our current data types, it should be relatively easy to add. It’ll be faster if there’s a decent algorithm for the gradients.

This isn’t hard to do. Ben’s got partial code in the other thread. Passing it to eigen and actually coding a reverse autodiff ffor it isn’t hard and should make a difference.

If we’re happy with building row-major sparse matrices (not the Eigen default) it should be possible without changing the function signature.

The problem is that the number of non-zeros in the cholesky factor (ie the number of vars) is not easy to get at compile time (you have to write a different program to compute it). But once it’s computed it’s fixed.

This is only a problem for the Cholesky.

One possible way though this might be to code up the derivatives of log_det(sparse)' andsolve( chol(Q)’, vector)’ directly, which might be possible without computing the intimidate derivatives. (Although I’m not sure).

In that case, we’d just not specialise chol(Q) to vars.

OK, that may not be so bad. When I said all the sizes need to be declared statically, they won’t actually be instantiated to actual numbers until you see the data. So if that sizing calculation can happen with a Stan function call, it should be possible.