Sparse matrix roadmap

I love how big our community’s gotten :-)

What’s “TWG”?

The discussions got stalled when we were trying to decide whether to

  1. add a type to the language, or

  2. continue adding simple functions.

There were a lot of proposals for adding the type, and it didn’t seem like a single design would work everywhere.

Now that the language refactor has been merged, it’s going to be possible to consider adding the type seriously—as in a month rather than in a year. We just need a design that includes the technical spec of how data’s being stored and how the I/O is going to work in var_context. @syclik covers all this in his response.

The adj_jacobian_apply may help. One advantage is that precomputed gradients are usually smaller than what brute-force autodiff would build.

I’m also not sure how that return’s supposed to work with sm and b having potentially different scalar types (T1 for sm and T2 for b)—is Eigen’s sparse matrix library up to the mixed types?

Ah, finally figured out it’s a decision-making body, not some matrix API! I’m too impatient to work out all the acronyms.

I want two things for designs this big:

  1. Functional spec. This is what the change looks like to the user. This will be how user-facing sparse matrices will be declared and how their I/O gets handled in our dump format. It will also explain how they’re used—are there functions that operate on them or what? Do they interoperate with dense matrices or do we have conversion functions? How can they be built conveniently?

  2. Technical spec. This is how it gets implemented. For instance, this might just be that we use an Eigen type for sparse matrices the same way we use Eigen types for matrices.

I would like to strongly discourage the direct coding of something this big without getting buy-in on the spec first.

Asks the chair of the technical working group (TWG) 😄

Is there an example? Maybe for Tuples, which I think is the last big feature that is “almost” in. Or MPI?

It is difficult to hit a moving target. (Although that’s basically what this thread will hopefully do - fix the target!)

I still do not know what var_context is. Is there a wiki page? Or an example?

A single design would work. Eigen::SparseMatrix can do most of the hard work for us in the math library. It needs to be column-major storage to be consistent with the rest of Stan (and also to avoid having to constantly override Eigen’s defaults). The reluctance to use a class to actually store anything substantial makes the design easier (although there will be an efficiency hit, as we will have to do a lot of redundant calculations).

Casting sparse matrices to dense and dealing with mixed types shouldn’t be a problem. I think Eigen can deal with mixed types, but if it can’t we can specialize. (I mean, we’ll have to do all the derivatives anyway, so we might as well do the prim specializations if they’re needed.)

We’d essentially pass 3 vectors of different lengths and 2 scalars to define a sparse matrix foo (there’s a small sketch of what those would hold just after the two options below). For the functional spec, there are basically two options:

  1. Details hidden from user: Automatically generate consistent names, so declaring sm foo expects foo_value__ (double or var vector), foo_inner__, foo_outer__ (int vectors), and foo_nrow__, foo_ncol__ (ints) to be present. Obviously it would be nice to bundle these up in the interfaces (so PyStan or RStan can just pass a sparse matrix in its internal format). In some sense, this is consistent with how Python or R do sparse matrices.
  2. Users need to know how sparse matrices are stored: They are responsible for manipulating these vectors directly and naming them.
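For concreteness, here’s a tiny sketch (C++, with made-up values) of what those three vectors and two scalars would hold for a small matrix in compressed column storage, using option 1’s naming convention:

#include <vector>

// Sketch only: a 3 x 4 matrix with 4 non-zeros, stored column-major (CSC).
//
//     [ 1 0 0 5 ]
//     [ 0 2 0 0 ]
//     [ 0 0 3 0 ]
int foo_nrow__ = 3;
int foo_ncol__ = 4;
std::vector<double> foo_value__ = {1.0, 2.0, 3.0, 5.0};  // non-zeros, column by column
std::vector<int>    foo_inner__ = {0, 1, 2, 0};          // row index of each stored value
std::vector<int>    foo_outer__ = {0, 1, 2, 3, 4};       // column start offsets (n_col + 1 entries in standard CSC)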

A challenging challenge

(This is long and there is a question at the end. The tl;dr is that if the various Stan libraries can’t be content with the information that there is a sparse matrix then we might have some problems. Even needing to know how many non-zero elements there are will be problematic.)

If we have a fully operative sparse matrix type, we might need to change the sparsity structure “on the fly”. For instance, the transformed parameter block might concatenate two sparse matrices, or multiply them (which changes the sparsity structure). Now, the structure (and hence the number of vars stored in foo_value__) can be computed as transformed data as long as we only let people build sparse matrices of parameters from existing sparse matrices of parameters through addition, multiplication or concatenation…

A problem would occur if there was a branching statement where the two branches constructed sparse matrices with different sparsity structures. The below example is extreme (and probably not differentiable) but I can think of situations where this would be continuous.

transformed parameters {
  if (/*parameter dependent condition*/) {
    sm C = A*B;
  } else {
    sm C = A+B;
  }
}

Because the sparsity patterns of A*B and A+B can be very different, there’s no way to know from iteration to iteration how many vars would be in C_value__. If Stan thinks of the sparse matrix as being the vector C_value__ (+ side info) then this is a situation where we don’t know at run time how many parameters are in the model.

Mathematically C is one single object, so if you think of it abstractly we just have a lot of “phantom vars” that are always zero. So it doesn’t break HMC to have the sparsity structure of C change with parameters \theta as long as C(\theta) is a continuous function of \theta.

This suggests to me that a sparse matrix implementation should really be an encapsulated object so that Stan only thinks of it as one object that has a function to access nrow*ncol entries (even though most of these are zero). So a sparse matrix implementation should allow someone interrogating the output of a zero element to just get a chain of zeros.
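For what it’s worth, Eigen’s sparse matrices already behave this way at the C++ level: reading a structurally-zero entry just returns zero. A minimal sketch (not a proposal, just existing Eigen behaviour):

#include <Eigen/Sparse>
#include <iostream>

int main() {
  Eigen::SparseMatrix<double> C(3, 3);
  C.insert(0, 0) = 2.5;                // the only stored entry
  C.makeCompressed();

  // coeff() is defined for every (i, j); structural zeros read as 0.0,
  // which matches the "phantom entries" view described above.
  std::cout << C.coeff(0, 0) << "\n";  // 2.5
  std::cout << C.coeff(2, 1) << "\n";  // 0.0 (not stored)
}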

The easiest way around this problem would be to just not let the user do things like this. But we do need to be able to add or multiply or block together sparse matrices. Otherwise sparse matrices will be very limited. And absent a linter, I can’t see how we can stop people writing that sort of branching code because locally everything is perfectly fine.

Question to @Bob_Carpenter (as language guru) and @betanalpha (because it’s really only the algorithm that should care about this): Is it possible for Stan to treat a sparse matrix type as a single atomic thing rather than as an array of scalar types? If it can, then the stuff I just wrote will be fixable. If it can’t, someone is going to need to work out what we can and can’t have. (Also @betanalpha because the adaptation will probably have some opinions about this…)

An update for everyone: I have spoken with @betanalpha and the sparsity structure has to be fixed for any sparse matrix that is a parameter or transformed parameter. This is due to some perfectly sensible assumptions that are used extensively in both the algorithm and the I/O (basically that the state can be described as a vector of fixed length).

So what does this mean for sparse matrices? Well, it makes them more difficult in the language than anything else (so I’ll tag @Bob_Carpenter here because he might have opinions).

Sparse matrices as data
I see two options for constructing sparse matrices in the data block.

Option 1: From a user interpretability standpoint, the following would probably be the easiest. It specifies a sparse matrix as an (i,j,value) triple.

data {
  int n_row;
  int n_col;
  int nnz;
  int i[nnz];
  int j[nnz];
  real value[nnz];
  sparse_matrix<n_row, n_col, i, j, value> A;
}

or Option 2: This makes more sense from an “internals” standpoint. It would specify the matrix in compressed column storage

data {
  int n_row;
  int n_col;
  int nnz;
  int inner_index[nnz];
  int outer_index[n_col];
  real value[nnz];
  sparse_matrix<n_row, n_col, inner_index, outer_index, value> A;
}

The difference between option 1 and option 2: Option 1 is human readable. Option 2 can be easily coerced into an Eigen sparse matrix using

Eigen::Map<SparseMatrix<T> > A(n_row, n_col, nnz, outer_index, inner_index, value);

Option 2 is also easily produced by Python and R sparse matrix packages (it’s the default internal storage for sparse matrices in R. I am not sure about Python, but it would not be surprising).
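As a sketch of how close to zero-copy that mapping could be (the function name times_vector and the surrounding plumbing are just illustrative; Eigen’s default int storage indices are assumed):

#include <Eigen/Sparse>
#include <vector>

// View the option-2 CSC arrays as an Eigen sparse matrix without copying,
// then use it like any other sparse matrix (here: multiply by a dense vector).
Eigen::VectorXd times_vector(int n_row, int n_col,
                             std::vector<int>& outer_index,
                             std::vector<int>& inner_index,
                             std::vector<double>& value,
                             const Eigen::VectorXd& b) {
  Eigen::Map<Eigen::SparseMatrix<double>> A(
      n_row, n_col, static_cast<int>(value.size()),
      outer_index.data(), inner_index.data(), value.data());
  return A * b;
}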

A small alternative: We could also do something like sparse_matrix<n_row, n_col, inner_index, outer_index> A(value);

A preferred option

data {
  sparse_matrix_structure A_struct;
  sparse_matrix<A_struct> A;
}

where it is assumed that all of the structure information is passed to the Stan call with derived names like A_struct_nnz__ and A_struct_value__ (general pattern of the derived name is [name of sparse_matrix_structure object]_[internal storage name]__).

This would make passing a sparse matrix exactly like passing a dense matrix, which is the ideal from a user perspective. In particular, the user would never need to know about the internals of sparse matrix storage (which is ideal).

This is a higher burden on the I/O, but it does specify a sparse matrix as a series of integers and vectors with a fixed naming convention, which is the easiest way to flatten them out into a text dump.

This also allows users to perform algebraic manipulations on the sparsity structure directly, which broadly speaking seems like a very good idea!

Sparse matrices as transformed data
Here the sparsity structure can be inferred, so something like this would work:

data {
  sparse_matrix_structure A_struct;
  sparse_matrix<A_struct> A;
  sparse_matrix_structure B_struct;
  sparse_matrix<B_struct> B;
}

transformed data {
  sparse_matrix C = A*B;
  sparse_matrix D = A + B;
  sparse_matrix E = 2*A;
}

We will also want to manipulate sparse matrix structures directly (they also have an algebra). So things like this would be good.

data {
  sparse_matrix_structure A_struct;
  sparse_matrix_structure B_struct;
  sparse_matrix<A_struct> A;
  sparse_matrix<B_struct> B;
}

transformed data {
  sparse_matrix_structure C_struct = A_struct*B_struct;
  sparse_matrix_structure D_struct = A_struct + B_struct;

  sparse_matrix E = A*B;
  sparse_matrix_structure E_struct = get_structure(E);
}

where get_structure() returns the structure of a sparse matrix.

Sparse matrices as parameters
Here we need to specify everything but the value, which will be mapped to vars in the C++. An example would be

data {
  int n_row;
  int n_col;
  int nnz;
  int inner_index[nnz];
  int outer_index[n_col];
  real value[nnz];
}
parameters {
    sparse_matrix<n_row, n_col, inner_index, outer_index> A;
}

or the ideal alternative:

data {
  sparse_matrix_structure A_struct;
}
parameters {
    sparse_matrix<A_struct> A;
}

where A_struct tells Stan to expect vectors named A_struct_nnz__ and A_struct_inner_index__ to exist in the input data file.

Sparse matrices as transformed parameters
This is the tricky one. These sparse matrices must have fixed sparsity structures. So we can have things like

data {
 sparse_matrix_structure A_struct;
 sparse_matrix_structure B_struct;
}

transformed data {
 sparse_matrix_structure product_struct = A_struct*B_struct;
}

parameters {
 sparse_matrix<A_struct> A;
 sparse_matrix<B_struct> B;
}

transformed parameters {
 sparse_matrix<product_struct> C;

 C = A*B; // This is ok!
 C = A + B; // This will throw an error if the sparsity structure of A+B isn't the same as A*B
}

Question: Should we allow declarations like sparse_matrix<A_struct * B_struct> C; to avoid needing the transformed data block?

Sparse Matrices as temporary variables in the model block
There is no requirement to know the sparsity pattern beforehand here. So, just like in the transformed data block, we can derive the sparsity structure on the fly. Something like this could work:

data {
  sparse_matrix_structure A_struct;
  sparse_matrix_structure B_struct;
}
parameters {
    sparse_matrix<A_struct> A;
    sparse_matrix<B_struct> B;
}
model {
  sparse_matrix C = A*B; // structure is derived
}

Sparse matrices in generated quantities
Here we need to know the sparsity structure for anything that is output, so in that way it’s a lot like the parameters block.

But! If a temporary matrix does not have a constant sparsity structure, we may still need some functional of that matrix to be generated. That means we would need temporary variables in the generated quantities block to have a flexible sparsity structure, while the output of the block must have fixed sparsity. An example would be

generated quantities {
  vector[n_col] output_variable;
  {  
    sparse_matrix C = A*B;
    output_variable = diagonal_of_inverse_sparse(C);
  }
}

or alternatively (where the sparse matrix isn’t explicitly constructed)

generated quantities {
  vector[n_col] output_variable;
  output_variable = diagonal_of_inverse_sparse(A*B);
}

This might seem like an esoteric use case but it really isn’t. If the sparse matrix with the non-constant sparsity structure is used as the precision matrix of a Gaussian, the diagonal of its inverse gives the marginal variances, which are things people will typically want.

Did I forget anything?
So I think I accidentally wrote a language spec. What is missing?


Specifying a sparse matrix via triplets is not that bad in Eigen
https://eigen.tuxfamily.org/dox/classEigen_1_1SparseMatrix.html#acc35051d698e3973f1de5b9b78dbe345
but we would have to have a way to specify an array of triplets (tuples basically) in Stan.
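For reference, the triplet route on the Eigen side would look roughly like this (a sketch; from_triplets is just an illustrative name):

#include <Eigen/Sparse>
#include <vector>

// Build a sparse matrix from parallel (i, j, value) arrays via Eigen triplets.
Eigen::SparseMatrix<double> from_triplets(int n_row, int n_col,
                                          const std::vector<int>& ii,
                                          const std::vector<int>& jj,
                                          const std::vector<double>& value) {
  std::vector<Eigen::Triplet<double>> trips;
  trips.reserve(value.size());
  for (std::size_t k = 0; k < value.size(); ++k)
    trips.emplace_back(ii[k], jj[k], value[k]);

  Eigen::SparseMatrix<double> A(n_row, n_col);
  A.setFromTriplets(trips.begin(), trips.end());  // duplicates are summed; result is compressed CSC
  return A;
}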

It’s super-hard to find benchmarking on this, but my understanding was that specifying directly from CSC was faster than specifying by triplet.

(Specifying by triplet can also be efficient if you know all the information needed in the CSC and manipulate the internal storage directly so that you basically enter the triplets in the correct order, but I’m not sure there’s any reason to do that given that CSC is the default internal storage in R and one of the options in SciPy).

Right—that’s what I was trying to say above. There are three types of variable declarations in Stan:

  • block variables: constraint + size + shape
  • local variables: size + shape
  • function arguments: shapes

The parameters are block variables. They are assumed to be of fixed size and in a fixed order (that’s actually guaranteed now by how the language restricts size declarations to be data).

Having said that, they can move around in structure by declaring a vector[K] alpha of parameters and then pushing the alpha[k] wherever you need them. But the algorithm is going to treat alpha[k] as a single parameter. So you can’t go permuting them each iteration.

I’d prefer something like your small alternative to Option 2, but keeping to our syntax where stuff between < > is a constraint and sizes are given in brackets, so that we don’t need a different syntax for parameters, function args, etc.

int<lower = 0> M = 5;
int<lower = 0> N;
int<lower = 0> K;
int<lower = 1, upper = M> ii[K];
int<lower = 1, upper = N> jj[K];
sparse_matrix[ii, jj, M, N] x;

I think that’s pretty human readable. Then we could have functions like:

sparse_matrix to_sparse_matrix(int[] ii, int[] jj, int M, int N, vector x);

where all of ii, jj, and x are the same size. The result would be sized as sparse_matrix[ii, jj, M, N]. And presumably we’d allow:

x[i, j] = y;

and throw an error if (i, j) is not valid.

If we had tuples, we could pack ii, jj, M, and N into a tuple<int[], int[], int, int> structure.

And yes, we need something like a way to create a new output structure out of an existing output structure to do assignments. But we should be able to get away without declaring sizes in a compound declare-define.

What you call “temporary variables” are called local variables in all the Stan doc. They are just like block variables, but without constraints.

This would be a very big change in Stan. What do you need it for? The example above could just be written as

generated quantities {
  vector[n_col] output_variable = diagonal_of_inverse_sparse(A * B);
}

Functions like diagonal_of_inverse_sparse do not need their argument or return sizes specified statically.

It would depend on the internal data structure. If the internal data structure is CSC, then specifying by CSC should allow simple memory-based copies that would be optimally efficient. But given that you’re about to perform sparse matrix magic on the result, the copy cost may not be a large fraction of the total cost.

Note: in generated quantities we can change sparsity structure and not worry about vars, since it’s all doubles…

The problem isn’t var, it’s with the way we deal with output.

We can always take a vector and use it to fill different sparsity structures, but each element of the vector will be considered a single parameter by Stan’s posterior analysis tools.

Are you sure this isn’t option 1? Or did you just change the variable name? Because that declaration block looks a lot like option 1 (i,j,value) rather than CSC.

I don’t love this option for a few reasons:

  • I think it’s slower than dealing with the CSC because it’s potentially much further from the internal storage structure (I’ll explain this below)
  • It’s actually hard to get these in R. Typically you’d need something like
# Q is a sparse matrix constructed the ordinary way
Q_ijv = as(Q, "dgTMatrix")
ii = Q_ijv@i
jj = Q_ijv@j
value = Q_ijv@x
  • It makes constructing derived structures a lot harder (and impossible without tuples) compared to a sparse_matrix_structure type that internally wraps down into a few vectors. In particular, you can’t do anything at all without tuples, let alone anything complex like sparse_matrix_structure struct_d = struct_a*struct_b + struct_c; (This is exactly the type of operation people will want to do, so think of these structure manipulations as the most common use case rather than a corner case.)

I didn’t know that. That would be enough!

I wasn’t definite enough there. The internal structure is basically CSC (the non-zero entries are stored in a vector in column major order and there are two indexing vectors). The problem with assigning by triplet is that you can’t guarantee that the insertion will be at the end of the vector. There are internal helper functions that you can use if you can ensure the triplet is in this order that will probably be slightly faster than the map, but in general working from triplets will be slower because they can’t guarantee that new entries aren’t written to a random place in the vector.

Sorry - should’ve been clearer about this.
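To make the ordering point concrete, the incremental-insertion path in Eigen looks something like this (a sketch; fill_incrementally is just an illustrative name). If you reserve space and insert roughly in column-major order it stays cheap; otherwise insert() may have to shift existing entries around in the internal vectors:

#include <Eigen/Sparse>

Eigen::SparseMatrix<double> fill_incrementally(int n_row, int n_col) {
  Eigen::SparseMatrix<double> A(n_row, n_col);
  A.reserve(Eigen::VectorXi::Constant(n_col, 4));  // guess ~4 non-zeros per column
  for (int j = 0; j < n_col; ++j)
    A.insert(j % n_row, j) = 1.0;                  // cheap while room remains in column j
  A.makeCompressed();                              // squeeze back to compressed CSC
  return A;
}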

To summarize
Thanks for your comments Bob. I’m worried that for use cases beyond just declaring sparse matrices, we need a more user-friendly way to manipulate sparsity structures. I’m not sure that the thing I suggested is sensible. Perhaps variadic arguments and tuple return types could allow for things like multiply_struct(structure(A), structure(B)), but it would still make compound statements confusing to implement.

I agree with you there! I’ve just never seen one. The only things I’ve seen are either the triples or something like CSC. Did you have something else in mind?

Not sure how I messed that up, but you’re right.

I proposed one in the post above that defines an algebra over sparsity structures, so struct_A + struct_B would give a structure object that encoded the sparsity structure of A+B

I got that, but how does it get off the ground and how do you do I/O? I thought it was a matter of specifying (i, j, value) triples or something closer to the internal representation.

I proposed a hack that did it through automatically generated internal variable names (from our friendly interfaces) or manually constructed variable names (through CmdStan). Obviously tuples would make this cleaner. (Also, whatever I/O solution is being used for tuples would work.)

Is it perhaps possible to use decltype(A + B) to get the sparsity structure?

In C++ or in Stan? I’m reading Bob’s comments as being about the Stan language.

In C++, I would probably just build a matrix where every non-zero entry is 1.0 for computing the structures.
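Something along these lines, roughly (a sketch; pattern_of is just an illustrative name):

#include <Eigen/Sparse>

// Represent a sparsity structure as a sparse matrix whose stored entries are
// all 1.0; the structure of A*B or A+B is then just the non-zero pattern of
// the same operation applied to the pattern matrices.
Eigen::SparseMatrix<double> pattern_of(const Eigen::SparseMatrix<double>& A) {
  Eigen::SparseMatrix<double> P = A;
  for (int k = 0; k < P.outerSize(); ++k)
    for (Eigen::SparseMatrix<double>::InnerIterator it(P, k); it; ++it)
      it.valueRef() = 1.0;
  return P;
}

// e.g. the structure of a product:
//   Eigen::SparseMatrix<double> C_pattern = pattern_of(A) * pattern_of(B);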

I was thinking C++ but hoping that Stan could use it to compute the structure.