Aki and I have put together a fairly detailed proposal for incorporating sparse matrix algebra into Stan. This includes a use case, the algebra, implementation details, and some thoughts about language specification.

This also includes a proposal for speeding up lmer-style models, as well as facilitating approximate inference for glmer-style models (see also this thread for more on that).

We’ve also including some thoughts on how development on GPs could go forward to both speed up the computations and lessen the burden on users (GPs are not particularly easy to work with at the moment!)

sparse_matrix<int rows, int columns, int[] i_vector, int[] j_vector> A;

I suggest that sparse matrices have two phases, a construction and use phase. In the construction phase, you can change the sparsity structure by assigning to coefficients or setting blocks or whatever. In the use phase, you cannot change the structure.

Because of the way Stan is designed, we can’t “tie” a reordering to a matrix and have it persist through
sampling.

Time to revisit this design decision? Sorting is costly. I forget which, but one of the sparse Cholesky libraries allows for precomputation of the best sort order. It would be a pity not to take advantage. Also, lots of sanity checks could be saved if we remembered that things worked fine on the first attempt.

My understanding (from this thread, but also from discussions with Bob) is that this would be a stunningly large re-do of the way Stan works and would require a very good reason. Basically, we have to eat some inconvenience somewhere along the pipeline, and the way things are makes it much easier to do sampling/autodiff.

I think that would need to happen in the transformed data or transformed parameter block. For the former, it’s possible, but for the latter I can’t see a way to ensure that the order of the values is the same each time (or that the sparsity structure doesn’t change). Again, we can’t access the previous sparsity structure at the current iteration to check it’s the same.

I think the suggested interface is minimal - we need to know the shape of the matrix, where the non-zeros are, and what the values are. (I actually didn’t think through an assignment operation, so suggestions are welcome).

The i_vector and j_vector can be easily accessed from teh sparse matrix format in R (and from most other common interfaces).

I suppose, but that’s moving Stan to be an intermediate language that nobody will want to write directly. I guess that’s okay as long as there are higher level languages built on top of it.

Just for perspective, this is one data type useful for fast computation on fairly advanced statistical models. It’s not the end of Stan as a usable language.

Edit: I just checked and this is how you construct sparse matrices in R and Matlab (I,j,value). And, as said before, it satisfies some Stan-specific constraints about keeping the sparisty pattern fixed (and hence storing it as data)

Can we do sparse_matrix<int rows, int columns, int[] i_vector, int[] j_vector> A = values;, where values is a vector?

Another choice — assuming we have tuples before we have sparse matrices — is to form an array of tuples that each have i, j, and a value to go into the i-th row and j-th column of the matrix. Then use that is a set of triplets to fill the sparse matrix.

No, other than reading the Eigen documentation about once a year. In this case it says “The simplest way to create a sparse matrix while guaranteeing good performance is thus to first build a list of so-called triplets, and then convert it to a SparseMatrix.” But it then goes on to say “In some cases, however, slightly higher performance, and lower memory consumption can be reached by directly inserting the non-zeros into the destination matrix.” On the other hand, there are some additional steps that are necessary for the second method to have better performance.

How do you propose to code the phases in the Stan language?

We are. But as Dan responded, this is a foundational assumption in the way Stan’s built that percolates through all the interfaces. Data is persistent and immutable across sampling, anything parameter-related is volatile. Insofar as we can code up things like ordering and whatnot as data, it can persist.

If we can’t, then we need some new thinking on the language side.

For GPUs and MPI the plan is to specify data-only values that are thus guaranteed to persist and can be pinned to CPUs and GPUs once and for all. The sparse matrices issues may be similar, it may just be clunky.

Yes, but it comes at a cost, and in my estimation so far, it’d be even more of a pity to break one of the fundamental encapsulation and immutability invariants on our model class.

Assignment would only work with matching sparsity. Function arguments, being temporaries, would be an exception—they wouldn’t need to be declared for sparsity.

I hope not, too. But it’s not going to be clean and elegant. Sparse matrices never are.

Funnily enough, that’s what people have said about Stan from the get go because of its variable declaration and block restrictiveness and verbosity. Ditto the ODE solvers, which are also super clunky in terms of the way arguments are specified.

In practice, people will use things if the gain outweighs the pain. Things like the ODE solvers, GPs, and sparse matrices are going to set a higher bar in terms of programming than simple GLMs.

We could. It’d look like this:

int M;
int N;
int K; // number non-empty
int is[K];
int js[K];
vectorN] vals;
...
sparse_matrix<M, N, is, js> A = vals;

I don’t like this, though—we don’t allow implicit reshaping for other types, such as assigning a vector to a matrix. I think it ges confusing as to where it would apply and how the rehsaping works. I’d be happier keeping the types simpler. This more verbose form is OK

sparse_matrix<M, N, is, js> A = to_sparse_matrix(M, N, is, js, vals);

where the right-hand side sparse_matrix is a sparse matrix building function. That would probaby entail an extra copy, which is unfortunate.

Definitely not as efficient as you need to calculate all the CRS coefficients to store internally. That’s not so onerous, though, as everything in the CRS structure (presumably what they’re using internally) can be sized ahead of time and computed on the fly. Transposition from CRS is painful, though, so again, it’s just a matter of which operations run at which speed.

I did a csr_matrix_times_vector for a (representation of a) 1000x1000 sparse matrix with 25% ones and 75% zeros multiplied by a vector of 1000 random normal variates as doubles. Doing that 100 times took 24.720 seconds with one core.

I did the “same” thing but using this function body

Eigen::Map<const Eigen::SparseMatrix<T1,Eigen::RowMajor> >
sm(m, n, w.size(), &u[0], &v[0], &w[0]);
return sm * b;

and it took 0.394 seconds. So, I am now going to try to figure out how to circumvent the autodiff when b is in vars, but stuff like this has the potential to make a lot of things faster if we don’t devolve into a multi-year discussion on the best way to implement it.

The whole thing I was hoping to get across in this doc is that it’s not that hard to put this into math. It’s just wrapping eigen’s functions and writing two more (one is easier than the other).

The hard bit is making it usable to “ordinary” users. So all the work is in the stan language. And for the most part that reduces down to “how do we declare this”.

The derivatives themselves even I can figure out but figuring out how to link them into the expression tree is the hard part when each element on the left-hand side is a child of an unknown number of elements of b.

I don’t understand the problem. Giles gives us adjb = A.transpose() * adjAb with a similar expression for adjA. Is there any reason this can’t be programmed up exactly like the autodiff for the dense matrix-vector product?

(EDIT: I forgot the usual warning that I have a limited understanding of AD, so please tell me if I’ve misunderstood the scope of the problem!)

Modulo inevitable typos (I’m not finished yet. Will put in a PR eventually) the chain() method looks like

virtual void chain() {
using Eigen::MatrixXd;
using Eigen::Map;
typedef Eigen::SparseMatrix<double,Eigen::RowMajor>::InnerIterator Inner_iterator;
MatrixXd adjAb(A_rows_, 1);
Map< Eigen::SparseMatrix<double, Eigen::RowMajor> > adjA(m, n, A_nnz,&u_[0],&v_[0],&wd_[0]); //Initialise to A
MatrixXd adjb(A_cols_, b_cols_);
for (size_type i = 0; i < adjAB.size(); ++i)
adjAB(i) = variRefAB_[i]->adj_;
// adjA = adjAB * t(b)
// adjAb[i,j] = adjAB[i]*b[j]
Map<Eigen::SparseMatrix<double, Eigen::RowMajor> > A(m, n, A_nnz,&u_[0],&v_[0],&wd_[0]);
for (size_type i = 0; i < A_rows_; ++i) {
for ( std::pair<Inner_iterator,Inner_iterator> it(const_Inner_iterator it_A(A,i), Inner_iterator it_adjA(adjA,i)); it.first(); ++it.first(),++it.second()) {
it.first().value() = adjAB[it.first().row()]*b[it.first().col()];
}
}
adjB = Map<const Eigen::SparseMatrix<double, Eigen::RowMajor> > Q(m, n, A_nnz,&u_[0],&v_[0],&wd_[0]).transpose()
* adjAB;
size_type i=0;
for (size_type row = 0; i < A_rows_; ++row)
for (Inner_iterator it(adjA,row); it;++it) {
variRefA_[i]->adj_ += it.value();
++i;
}
for (size_type i = 0; i < B_size_; ++i)
variRefB_[i]->adj_ += adjB(i);
}

Can we just implement csr_matrix_times_vector this way now? Or implement a new function that takes the m, n, k, u, v, and w to represent the sparse matrix. It’ll circumvent any language discussion.

This was always doable and will be done once I finish this grant application. (Sorry I didn’t make that clear) But this is the minimal use of sparse matrices. We still need to have the language discussion if we’re going to manipulate them in-language, speed up lmer, or make them user friendly.

You could, but a lot of the time spent in csr_matrix_times_vector is on checking validity of v and u, which don’t change. So, rstanarm needs a version without all that. If there were a way to construct a sparse_matrix in data or transformed data, then in each leapfrog step, you could just check that the matrix was conformable before multiplying by a vector.

w is actually not checked except for its size being the same as that of v. But it checks that every element of v — which is already restricted to be an int[] is in the allowable range.