Should sparse matrices in Stan be row major or column major


#21

I don’t think so.

I did profile it and Bob and I got as far as figuring out it was all addition/multiplication taking all the time and left it at that… of course then there were a bunch of code changes during review and I’m not sure if hte final timing was the same. I’m looking forward to seeing what you can get out of it.

That’s interesting, I think we just got that for free b/c templating magic (I mean we noticed, but never exercised it outside of tests). It should be “straightforward”… :P


#22

“Straightforward is the enemy of good” :p


#23

Yeah, for me straightforward means I have to carve out a chunk of time with pencil and paper.


#24

To me, straightforward = hubris

(thinking back on all the things i’ve thought of as straightforward since i started thinking about this)


#25

Not so much a grey area as different representations are better for different operations in both sparse and dense matrices. If you want to do A * B for dense matrices, it’s best to have A in row-major and B in column-major order. It’s usually worth transposing A once to do that—Eigen figures that out with metaprogramming and size traits.

These are huge differences for dense matrices.

No. The branches do so much one thing wouldn’t suffice. There are some upstream regression performance tests, but those only help if they use your operations. They’re mainly there to catch glaring errors in the samplers or I/O from sneaking through.

This is true in general in Stan now. Every continuous argument needs derivatives, but can have constans supplied. ODE solver and algebraic equation solver are a bit different in that they have some data-only arguments.

Straightforward is what you think things are after you’ve understood the textbooks but before you’ve actually tried to put them into practice in the real world.


#26

Good point, I wasn’t thinking of memory locality, just that there were extra calculations need if you had the wrong sparse structure in addition to any memory locality. Apparently “grey area” means “I haven’t thought it through” :p


#27

Probably need more than just csr and csc matrices as well. Constructing matrices in both of those formats is horrendously slow (cause random access is so hard in both). I think the easiest thing to do is have another matrix format that is basically a hashmap, so you build matrices with that, bake them into either a csc or csr matrix, and do whatever you want. (edit: this is what Scipy does – not that their sparse library is great)

I’m guessing the biggest candidate for the sparse matrices (other than regressions, where they’re really nice) are GPs. If your covariance matrix is parameterized, you’ll have to rebuild it at each HMC step, so that rebuild’ll have to be fast.

I’m guessing the underlying linear algebra library is gonna be Eigen? Are there any other options?

Here are some other questions to throw on the heap:

  1. Should matrix multiplication be aware of sparse matrices? If A and B are appropriately sized sparse matrices, should A * B work? Or should it all go through special functions? Can A and B be different storage types?

  2. How does this transfer to the fancier algorithms? Sparse eigensolves are a lot different than dense eigensolves? Sparse solves are different than dense solves. Are there separate functions for each storage format?

  3. I guess both of those previous questions boil down to, are the sparse matrix formats extra Stan types (like csr_matrix or csc_matrix)? Or do they look like regular Stan matrices with flags attached? Or are they raw array formats (like the current sparse stuff)?

Does anyone know of a great sparse linear algebra library to look up to for this? The sparse stuff I’ve seen always seems really scattered. Scipy’s implementation is a case in point: https://docs.scipy.org/doc/scipy/reference/sparse.html . Using Scipy sparse feels like you’re writing in C. Is there a nice R one or anything?

I think MATLAB’s is a little more coherent. They try to, as much as possible, make it look like dense stuff, but I dunno how well it succeeds. I’m not really a big MATLAB user.

Hopefully I haven’t said anything too wrong, my experience here is sparse.


#28

So the sparsity pattern will need to be fixed which should let you pre-calculate the pattern and do writes that are continuous in memory but definitely A+ on providing a simple format that’s in use elsewhere for constructing stuff. We’ll need to write interface-specific IO so we can go from the R/Python native formats into ours.

We do eigen for everything else so unless there are really serious issues that’s it.

This is why we haven’t tackled it before: it means re-writing everything to work with sparse or allowing things to go through some sort of (slow) adapter but long-term I think the goal should be for it to work transparently (so A * B should just work).

That’s completely unnecessary with C++ so we should just make it transparent (and fail to compile until it is so people don’t write code that fails at runtime.

Well, internally everything is easier if they are separate types and externally we want people to be aware of what they’re doing so I think they should be a separate type in the Stan language as well. Otherwise there’s too much potential for confusion during the (long) transition while operations work only for some functions.

To echo myself let’s go with Eigen unless there’s something magically better just b/c we don’t want more DevOps/installation issues. Long before we get there we probably need to flesh out interface/io more.


#29

From @rtrangucci:

Row major may be important in eigen going forward.

https://eigen.tuxfamily.org/dox/TopicMultiThreading.html


#30

So the sparsity pattern will need to be fixed which should let you pre-calculate the pattern and do writes that are continuous in memory

I could see this working really nicely. How I do sparse design matrices now is I pass them in as big dense ones and then sparsify them in transformed data. I like it.

Well, internally everything is easier if they are separate types and externally we want people to be aware of what they’re doing so I think they should be a separate type in the Stan language as well.

I guess the other issue here is that there are lot of different, useful sparse matrix types that are at a higher level than the storage formats. Banded matrices vs. blocked in some way etc. etc. Personally I don’t like the cov_matrix type right now. It’s weird, and it makes me nervous I’m gonna mess up my performance (I think there’s a thread around here that Stan checks to make sure these matrices stay positive definite, which would be really hard). I like errors about Math assumptions being thrown at the solver level where possible.

I also don’t like the idea of a matrix type sometimes magically being sparse. Maybe I’m being inconsistent here.

I like:

csr_matrix, csc_matrix, matrix, etc.

I do not like:

banded_matrix, symmetric_matrix, etc.

To echo myself let’s go with Eigen

Seems like a fair plan. I never messed with that library till I ran into Stan, but I really like the dense stuff.

Long before we get there we probably need to flesh out interface

My gut feeling is that people mostly fail at sparse interfaces. So I think copying someone else’s design is best. Eigen seems cool, but I haven’t worked with their Sparse stuff. I could make a point of familiarizing myself with it though if that’d be valuable.


#31

Yeah, I guess that’s the result of the dominance of matrix * vector in sparse algorithms.

Is there anything outside of GPs and regressions though of interest to Stan? Or are these the two big fish?

Are the solves in multi_normal GPs still done with decompositions? Like a complete sparse cholesky? Or is it a preconditioner + a regular iterative solve?


#32
  • Eigen has a perfectly workable sparse matrix interface for what we’re likely to use (it doesn’t do parallelism very well, but neither does Stan).

  • I don’t think the user should know anything about the storage format (except for there being a few different constructors), so it should be sparse_matrix not csr_matrix

  • The function csr_matrix_times_vector should be depreciated in favour of this

  • Any operation that only affects the values (but not the i,j positions) should be allowed. That’s basically addition, scalar multiplication, and diagonal multiplication

  • Sparse-sparse multiplication is anathema to the Stan paradigm (because it would change the number of vars needed as the sparsity pattern changes)

  • banded_matrix would be useful for ARMA-type models

  • sparse_symmetric_matrix and sparse_symmetric_positive_definite_matrix types would be very useful for carrying around certain operations

  • I’d argue the latter is necessary because we need to know if someone is able to run a cholesky factor on it at compile time (to work out the number of vars that would be needed). I do not think that you should check positive definiteness anywhere except where it’s critically used (eg in a multi_normal call). In these cases, you’re already doing the expensive operation.

  • I think the declaration would look something like

template class sparse_matrix : public Eigen::SparseMatrix<T1, Eigen::RowMajor> {

private:
bool is_symmetric;
bool is_postive_definite;
// a bunch of things keeping reorderings, symbolic factorisations, etc
}


#33

I don’t think the user should know anything about the storage format (except for there being a few different constructors), so it should be sparse_matrix not csr_matrix

I’ll just leave it at I don’t agree. Maybe I’ll change my mind (this could be a Céline song, but I Google’d it, and alas, it’s not). but I’m not convinced of the advantages. There’s still the issue of building the matrices, as random access for a generic sparse_matrix class is probably complicated unless I’m missing some fancy trick.

I’d argue the latter is necessary because we need to know if someone is able to run a cholesky factor on it at compile time (to work out the number of vars that would be needed).

Oh I see. If we’re actually talking about a matrix of random variables defined in the parameters block, one of them would have many fewer parameters of the other.

The number of vars is a runtime thing though. The autodiff graph is build dynamically, so I don’t think this factors in here.

From the Stan manual:

For instance, a variable declared to be real<lower=0,upper=1> could be assigned to a variable declared as real and vice-versa. Similarly, a variable declared as matrix[3, 3] may be assigned to a variable declared as cov_matrix[3] or cholesky_factor_cov[3], and vice-versa. Checks are carried out at the end of each relevant block of statements to ensure constraints are enforced.

I’m only into checking simple constraints (like symmetry). I’m against the positive definiteness checks. This is why I’d be against those custom types, because as Stan currently does things, it checks (I think), and I don’t like that in my models.

I could go for fancy banded_matrix/symmetric_matrix types and such if it meant avoiding functions like multi_normal_banded and multi_normal_symmetric or banded_matrix_times_vector and such. But these checks!

Sparse-sparse multiplication is anathema to the Stan paradigm (because it would change the number of vars needed as the sparsity pattern changes)

Fair enough. Probably there’s no real need for sparse matrix * matrix multiplication. (mildly related, here’s an interesting model: Nystrom approximation slows down Gaussian process regression?)


#34

Instead of GPs, GMRF’s are more likely as they have fixed sparsity structure for the precision matrix. With compact support covariance functions the GP covariance matrix can be sparse, but it’s sparsity structure depends on the lengthscale which would be usually an unknown parameter and it seems the changing sparsity structure is difficult for Stan.


#35

So that’s the constructor. I don’t think it’s useful to expose the internals to the user. Most use cases shouldn’t need random access. Or, to put it differently, can you think of one that does?

There are certain things that you really need to carry forward internally to do something like a sparse cholesky (if you want them to be continuous!) that are only relevant to positive definite matrices.

The checking is necessary for any function that relies on positive definiteness, but I do agree that if you do A + B you shouldn’t check positive defniteness.

Yeah - these aren’t great models. I’m not morally opposed to them or anything, but they’re quite hard to use properly and the Nk^2 cost makes them inappropriate for big data.


#36

Or, to put it differently, can you think of one that does?

Other than the construction, nah, I just think that’s important.

It’s really nice to build a sparse matrix like:

sparse_matrix[N, N] m;
for(i in 1:N)
  for(j in 1:N) {
    if(f(i, j) < 1.0) {
      m[i, j] = something;
    }
  }

vs. remembering how csr/csc work and building those arrays manually. Also since Stan doesn’t have dynamically sized types writing that loop out like this:

int i_s[M];
int j_s[M];
real v_s[M];
for(i in 1:N)
  for(j in 1:N) {
    if(f(i, j) < 1.0) {
      i_s[m] = i;
      j_s[m] = j;
      v_s[m] = something;
      m = m + 1;
    }
  }

Is annoying and would require computing M somehow (which isn’t always easy given how declarations need to be above code in Stan models). I guess building csr/csc directly would be something similar to that.

I guess folks could pretty easily mess themselves up and change the structure of the matrix somehow, but I dunno, seems weird to me. Folks can already easily do this with the dense types if they wanted to. Can’t guard them from everything.

sparsity structure depends on the lengthscale which would be usually an unknown parameter and it seems the changing sparsity structure is difficult for Stan.

Woops, I was thinking about doing the cutoff before scaling by the inverse length scale, but that sounds wrong now that I think about it.

Yeah - these aren’t great models.

I meant the guy managed to figure out how to do random subsampling of a matrix in Stan. Maximum subversion :D.


#37

I’m not sure where we stand on row-major vs. column-major for Stan’s functions, though I’m not sure it matters so much given we won’t be exposing a type to users. For the sparse_multiply(sparse A, dense y), this doc suggests to me that internally Stan should build a row-major matrix because it allows users to use a multithreaded algorithm for the double multiplication that will undoubtedly be happening internally:

Currently, the following algorithms can make use of multi-threading:

  • general dense matrix - matrix products
  • PartialPivLU
  • row-major-sparse * dense vector/matrix products
  • ConjugateGradient with Lower|Upper as the UpLo template parameter.
  • BiCGSTAB with a row-major sparse matrix format.
  • LeastSquaresConjugateGradient

#38

There would need to be a constructor (probably more than one)

Something like

transformed_data {
sparse_matrix Q1 = make_sparse_matrix_csr(n,m,u,v,w);
sparse_matrix Q2 = make_sparse_matrix_csc(n,m,u2,v2,w2);
sparse_matrix Q3 = make_sparse_matrix_ijvalue(n,m,i,j,value);
}


#39

Because the adjoint requires A’*y, this removes a lot of the benefit. (Basically, do we evaluate lp or grad(lp) more often?

Also I’m not sure what the relative speed of the row-major vs column-major cholesky is.


#40

Good point (you mean A’ * \bar{C}, right?). If transposing a row-major matrix precludes the use of MPI, then it won’t matter. But if MPI can still be used for A’ * \bar{C} then it might still make sense to use row-major.

Me neither :/