Sparse model matrix functionality without the sparse matrix

While we all wait for tuples and sparse matrix support, I finally had a chance to think about non-language solutions to the problem of sparse model matrices and it turns out Stan has a pretty good one. Everything is wrapped up nicely here:

The motivating perspective is nicely described in a book by James Hodges (Richly Parameterized Linear Models), which presents an awful lot of models as (mostly sparse) model matrices. Yay. In Stan you end up using these things as components for describing GLMs quite a bit, so it would be nice to have first-class sparse matrix support, but that’s … ongoing…

You can hack sparse matrices with ragged arrays… of course those are not in Stan either, so you hack those too. There’s an example script in a little package demo that you can install with devtools::install_github("sakrejda/hierarchy") and then run:

script_file = system.file('quick-check/check-02.R', package = 'hierarchy')

The upshot is that with a 10,000 x 200 matrix that’s about 2% non-zero, you can easily write a linear model that runs about twice as fast (and uses less memory) if you take

X * betas

and replace it with

for (i in 1:N)
  mu[i] = sum(X_vec[starts[i]:stops[i]] .* betas[NZE[starts[i]:stops[i]]]);

where X_vec holds the non-zero entries of X, NZE holds the column indices of those non-zero entries, and i is the row index. The starts/stops give where the i-th row starts and stops in the flattened vectors X_vec and NZE. This data layout is similar to the CSR (compressed sparse row) sparse matrix format.
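For intuition, the same bookkeeping is easy to check outside of Stan. Here is a small Python sketch (the names mirror the Stan code above; the toy matrix is made up) verifying that the row-sliced form reproduces the dense matrix-vector product:

```python
import numpy as np

# Small dense matrix with a few non-zero entries (illustrative only).
X = np.array([
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 3.0],
    [0.0, 0.0, 4.0],
])
betas = np.array([0.5, 1.5, 2.5])

# CSR-style flattening, mirroring X_vec / NZE / starts / stops above.
X_vec, NZE, starts, stops = [], [], [], []
pos = 1  # 1-based positions, as in Stan
for row in X:
    nz = np.nonzero(row)[0]
    starts.append(pos)
    stops.append(pos + len(nz) - 1)
    pos += len(nz)
    NZE.extend(nz + 1)      # 1-based column indices
    X_vec.extend(row[nz])   # non-zero values, row-major

# Row-by-row sliced dot products, as in the Stan loop.
mu = np.empty(X.shape[0])
for i in range(X.shape[0]):
    s, e = starts[i] - 1, stops[i]   # convert back to 0-based slices
    cols = [j - 1 for j in NZE[s:e]]
    mu[i] = np.dot(X_vec[s:e], betas[cols])

assert np.allclose(mu, X @ betas)  # matches the dense product
```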

In both cases here the X/X_vec can be either data, a parameter, or a transformed parameter. That means it can be used to construct measurement error models and other models where you need to have parameters as (some) entries of the X matrix.

The package I reference above is one I’m putting some tools like this in but the function used here is quite short:

#' Create a sparse list model matrix 
#' @param formula formula to use
#' @param data where to look up terms
#' @return list with Stan-friendly components
#' @export
flat_mm = function(formula, data = NULL, ...) {
  mm = Matrix::sparse.model.matrix(formula, data, ...)
  # Column indices of the non-zero entries, by row.
  nze = apply(mm, 1, function(x) which(x != 0))
  N = length(nze)
  stops = cumsum(sapply(nze, length))
  starts = c(1, stops[1:(N - 1)] + 1)
  nze = unlist(nze)
  # Non-zero values, by row.
  X_vec = apply(mm, 1, function(x) x[x != 0])
  mml = list(
    N = N,
    P = ncol(mm),
    N_NZE = length(nze),
    NZE = nze,
    starts = starts, stops = stops,
    X_vec = unlist(X_vec)
  )
  # Attach the response under its own name.
  t = terms(formula)
  pos = attr(t, 'response')
  dn = all.vars(t)[pos]
  mml[[dn]] = data[[dn]]
  mml
}
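To connect the list back to a Stan program: a minimal model consuming these components might look like the following (a sketch, assuming a Gaussian likelihood; `y` and `sigma` are placeholder names not produced by the function above):

```stan
data {
  int N;                        // rows of the model matrix
  int P;                        // columns of the model matrix
  int N_NZE;                    // total number of non-zero entries
  vector[N_NZE] X_vec;          // non-zero values, row-major
  int NZE[N_NZE];               // column index of each non-zero entry
  int starts[N];                // row i occupies starts[i]:stops[i]
  int stops[N];
  vector[N] y;                  // response
}
parameters {
  vector[P] betas;
  real<lower=0> sigma;
}
model {
  vector[N] mu;
  for (i in 1:N)
    mu[i] = sum(X_vec[starts[i]:stops[i]] .* betas[NZE[starts[i]:stops[i]]]);
  y ~ normal(mu, sigma);
}
```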

The version of sparse matrix multiplication we have implemented (in C++, sigh) is slower than this Stan-language version (good job on that language design, @Bob_Carpenter et al.!). Oops that I didn’t notice this when I first coded the C++ version… it might make sense to replace it…


Hi! I’m probably missing something obvious, but how does this compare/relate to the sparse matrix stuff that is already in Stan? Specifically, csr_matrix_times_vector in Stan and rstan::extract_sparse_parts() in R?

It’s faster… It’s also nearly equivalent, minus a copy and maybe less thorough consistency checking.

One lesson I keep running into with sparse matrix operations/algorithms is that there’s probably no single format/solution that’s best in general, although the kludgy Stan functions we currently have are very fixable (and probably worth a PR…).