Cost of error checking in the math library

@bgoodri, I think you mentioned that the cost of error checking was prohibitive for sparse matrices during the meeting. Did I remember that correctly?

Mind putting up an example? In most functions, I haven’t actually seen a performance difference when I use full optimized code, but I can believe it does exist. I think one of our checks still does a cholesky decompose inside, which we should rip out.

But, perhaps we can design a better way to check for input.

The problem here, if I understood Ben correctly, is that because there’s no encapsulated type for sparse matrices yet, every time anything happens with a sparse matrix a bunch of quantities that are data get checked again. These are fairly long vectors, which makes it slow.

Save this file as speed.stan, which reimplements what `csr_matrix_times_vector` does but without as much checking:

``````
functions {
  vector slow(int m, int n, vector w, int[] v, int[] u, vector b) {
    return csr_matrix_times_vector(m, n, w, v, u, b);
  }

  vector fast(int m, int n, vector w, int[] v, int[] u, vector b) {
    vector[m] out;
    int pos = 1;
    for (k in 1:m) {
      int nnz = u[k + 1] - u[k];
      if (nnz > 0) {
        vector[nnz] left;
        vector[nnz] right;
        for (j in 1:nnz) {
          right[j] = b[v[pos]];
          left[j] = w[pos];
          pos = pos + 1;
        }
        out[k] = dot_product(left, right);
      } else {
        out[k] = 0;
      }
    }
    return out;
  }
}
``````

In R, do

``````
rstan::expose_stan_functions("speed.stan")
K <- 1000L
X <- matrix(rbinom(K^2, size = 1, prob = 0.25), nrow = K, ncol = K)
b <- rnorm(K)
parts <- rstan::extract_sparse_parts(X)
all.equal(with(parts, slow(m = nrow(X), n = ncol(X), w, v, u, b)),
          with(parts, fast(m = K, n = K, w, v, u, b))) # TRUE
``````

Now do, in R

``````
slow_time <- system.time(replicate(1000, dim(with(parts, slow(m = K, n = K, w, v, u, b)))))
fast_time <- system.time(replicate(1000, dim(with(parts, fast(m = K, n = K, w, v, u, b)))))
``````

The “slow” version with `csr_matrix_times_vector` is over three times slower than the “fast” version that avoids some of the checking. The “fast” version is still checking that all the indices are in bounds.

It isn’t so much that checking is not a worthwhile idea (although it might be for expensive checks), it is that checking doubles and ints every leapfrog step is not worthwhile after the first evaluation. Also, there is currently no way to turn all the checks off for packages like rstanarm that come with compiled models and their own unit tests. And even if we refactored things so that was possible, it would be difficult to turn expensive checks off and leave cheap checks on.

In `nomad`, even lightweight checks for NaNs and parameter constraints were very expensive, and templating them out reduced run time by an order of magnitude even in simple functions.

Awesome. Thanks for the example. I’m going to do a little digging.

@bgoodri, I don’t get the same results that you do:

``````
> slow_time
user  system elapsed
1.002   0.008   1.010
> fast_time
user  system elapsed
4.394   0.014   4.417
``````

I’m on a Mac. My session info (after I loaded `library(rstan)`):

``````
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):
[1] colorspace_1.3-2 scales_0.4.1     compiler_3.4.0   lazyeval_0.2.0
[5] plyr_1.8.4       inline_0.3.14    gtable_0.2.0     tibble_1.3.3
[9] gridExtra_2.2.1  Rcpp_0.12.13     grid_3.4.0       stats4_3.4.0
[13] rlang_0.1.1      munsell_0.4.3
>
``````

Any thoughts on how this could be happening?

The only thing in my `~/.R/Makevars` is

``````
CXXFLAGS=-O3
``````

I just tried this on the Mac Pro in the office by sshing into it:

``````
> slow_time
user  system elapsed
1.680   0.006   1.790
> fast_time
user  system elapsed
9.543   0.013  10.161
``````

I was also using clang++ (with g++, the `slow_time` is similar to that of clang++, but the `fast_time` is not all that much faster). For both my laptop and the Linux computer in my office I get about

``````
> slow_time
user  system elapsed
27.800   0.000  27.804
> fast_time
user  system elapsed
9.636   0.000   9.637
``````

I guess your compiler is either memoizing or at least inlining the slow function much differently. To rule out memoization, can you try using different random coefficients each time with

``````
slow_time <- system.time(replicate(1000, dim(with(parts, slow(m = K, n = K, w, v, u, b = rnorm(K))))))
fast_time <- system.time(replicate(1000, dim(with(parts, fast(m = K, n = K, w, v, u, b = rnorm(K))))))
``````

I did that and got roughly the same results.

``````
> set.seed(100)
> slow_time <- system.time(replicate(1000, dim(with(parts, slow(m = K, n = K, w, v, u, b = rnorm(K))))))
> set.seed(100)
> fast_time <- system.time(replicate(1000, dim(with(parts, fast(m = K, n = K, w, v, u, b = rnorm(K))))))
> slow_time
user  system elapsed
1.354   0.303   1.669
> fast_time
user  system elapsed
4.721   0.377   5.135
``````

Any other ideas? Check your compiler options?

When I try this on my Mac I’m seeing similar results to Ben’s:

``````
> slow_time
user  system elapsed
37.147   0.190  37.532
> fast_time
user  system elapsed
10.804   0.035  10.870
``````

Is anyone else able to get results similar to Daniel’s, where the `slow_time` is actually faster than the `fast_time`?

I got

``````
> slow_time
user  system elapsed
1.702   0.193   1.905
> fast_time
user  system elapsed
5.085   0.168   5.264
``````

On an underpowered MacBook with not that much memory, where I haven’t specified any compiler flags or done anything fancy (literally all I’ve done is type `install.packages("rstan")`).

Edit: I did the thing Ben suggested later and got essentially the same results

``````
> slow_time
user  system elapsed
1.746   0.188   1.943
> fast_time
user  system elapsed
5.213   0.175   5.399
``````

Session info:

``````
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

loaded via a namespace (and not attached):
[1] Rcpp_0.12.13         lattice_0.20-35      grid_3.4.2
[4] plyr_1.8.4           gtable_0.2.0         stats4_3.4.2
[10] rlang_0.1.4          lazyeval_0.2.1       Matrix_1.2-11
[13] tools_3.4.2          munsell_0.4.3        yaml_2.1.14
[16] rstan_2.16.2         compiler_3.4.2       inline_0.3.14
[19] colorspace_1.3-2     gridExtra_2.3        tibble_1.3.4
``````

Thanks @anon75146577. This is super weird though. I’m not sure how those results are possible!

I have the most up-to-date OS and literally installed Xcode etc. on Monday. I’m also running on battery power in a Swedish cafe in Williamsburg and am so hungover I’m voluntarily listening to Limp Bizkit, which I feel may be relevant.


@jonah I think the specs on our machines are about the same. Maybe we should figure out why it’s running that slow on yours?

We think it is the same `char*` vs `string` thing that afflicts the models.

Ah. Ok. So that means after that’s patched, you think the slow version will run faster?