Cost of error checking in the math library

@bgoodri, I think you mentioned that the cost of error checking was prohibitive for sparse matrices during the meeting. Did I remember that correctly?

Mind putting up an example? In most functions, I haven’t actually seen a performance difference when I use full optimized code, but I can believe it does exist. I think one of our checks still does a cholesky decompose inside, which we should rip out.

But, perhaps we can design a better way to check for input.

The problem here, if I understood Ben correctly, is that because there’s no encapsulated type for sparse matrices yet, every time anything happens with a sparse matrix a bunch of quantities that are data get checked again. These are fairly long vectors, which makes it slow.

Save this file as speed.stan, which reimplements what `csr_matrix_times_vector` does but without as much checking:

``````
functions {
  vector slow(int m, int n, vector w, int[] v, int[] u, vector b) {
    return csr_matrix_times_vector(m, n, w, v, u, b);
  }

  vector fast(int m, int n, vector w, int[] v, int[] u, vector b) {
    vector[m] out;
    int pos = 1;
    for (k in 1:m) {
      int nnz = u[k + 1] - u[k];
      if (nnz > 0) {
        vector[nnz] left;
        vector[nnz] right;
        for (j in 1:nnz) {
          right[j] = b[v[pos]];
          left[j] = w[pos];
          pos = pos + 1;
        }
        out[k] = dot_product(left, right);
      } else {
        out[k] = 0;
      }
    }
    return out;
  }
}
``````

In R, do

``````
rstan::expose_stan_functions("speed.stan")
K <- 1000L
X <- matrix(rbinom(K^2, size = 1, prob = 0.25), nrow = K, ncol = K)
b <- rnorm(K)
parts <- rstan::extract_sparse_parts(X)
all.equal(with(parts, slow(m = nrow(X), n = ncol(X), w, v, u, b)),
          with(parts, fast(m = K, n = K, w, v, u, b))) # TRUE
``````

Now do, in R

``````
slow_time <- system.time(replicate(1000, dim(with(parts, slow(m = K, n = K, w, v, u, b)))))
fast_time <- system.time(replicate(1000, dim(with(parts, fast(m = K, n = K, w, v, u, b)))))
``````

The “slow” version with `csr_matrix_times_vector` is over three times slower than the “fast” version that avoids some of the checking. The “fast” version is still checking that all the indices are in bounds.

It isn’t so much that checking is not a worthwhile idea (although it might be for expensive checks), it is that checking doubles and ints every leapfrog step is not worthwhile after the first evaluation. Also, there is currently no way to turn all the checks off for packages like rstanarm that come with compiled models and their own unit tests. And even if we refactored things so that was possible, it would be difficult to turn expensive checks off and leave cheap checks on.

In `nomad`, even lightweight checks for NaNs and parameter constraints were very expensive, and templating them out reduced run time by an order of magnitude even in simple functions.

Awesome. Thanks for the example. I’m going to do a little digging.

@bgoodri, I don’t get the same results that you do:

``````
> slow_time
user  system elapsed
1.002   0.008   1.010
> fast_time
user  system elapsed
4.394   0.014   4.417
``````

I’m on a Mac. My session info (after I loaded `library(rstan)`):

``````
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

loaded via a namespace (and not attached):
[1] colorspace_1.3-2 scales_0.4.1     compiler_3.4.0   lazyeval_0.2.0
[5] plyr_1.8.4       inline_0.3.14    gtable_0.2.0     tibble_1.3.3
[9] gridExtra_2.2.1  Rcpp_0.12.13     grid_3.4.0       stats4_3.4.0
[13] rlang_0.1.1      munsell_0.4.3
>
``````

Any thoughts on how this could be happening?

The only thing in my `~/.R/Makevars` is

``````
CXXFLAGS=-O3
``````

I just tried this on the Mac Pro in the office by sshing into it:

``````
> slow_time
user  system elapsed
1.680   0.006   1.790
> fast_time
user  system elapsed
9.543   0.013  10.161
``````

I was also using clang++ (with g++, the `slow_time` is similar to that of clang++, but the `fast_time` is not all that much faster). For both my laptop and the Linux computer in my office I get about

``````
> slow_time
user  system elapsed
27.800   0.000  27.804
> fast_time
user  system elapsed
9.636   0.000   9.637
``````

I guess your compiler is either memoizing or at least inlining the slow function much differently. To rule out memoization, can you try using different random coefficients each time with

``````
slow_time <- system.time(replicate(1000, dim(with(parts, slow(m = K, n = K, w, v, u, b = rnorm(K))))))
fast_time <- system.time(replicate(1000, dim(with(parts, fast(m = K, n = K, w, v, u, b = rnorm(K))))))
``````

I did that and got roughly the same results.

``````
> set.seed(100)
> slow_time <- system.time(replicate(1000, dim(with(parts, slow(m = K, n = K, w, v, u, b = rnorm(K))))))
> set.seed(100)
> fast_time <- system.time(replicate(1000, dim(with(parts, fast(m = K, n = K, w, v, u, b = rnorm(K))))))
> slow_time
user  system elapsed
1.354   0.303   1.669
> fast_time
user  system elapsed
4.721   0.377   5.135
``````

Any other ideas? Check your compiler options?

When I try this on my Mac I’m seeing similar results to Ben’s:

``````
> slow_time
user  system elapsed
37.147   0.190  37.532
> fast_time
user  system elapsed
10.804   0.035  10.870
``````

Is anyone else able to get results similar to Daniel’s, where the `slow_time` is actually faster than the `fast_time`?

I got

``````
> slow_time
user  system elapsed
1.702   0.193   1.905
> fast_time
user  system elapsed
5.085   0.168   5.264
``````

On an underpowered MacBook with not that much memory, where I haven’t specified any compiler flags or done anything fancy (literally all I’ve done is type `install.packages("rstan")`).

Edit: I did the thing Ben suggested later and got essentially the same results

``````
> slow_time
user  system elapsed
1.746   0.188   1.943
> fast_time
user  system elapsed
5.213   0.175   5.399
``````

Session info:

``````
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

loaded via a namespace (and not attached):
[1] Rcpp_0.12.13         lattice_0.20-35      grid_3.4.2
[4] plyr_1.8.4           gtable_0.2.0         stats4_3.4.2
[10] rlang_0.1.4          lazyeval_0.2.1       Matrix_1.2-11
[13] tools_3.4.2          munsell_0.4.3        yaml_2.1.14
[16] rstan_2.16.2         compiler_3.4.2       inline_0.3.14
[19] colorspace_1.3-2     gridExtra_2.3        tibble_1.3.4
``````

Thanks @anon75146577. This is super weird though. I’m not sure how those results are possible!

I have the most up-to-date OS and literally installed Xcode etc. on Monday. I’m also running on battery power in a Swedish cafe in Williamsburg and am so hungover I’m voluntarily listening to Limp Bizkit, which I feel may be relevant.


@jonah I think the specs on our machines are about the same. Maybe we should figure out why it’s running that slow on yours?

We think it is the same `char*` vs `string` thing that afflicts the models.

Ah. Ok. So that means after that’s patched, you think the slow version will run faster?