Improving the efficiency of *_dot_product for trinary-and-redundant matrices

In hierarchical modelling scenarios we often have an expression taking the dot product of each row (or column) of a data matrix and a parameter vector:

data{
    int r ;
    int c ;
    matrix[r,c] X;
    ... // other data 
}
parameters{
    vector[c] beta ;
    ... // other parameters
}
model{
    ... // priors
    vector[r] X_dot_beta = rows_dot_product(X, rep_matrix(beta', r)) ; // equivalently, X * beta
    ... // more model, using X_dot_beta in likelihood
}

Often X consists of columns of so-called “contrasts” that take only the values -1, 0, or 1. With such matrices, the multiplications in the dot product merely serve to “keep” the corresponding beta’s contribution to the sum when the entry is 1, to ignore the corresponding beta when the entry is 0, and to “keep but flip the sign” when the entry is -1. While the dot product is a useful shorthand for expressing things, computationally it’s quite expensive, so it should be possible to achieve a speedup by instead using either loops-and-logic or pre-computed indices to achieve the same result that the multiplies achieve in the dot product.
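
As a minimal sketch of the sign-based substitution for a single row (just to fix ideas; this is not the implementation used below, and the variable names are made up):

# one row's dot product without any multiplications, assuming x_row contains only -1, 0, or 1
x_row = c(1, 0, -1, 1)
b = c(0.2, 1.5, -0.7, 0.4)
s = 0
for(j in seq_along(x_row)){
	if(x_row[j] == 1) s = s + b[j]        # keep
	else if(x_row[j] == -1) s = s - b[j]  # keep but flip the sign
	# x_row[j] == 0: contributes nothing, so skip entirely
}
s - sum(x_row * b) # 0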

Furthermore, since there’s high redundancy in X (each column contains only a few distinct values, so many rows share the same values in their leading columns), it should be possible to build up the sums in a tree-like process to reduce the total number of addition operations.
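
To make the redundancy idea concrete with a toy sketch (not the general algorithm; the names here are made up): if two rows agree in their leading columns, the partial sum for that shared prefix only needs to be computed once.

# rows 1 and 2 share the prefix (1, -1) in their first two columns, so the
# partial sum b[1] - b[2] is computed once and then reused for both rows
X_small = rbind(
	c( 1, -1, 0)
	, c( 1, -1, 1)
)
b = c(0.5, 2.0, -1.0)
shared = b[1] - b[2]          # one subtraction, shared by both rows
row1 = shared                 # third entry is 0: nothing more to add
row2 = shared + b[3]          # third entry is 1: one more addition
c(row1, row2) - as.vector(X_small %*% b) # both differences are 0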

Below is an example (in R code) of a very inelegant implementation of both the multiply-avoidance and branched-sums ideas. The standard rows_dot_product would compute r × c multiplications and then r × c additions, which works out to 648 operations in the 81 × 4 case of the demo below, while the branched-sums approach achieves an identical result with far fewer operations (80; I haven’t worked out the generalized formula for the expected number of ops in terms of r and c yet, but there surely is one).
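
In lieu of a closed-form formula, here’s a small helper (hypothetical; it’s not used in the demo script below) that empirically counts the additions the branched approach needs for an arbitrary trinary X: at stage i it only touches the unique prefixes of columns 1:i whose i-th value is nonzero. For the 81 × 4 demo X constructed below it agrees with the add_count tally (80).

count_branched_adds = function(X){
	total = 0
	for(i in 1:ncol(X)){
		# one representative per unique combination of values in columns 1:i
		prefixes = unique(X[, 1:i, drop = FALSE])
		# an addition (or subtraction) is only needed where the i-th value is nonzero
		total = total + sum(prefixes[, i] != 0)
	}
	total
}
# count_branched_adds(X) # 80 for the demo X below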

Obviously there’s a lot of overhead in terms of loops and caching of intermediate results, so I haven’t bothered trying to implement it in Stan for a performance comparison, but I wanted to post here in case the more CS-savvy might see a more elegant solution that achieves the same effect. Thoughts?

library(tidyverse)

# make data
(
	expand_grid(
		a = c(-1,0,1)
		, b = c(-1,0,1)
		, c = c(-1,0,1)
		, d = c(-1,0,1)
	)
	%>% as.matrix()
) ->
	X

#example parameter vector
beta = rnorm(ncol(X))

# compute the standard row-wise dot-product
expected_dot_out = t(X%*%beta)

# precompute some indices: for each i, group the rows of X by their values in
# columns 1:i, recording one representative row per unique prefix (first_xrow)
# and which unique prefix each original row belongs to (xrow_in_X_tmp)
X_tmp_xrow_list = list()
for(i in 1:ncol(X)){
	(
		X[, 1:i, drop = FALSE]
		%>% as_tibble()
		%>% mutate(xrow=1:n())
	) ->
		X_with_xrow
	
	(
		X_with_xrow
		%>% group_by_at(.vars=vars(-xrow))
		%>% summarise(
			first_xrow = xrow[1]
			, .groups = 'drop'
		)
		%>% mutate(
			xrow_in_X_tmp = 1:n()
		)
	) ->
		tmp
	(
		X_with_xrow
		%>% select(-xrow)
		%>% left_join(tmp)
		%>% nest(xrow_in_X_tmp=xrow_in_X_tmp)
	) ->
		X_tmp_xrow_list[[i]]
}

# helper function for the no-multiply dot product: adds or subtracts beta
# according to the sign of each entry of X, skipping zeros entirely;
# add_count (a global counter, updated via <<-) tallies the additions performed
dot_like2 = function(X,Z,beta){
	out = Z
	for(i in 1:length(X)){
		if(X[i]!=0){
			if(X[i]>0){
				out[i] = out[i] + beta
			}else{
				out[i] = out[i] - beta
			}
			add_count <<- add_count + 1
		}
	}
	return(out)
}
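
# quick sanity check of dot_like2 on its own, using just the first column of X
# (an illustrative aside, not part of the branched-sums algorithm below)
add_count = 0
check = dot_like2(X[,1], rep(0,nrow(X)), beta[1])
all(near(check, X[,1]*beta[1])) #TRUE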

#initialize a count of additions performed
add_count = 0
#running partial sums, one per original row of X
Z = rep(0,nrow(X))
#loop over the columns of X to do the branched sum
for(i in 1:ncol(X)){
	# values of column i at one representative row per unique prefix of columns 1:i
	X_tmp = X[X_tmp_xrow_list[[i]]$first_xrow,1]
	# drop the column just consumed, so the original column i+1 becomes column 1 next time
	if(ncol(X)>1){
		X = as.matrix(X[,2:ncol(X)])
	}
	# partial sums accumulated so far, taken at those same representative rows
	Z_tmp = Z[X_tmp_xrow_list[[i]]$first_xrow]
	# add/subtract beta[i] once per unique prefix rather than once per row ...
	Z_new = dot_like2(X_tmp,Z_tmp,beta[i])
	# ... then map each prefix's running sum back out to the full set of rows
	Z = Z_new[unlist(X_tmp_xrow_list[[i]]$xrow_in_X_tmp)]
}
print(add_count) #80
all(near(Z,expected_dot_out)) #TRUE

Is the main difference between this and using the sparse matrix utilities built into Stan (8.3 Sparse matrix arithmetic | Stan Functions Reference) the fact that in your implementation, no multiplications are performed because X is known to be either -1, 0, or 1?

I suspect in the end that it would be hard to beat sparse matrix multiplication. My intuition is that it’s faster to just do the multiplications than to have a branching structure, because you completely avoid any branch-prediction misses and the processor can do a lot of prefetching.

I don’t think these matrices are anywhere near the recommended 90% sparsity that is necessary to see any performance benefit from using sparse representations.
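
For reference, in the full-factorial demo X from the first post the fraction of zero entries is only about a third, which is easy to check directly:

mean(X == 0) # 0.333..., nowhere near the ~90% sparsity mentioned above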

I believe rstanarm performs this optimization for the Z effects matrix; you can see it if you search through the stan files for a variable called special_case (defined here):

Hm, the comment on that declaration seems to indicate it’s for a purpose unrelated to this. Or at least, it achieves something akin to what I’ve described only for the simplest scenario, where there’s one column in X that’s all ones.

I think it is the same idea you’re expressing here, just simpler. For one, the Z involved actually is given in sparse representation, like @hsusmann suggests, and for another, it doesn’t treat -1 specially.

It’s a bit difficult to trace the variables through the code (and the comments can be misleading), but that variable does indeed switch between a matrix product and an indexing operation, and if I recall correctly it has quite the speed benefit in doing so:

I’m not sure how far the additional idea of trying to re-use intermediate steps would take you. It seems like it could dramatically reduce the number of arithmetic operations, but my HPC colleagues like to remind me that math is basically free compared to the cost of a cache miss or memory load, so it would be really interesting to see which would win out.


@bgoodri since you seem to have written a similar trick in rstanarm, care to take a look at the idea in the first post above?

So, I worked out generalizable Stan code for the algorithm I proposed above, and I indeed find that it can achieve 2-3x speedups compared to columns_dot_product!

BUT there’s something super spooky going on with my benchmarking such that when I do:

...
transformed parameters{
    row_vector[c] Z_dot_X_cdp ;
    row_vector[c] Z_dot_X_mike ;
    profile("cdp"){ Z_dot_X_cdp = columns_dot_product( data_matrix_X, parameter_matrix_Z ) ; }
    profile("mike"){ Z_dot_X_mike = mikes_cdot_product( data_matrix_X, parameter_matrix_Z, ... ) ; }
    real max_abs_diff = max(abs(Z_dot_X_cdp - Z_dot_X_mike)) ;
}
...

I observe the above-noted 2-3x speedups. However, if I remove the computation of the variable max_abs_diff (present merely as an artifact of my verifying equivalence) and instead have merely:

...
transformed parameters{
    row_vector[c] Z_dot_X_cdp ;
    row_vector[c] Z_dot_X_mike ;
    profile("cdp_nodiff"){ Z_dot_X_cdp = columns_dot_product( data_matrix_X, parameter_matrix_Z ) ; }
    profile("mike_nodiff"){ Z_dot_X_mike = mikes_cdot_product( data_matrix_X, parameter_matrix_Z, ... ) ; }
}
...

Then the speedup vanishes! And it vanishes specifically because the mike_nodiff profile slows down dramatically. That is, when comparing the profile times, they rank as: mike fastest, cdp and cdp_nodiff equivalent and next-fastest, and mike_nodiff by far the slowest. Furthermore, when I split things into separate models, columns_dot_product has the same timing as the cdp* times, but my algorithm never achieves the same performance as when it is put in the same model as columns_dot_product.

So it seems that having my method in the same model as columns_dot_product AND having its output used in a computation together with the output of columns_dot_product somehow produces an arcane computational confluence that yields a substantial speedup; but of course that’s not useful to anyone, since real-world usage wouldn’t bother computing the same thing twice. I’ve reached the limits of my technical and experimental expertise in accounting for how this could even happen, so I thought I’d post back here in the hope that someone can discern what’s going on. My hope is that whatever is happening to make my method fast in that obscure benchmarking scenario can be made to happen when it is run alone in more typical real-world usage.

I’ll post some R & Stan code in a little bit (just cleaning it up now) to demonstrate these results, but note that I did experiment with whether the order of computation matters; that is, above I show code where columns_dot_product is used first, then my algorithm, but the above observations persist when that order is reversed. (So I don’t think it’s simply a matter of the second computation benefiting from the first computation’s results persisting in a cache somewhere; at least not in any obvious way.)

Thoughts?


Here’s the R code and Stan models:
compare_cdp_mike.r (2.5 KB)
cdp.stan (288 Bytes)
mike.stan (4.2 KB)
yesdiff_order1.stan (4.4 KB)
yesdiff_order0.stan (4.4 KB)
nodiff_order1.stan (4.4 KB)
nodiff_order0.stan (4.4 KB)
do_plot.r (615 Bytes)
make_data_for_branched_cols.r (1.3 KB)

The order dependence you describe above made me suspect something in the compiler optimizations. Here is what I get on my machine with stanc --O1 and default make/local (aka O=3).

# A tibble: 10 × 2
   name                value
   <fct>               <dbl>
 1 yesdiff_order1_mike  2.41
 2 yesdiff_order0_mike  2.45
 3 nodiff_order0_cdp    2.98
 4 yesdiff_order1_cdp   3.08
 5 yesdiff_order0_cdp   3.13
 6 cdp_alone            3.22
 7 nodiff_order1_cdp    3.34
 8 nodiff_order0_mike  12.0 
 9 mike                13.6 
10 nodiff_order1_mike  15.0 

If I do this again with O=0 and no stanc optimizations, I get:

# A tibble: 10 × 2
   name                value
   <fct>               <dbl>
 1 mike                 2.38
 2 yesdiff_order1_mike  2.62
 3 yesdiff_order0_mike  2.67
 4 nodiff_order1_mike   2.84
 5 nodiff_order0_mike   2.88
 6 yesdiff_order1_cdp   3.41
 7 nodiff_order0_cdp    3.45
 8 nodiff_order1_cdp    3.48
 9 cdp_alone            3.49
10 yesdiff_order0_cdp   3.63

which is… weird. All the mikes are now faster than the cdps, and the overall variance is way lower. At this point I think I know what’s going on.

So I ran with stanc optimizations but still O=0 in make/local and got:

# A tibble: 10 × 2
   name                value
   <fct>               <dbl>
 1 yesdiff_order1_mike  2.80
 2 yesdiff_order0_mike  3.00
 3 yesdiff_order1_cdp   3.53
 4 cdp_alone            3.54
 5 nodiff_order1_cdp    4.00
 6 nodiff_order0_cdp    4.01
 7 yesdiff_order0_cdp   4.43
 8 mike                12.5 
 9 nodiff_order0_mike  17.3 
10 nodiff_order1_mike  17.4 

Which I’m willing to call essentially identical to the first run since I had other processes running, etc.

Basically, it seems like the weird results you were observing are 100% due to edge cases in the Stanc compiler optimizations. I’m going to do some digging into what is happening, and probably open a bug report based on it, but that’s the answer.


Visually inspecting the .hpp files for both --O0 and --O1, it honestly doesn’t look like much is happening besides function inlining.

Edit: The following is probably not the real issue, see below

My guess, though, is that function inlining performs a lot of extra copies compared to a normal function call. That’s because for any container type, we pass it to a function by const reference, but when we inline we end up generating calls to stan::math::assign (I bet we could avoid this, maybe @stevebronder can weigh in). It’s possible that the order dependence was caused by the C++ compiler realizing it can elide some of these if the call to mike happens after the first call, but that’s speculation on my part.


Edit: The following is probably not the real issue, see below

Hm, function inlining also may be creating unnecessary vars! For example, the argument zeros is a matrix of double, which is what the template parameter expands to under normal function evaluation. But when it is inlined, we end up with

          Eigen::Matrix<local_scalar_t__,1,-1> inline_mike_dot_out_sym2__;
          stan::model::assign(inline_mike_dot_out_sym2__, zeros,
            "assigning variable inline_mike_dot_out_sym2__");

where local_scalar_t__ is var in this case. This is actually the source of the copy, I think, since assign on two matrices of doubles should just be a forward.

Definitely worth investigating in the compiler. My guess is really bad behavior is only seen when you are passing several/large containers of data to a function.


This is super interesting and possibly why I’ve seen Stan struggle as I scale my data size up. Can’t wait to see this fixed.


Maybe I’m misunderstanding your theory here, but note that the performance pattern persists when my approach is written as code right in the TP block rather than as a user-defined function.

FYI, Daniel Lee suggested (over on the slack) that I try some permutations of the order of declarations/computations, so I’m working through that now (using automated writing of individual Stan files reflecting a given permutation).

I’m not sure if we currently have anyone familiar enough with the function inliner to fix this quickly, so it might be a while. Also, the fix that does come will probably only really work if you mark your function’s argument types as data, which is a good idea whenever you know they are!

That again seems reasonable to me, since writing it directly is essentially what inlining does. The further changes in speed due to ordering or the presence/absence of something are (again, speculation) possibly caused by the C++ compiler, though verifying that is hard in practice.

My guess would be, though, that adding another set of profiles in the generated quantities block would show the same behavior as my “no stanc optimizations” run above, e.g. mike being slightly faster in all cases. Function inlining in that block should never create vars.

Cool, that case is in the set of permutations I’m looking at now…


Hm, my original guess about why the function inliner was slow might have been too hasty - I think what I was originally reading as unnecessary copies are actually just part of @mike-lawrence’s code and are indeed necessary (e.g. the line row_vector[c] out = zeros;, which is a genuine copy); the name just gets mangled.

So, I’m back to believing the function inliner has something to do with it, but not really sure what.

I don’t know if it helps at all, but just now I was trying to compile models after first running

stanc_options = list('O1')
cpp_options = list(
	stan_threads=FALSE
	, STAN_CPP_OPTIMS=TRUE
	, STAN_NO_RANGE_CHECKS=TRUE
	, CXXFLAGS_OPTIM = "-march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE"
	, "LDLIBS += -lblas -llapack -llapacke"
)
cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = FALSE)
cmdstanr::rebuild_cmdstan(cores = parallel::detectCores()/2)

And then compiling as

mod = cmdstanr::cmdstan_model(
	stan_file = my_stan_file
	, cpp_options = cpp_options
	, stanc_options = stanc_options
	, force_recompile = TRUE
)

And I was only able to get models to compile where both the declaration and the computation are done in the TP block; those with the declaration and computation done in the model block alone, or split between the model & TP blocks, failed to compile.

BUT deleting my make/local file, rebuilding cmdstan, and passing NULL as the cpp & stanc options yields successful builds.

So I put together a script to automatically piece together the core Stan file components in a variety of orders (where things are declared, where things are computed, the order of declarations, the order of computations, etc.), yielding 100 combinations. I compiled cmdstan with an empty make/local and compiled each model using no stanc or cpp options. Then I sampled each using the same data across all sampling runs, 16 runs per combination, and the same seed per run across all combinations.
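
For concreteness, here’s a rough sketch of the kind of benchmarking loop involved (the names mods, for the list of compiled models, and stan_data, for the shared data list, are placeholders; the actual script is in the attachments above):

# hypothetical sketch: sample each compiled model with the same data and seed,
# and collect the per-block profiling tables (which include total_time)
profile_results = list()
for(nm in names(mods)){
	fit = mods[[nm]]$sample(
		data = stan_data
		, seed = 1
		, chains = 1
	)
	profile_results[[nm]] = fit$profiles()[[1]]
}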

Here’s the profiled total_time for the combinations where either columns_dot_product() (“cdp”) or mike_dot() (“mike”) were used alone:


Facet columns differentiate where the computation took place, and facet rows show the two methods plus the ratio at the bottom.

So, cdp is faster only when computed in the GQ block.

Here’s the same (without the ratios) for the combinations where both were computed in the model:

And here’s the ratios:

So, again, cdp is faster only when it is computed in GQ.

This matches the results from Brian’s data above under conditions of no optimizations.

I’ll next iterate over a few optimization configurations. My first will be the configuration where I expect to observe my original result (whereby mike was faster in only a small subset of model configurations), which had cmdstan built via:

stanc_options = list('O1')
cpp_options = list(
	stan_threads=FALSE
	, STAN_CPP_OPTIMS=TRUE
	, STAN_NO_RANGE_CHECKS=TRUE
	, CXXFLAGS_OPTIM = "-march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE"
	, "LDLIBS += -lblas -llapack -llapacke"
)
fs::file_delete(fs::path(cmdstanr::cmdstan_path(), 'make', 'local')) #necessary as cmdstan_make_local doesn't erase what's there if cpp_options=NULL
cmdstanr::cmdstan_make_local(cpp_options = cpp_options, append = FALSE)
cmdstanr::rebuild_cmdstan(cores = parallel::detectCores()/2)

and models compiled via

cmdstanr::cmdstan_model(
	stan_file = x$stan_file_path
	, cpp_options = cpp_options
	, stanc_options = stanc_options
)

@WardBrian, would any other optimization configurations be of interest?

Note I’m saving all the profiling information from all of this, so we can dig into the details of which model & optimization configurations yield which profiling metrics, in case things beyond just time are diagnostic of what’s going on here. (I know, Brian, you’ve already filed a bug report associated with one theory of what’s happening; I’m just trying to be super thorough in characterizing things while we wait for that issue to be fixed.)
