Improving the efficiency of *_dot_product for trinary-and-redundant matrices

mike-lawrence · January 25, 2023, 7:45pm

Here’s the same plots as above but with the optimization configuration noted at the end of the last post:

First the model configurations where only one method is computed (note that the configurations with the mike method computed in the model block failed to compile):

And the joint configurations (same failure-to-compile for configurations where the mike method was computed in the model block:

Which replicates my original observation whereby under this optimization configuration, the mike method performs poorly except in that unique case where they’re both computed in the TP block & a diff is computed between the cdp & mike outputs.

WardBrian · January 26, 2023, 9:15pm

Can you upload the error you get somewhere?

mike-lawrence · January 26, 2023, 9:33pm

Sure. Here’s the error:

/tmp/RtmpgEpPax/model-c8f35b59817e.hpp: In member function ‘stan::scalar_type_t<T2> none_model_NA_NA_NA_none_NA_1_model_namespace::none_model_NA_NA_NA_none_NA_1_model::log_prob_impl(VecR&, VecI&, std::ostream*) const’:
/tmp/RtmpgEpPax/model-c8f35b59817e.hpp:1000:48: error: ‘inline_mike_dot_return_sym1__’ was not declared in this scope
 1000 |           stan::math::normal_lpdf<propto__>(Y, inline_mike_dot_return_sym1__,
      |                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

make: *** [make/program:58: /tmp/RtmpgEpPax/model-c8f35b59817e] Error 1

Error: An error occured during compilation! See the message above for more information.

And here’s an example stan model that triggers that error:

functions{
	row_vector mike_dot(
		int r
		, int c
		, vector beta
		, row_vector zeros
		, array[,] int index_pos_X_mat
		, array[] int index_pos_X_sizes
		, array[,] int col_index_into_X_first_unique_pos_X_mat
		, array[] int col_index_into_X_first_unique_pos_X_sizes
		, array[,] int index_of_cZ_for_each_col_in_Z_with_pos_X_mat
		, array[] int index_of_cZ_for_each_col_in_Z_with_pos_X_sizes
		, array[,] int index_neg_X_mat
		, array[] int index_neg_X_sizes
		, array[,] int col_index_into_X_first_unique_neg_X_mat
		, array[] int col_index_into_X_first_unique_neg_X_sizes
		, array[,] int index_of_cZ_for_each_col_in_Z_with_neg_X_mat
		, array[] int index_of_cZ_for_each_col_in_Z_with_neg_X_sizes
	) {
		row_vector[c] out = zeros ;
		for(i_r in 1:r){
			out[
				index_pos_X_mat[
					i_r
					, 1:(index_pos_X_sizes[i_r])
				]
			] = (
				out[
					col_index_into_X_first_unique_pos_X_mat[
						i_r
						, 1:(col_index_into_X_first_unique_pos_X_sizes[i_r])
					]
				]
				+ beta[i_r]
			)[
				index_of_cZ_for_each_col_in_Z_with_pos_X_mat[
					i_r
					, 1:(index_of_cZ_for_each_col_in_Z_with_pos_X_sizes[i_r])
				]
			] ;
			out[
				index_neg_X_mat[
					i_r
					, 1:(index_neg_X_sizes[i_r])
				]
			] = (
				out[
					col_index_into_X_first_unique_neg_X_mat[
						i_r
						, 1:(col_index_into_X_first_unique_neg_X_sizes[i_r])
					]
				]
				- beta[i_r]
			)[
				index_of_cZ_for_each_col_in_Z_with_neg_X_mat[
					i_r
					, 1:(index_of_cZ_for_each_col_in_Z_with_neg_X_sizes[i_r])
				]
			] ;
		}
		return(out);
	}
}

data{
	int r ;
	int c ;
	matrix[r,c] X;
	row_vector[c] Y;

	int index_pos_X_mat_rows ;
	int index_pos_X_mat_cols ;
	array[index_pos_X_mat_rows,index_pos_X_mat_cols] int index_pos_X_mat ;
	array[index_pos_X_mat_rows] int index_pos_X_sizes ;

	int col_index_into_X_first_unique_pos_X_mat_rows ;
	int col_index_into_X_first_unique_pos_X_mat_cols ;
	array[
		col_index_into_X_first_unique_pos_X_mat_rows
		,col_index_into_X_first_unique_pos_X_mat_cols
	] int col_index_into_X_first_unique_pos_X_mat ;
	array[
		col_index_into_X_first_unique_pos_X_mat_rows
	] int col_index_into_X_first_unique_pos_X_sizes ;

	int index_of_cZ_for_each_col_in_Z_with_pos_X_mat_rows ;
	int index_of_cZ_for_each_col_in_Z_with_pos_X_mat_cols ;
	array[
		index_of_cZ_for_each_col_in_Z_with_pos_X_mat_rows
		, index_of_cZ_for_each_col_in_Z_with_pos_X_mat_cols
	] int index_of_cZ_for_each_col_in_Z_with_pos_X_mat ;
	array[
		index_of_cZ_for_each_col_in_Z_with_pos_X_mat_rows
	] int index_of_cZ_for_each_col_in_Z_with_pos_X_sizes ;


	int index_neg_X_mat_rows ;
	int index_neg_X_mat_cols ;
	array[index_neg_X_mat_rows,index_neg_X_mat_cols] int index_neg_X_mat ;
	array[index_neg_X_mat_rows] int index_neg_X_sizes ;

	int col_index_into_X_first_unique_neg_X_mat_rows ;
	int col_index_into_X_first_unique_neg_X_mat_cols ;
	array[
		col_index_into_X_first_unique_neg_X_mat_rows
		,col_index_into_X_first_unique_neg_X_mat_cols
	] int col_index_into_X_first_unique_neg_X_mat ;
	array[
		col_index_into_X_first_unique_neg_X_mat_rows
	] int col_index_into_X_first_unique_neg_X_sizes ;

	int index_of_cZ_for_each_col_in_Z_with_neg_X_mat_rows ;
	int index_of_cZ_for_each_col_in_Z_with_neg_X_mat_cols ;
	array[
		index_of_cZ_for_each_col_in_Z_with_neg_X_mat_rows
		, index_of_cZ_for_each_col_in_Z_with_neg_X_mat_cols
	] int index_of_cZ_for_each_col_in_Z_with_neg_X_mat ;
	array[
		index_of_cZ_for_each_col_in_Z_with_neg_X_mat_rows
	] int index_of_cZ_for_each_col_in_Z_with_neg_X_sizes ;

}

transformed data{
	row_vector[c] zeros = zeros_row_vector(c) ;
}

parameters{
	vector[r] beta ;
}
transformed parameters{

}

model{

	row_vector[c] Z_mike ;

	profile("mike"){
		Z_mike = mike_dot(
			r
			, c
			, beta
			, zeros
			, index_pos_X_mat
			, index_pos_X_sizes
			, col_index_into_X_first_unique_pos_X_mat
			, col_index_into_X_first_unique_pos_X_sizes
			, index_of_cZ_for_each_col_in_Z_with_pos_X_mat
			, index_of_cZ_for_each_col_in_Z_with_pos_X_sizes
			, index_neg_X_mat
			, index_neg_X_sizes
			, col_index_into_X_first_unique_neg_X_mat
			, col_index_into_X_first_unique_neg_X_sizes
			, index_of_cZ_for_each_col_in_Z_with_neg_X_mat
			, index_of_cZ_for_each_col_in_Z_with_neg_X_sizes
		) ;
	}

	beta ~ std_normal() ;
	Y ~ normal(Z_mike,1.0) ;
}

generated quantities{

}

WardBrian · January 26, 2023, 10:36pm

Huh, that’s a bug in the inliner that is easy to fix. No clue if it will end up impacting the rest of your examples or not, but I’ll ping you tomorrow when I have a patch

mike-lawrence · January 27, 2023, 3:33am

Oh, I just caught this. The plots above include cases where both methods are computed in the GQ block. See what you expect there @WardBrian?

WardBrian · January 27, 2023, 3:27pm

Stanc patch here: Consistently treat profile blocks as blocks in optimizer by WardBrian · Pull Request #1281 · stan-dev/stanc3 · GitHub

For some reason I can no longer run the comparison script myself (getting Error evaluating the log probability at the initial value. Exception: normal_lpdf: Random variable[1] is nan, but must be not nan! (in '/tmp/RtmpT1P7cg/model-70fe564d404cd.stan', line 19, column 1 to column 27)) to see if this has any impact on the behavior besides fixing the compilation in the model block, but I would be surprised if it did

mike-lawrence · January 27, 2023, 4:21pm

@WardBrian are there intstructions anywhere for building cmdstan using a branch of stanc?

Also, for posterity, here’s (7.5 KB) a zip file (using a .txt extension to upload, you’ll have to change it to .zip) with the automated writing of the various model configurations and sampling thereof.

WardBrian · January 27, 2023, 4:28pm

If you have the ocaml toolchain installed, you can set a make/local variable STANC to the path to a local repo and it will build and use that stanc.

I’ve also kicked off a job which will build the binaries here, so once that is complete you can download the one for your system and just place it in $CMDSTAN/bin/stanc

mike-lawrence · February 1, 2023, 3:44pm

@WardBrian So the mike-declared-and-computed-in-model-block bug blocking compilation of those models is fixed, but the performance pattern persists whereby when using -O1 the mike method only performs better than columns_dot_product in that obscure special case while without -O1 the mike method is superior in all cases (except when computed in generated quantities). Have any ideas what I can do to help track down this anomaly further?

WardBrian · February 1, 2023, 3:54pm

I think diffing and inspecting the generated C++ in those situations is the only thing really to do. I’ve been meaning to look at it again but haven’t had the chance

mike-lawrence · February 5, 2023, 3:52pm

OK, so
here’s (1.1 MB)‘s a zip-renamed-text with two stan models’ code, hpp files and exes, compiled with

stanc_options = list('O1')
cpp_options = list(
	stan_threads=FALSE
	, STAN_CPP_OPTIMS=TRUE
	, STAN_NO_RANGE_CHECKS=TRUE
	, CXXFLAGS_OPTIM = "-march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE"
	, "LDLIBS += -lblas -llapack -llapacke"
)

I’m not very C++ savvy at all, but it strikes me that where the fast model has this:

      Eigen::Matrix<local_scalar_t__,1,-1> Z_mike =
        Eigen::Matrix<local_scalar_t__,1,-1>::Constant(c, DUMMY_VAR__);

the slow model has:

      stan::conditional_var_value_t<local_scalar_t__,
        Eigen::Matrix<local_scalar_t__,1,-1>> Z_mike =
        stan::conditional_var_value_t<local_scalar_t__,
          Eigen::Matrix<local_scalar_t__,1,-1>>(Eigen::Matrix<double,1,-1>::Constant(c,
                                                  std::numeric_limits<double>::quiet_NaN(
                                                    )));

Similarly, where the fast model has:

        Eigen::Matrix<local_scalar_t__,1,-1> inline_mike_dot_return_sym5__;
...
          Eigen::Matrix<local_scalar_t__,1,-1> inline_mike_dot_out_sym6__;

the slow model has:

      stan::conditional_var_value_t<local_scalar_t__,
          Eigen::Matrix<local_scalar_t__,1,-1>> inline_mike_dot_return_sym5__;
...
          stan::conditional_var_value_t<local_scalar_t__,
            Eigen::Matrix<local_scalar_t__,1,-1>> inline_mike_dot_out_sym6__;

Not sure the implications of those differences, but a clue for you @WardBrian ?

WardBrian · February 5, 2023, 4:11pm

Yes, those declarations in the “slow” case are what we use to enable SoA matrices.

There is a flag to stanc called -fno-soa which can disable these for debugging. Assuming your hypothesis is correct, using that should make the differences go away.

Thanks for digging into this! It’s possible you found either a case where we shouldn’t be promoting to SoA matrices, or just that they’re slower in this case

mike-lawrence · February 6, 2023, 6:36pm

Oh, and @WardBrian , I remembered that when using optimization flags, the performance of the mike method (I really need a better name for it!) depended on whether it’s output and the cdp output were used in the computation of another tp (“diff”, used to validate precisely-equivalent computations):

And here’s the hpp files:
joint_noDiff_yesOpts.hpp (73.0 KB)
joint_yesDiff_yesOpts.hpp (73.6 KB)

wherein it looks like something about including the diff computation forces the faster declaration:

WardBrian · February 6, 2023, 6:58pm

The SoA theory also has an easy explanation for that: the max function doesn’t support SoA so the diff there would force a demotion to AoS

mike-lawrence · February 6, 2023, 8:04pm

@WardBrian I can confirm that using stanc_options = list('01 -fno-soa') eliminates the slowdown of the mike method!

Now, as I understand it, the thing we’re turning off with -fno-soa is the struct-of-arrays optimization, which presumably has a motivation and monolithically disabling it should lose its intended benefits, no?

If so, I wonder if there’s something about my implementation of my core idea (using indexing to reduce redundancies in the compute) that is triggering this need to disable SOA and if there might be an alternative implementation that avoids it?

WardBrian · February 6, 2023, 8:18pm

Yes, a large portion of the benefit seen by using --O1 at all is probably due to the SoA optimization, especially if you have parameters which are matrices or vectors. They’re generally preferrable, except for if you’re doing a lot of indexing and assignment, like in a loop.

It seems like the compiler should probably be marking out in your code as a bad candidate, but it does not.

WardBrian · February 6, 2023, 8:41pm

In the mean time, if you’d like to prevent SoA-ing only the your function and it’s arguments, you can use a dummy call to a function which isn’t (and probably never will be) SoA compatible, like append_col.
E.g., change the declaration of out to be row_vector[c] out = append_col(zeros, []);

This should prevent out from ever being considered for SoA promotion (note: this also prevents beta from being SoA).

Whether or not this is a “good idea” will depend on the rest of the model, since depending on how beta is used, the fact that it is not SoA could be more of a slowdown than this method is a speedup. It’s worth noting that, at least for now, columns_dot_product is also not SoA compatible, so beta in this kind of model will be AoS already.

Now that we know this was the cause, it almost seems obvious in hindsight. Replacing multiplications with indexing will be faster for doubles, and probably faster for vars, but slower for var<matrix> (SoA) types because indexing and assigning is slower for them than a multiply should be

stevebronder · February 7, 2023, 7:40pm

Sorry I’m just getting caught up to speed on this. I’ll try to take a look at the example and code this weekend. Very possible I’m missing a piece somewhere in the SoA optimization where this should get demoted.

stevebronder · May 3, 2023, 7:41pm

Just put up a PR that should help with this

github.com/stan-dev/stan

Cleanup varmat multi-index to use a hashmap instead of a set

stan-dev:develop ← stan-dev:feature/faster-varmat-multi-index

opened 07:38PM - 03 May 23 UTC

SteveBronder

+161 -191

#### Submission Checklist - [x] Run unit tests: `./runTests.py src/test/unit`… - [x] Run cpplint: `make cpplint` - [x] Declare copyright holder and open-source license: Steve Bronder #### Summary When we first wrote the assignments for the new matrix type we didn't optimize very hard for `multi_index` assignments. The discourse thread [here](https://discourse.mc-stan.org/t/improving-the-efficiency-of-dot-product-for-trinary-and-redundant-matrices/29940/38) makes a case for using these so I decided to fix these up a bit. Instead of using a set to keep track of unique indices being assigned to we just use a map with the key being the index to assign to and the value being the linear index position to access the right hand side vector / matrix. I haven't benchmarked this but it should take this from O(N * ln(N)) to O(N). It also cuts out an N sized vector we used to originally store indices. It also cuts out an if statement we had to do in the loop which should be much nicer for the CPU #### Intended Effect Make multi index assignment more performant #### How to Verify All tests still pass for multi index assignment #### Side Effects None hopefully #### Documentation No doc changes #### Copyright and Licensing Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Steve Bronder By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause) - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

The long and short is that we wrote multi index assignments in a very slow way because we didn’t think they were used very often and wanted to optimize elsewhere. But I went through and fixed this up in a way that should make mutli index assignment much nicer.

Topic		Replies	Views
Dot_product vs vectorization Developers	6	2196	April 17, 2018
Rows_dot_product is intended as a shorthand rather than compute optimization, right? Modeling	10	1081	February 15, 2021
PSA: where possible, use columns_dot_product rather than rows_dot_product Modeling techniques , specification	1	776	October 28, 2021
Using rows_dot_product in a loop to set column contents; any optimization ideas? Modeling techniques , specification	2	396	August 16, 2021
[Case-study preview] Speeding up Stan by reducing redundant computation Publicity performance	8	2119	June 6, 2020

Improving the efficiency of *_dot_product for trinary-and-redundant matrices

Related topics