Improving the efficiency of *_dot_product for trinary-and-redundant matrices

Here’s the same plots as above but with the optimization configuration noted at the end of the last post:

First the model configurations where only one method is computed (note that the configurations with the mike method computed in the model block failed to compile):


And the joint configurations (same failure-to-compile for configurations where the mike method was computed in the model block:

Which replicates my original observation whereby under this optimization configuration, the mike method performs poorly except in that unique case where they’re both computed in the TP block & a diff is computed between the cdp & mike outputs.

1 Like

Can you upload the error you get somewhere?

Sure. Here’s the error:

/tmp/RtmpgEpPax/model-c8f35b59817e.hpp: In member function ‘stan::scalar_type_t<T2> none_model_NA_NA_NA_none_NA_1_model_namespace::none_model_NA_NA_NA_none_NA_1_model::log_prob_impl(VecR&, VecI&, std::ostream*) const’:
/tmp/RtmpgEpPax/model-c8f35b59817e.hpp:1000:48: error: ‘inline_mike_dot_return_sym1__’ was not declared in this scope
 1000 |           stan::math::normal_lpdf<propto__>(Y, inline_mike_dot_return_sym1__,
      |                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

make: *** [make/program:58: /tmp/RtmpgEpPax/model-c8f35b59817e] Error 1

Error: An error occured during compilation! See the message above for more information.

And here’s an example stan model that triggers that error:

functions{
	row_vector mike_dot(
		int r
		, int c
		, vector beta
		, row_vector zeros
		, array[,] int index_pos_X_mat
		, array[] int index_pos_X_sizes
		, array[,] int col_index_into_X_first_unique_pos_X_mat
		, array[] int col_index_into_X_first_unique_pos_X_sizes
		, array[,] int index_of_cZ_for_each_col_in_Z_with_pos_X_mat
		, array[] int index_of_cZ_for_each_col_in_Z_with_pos_X_sizes
		, array[,] int index_neg_X_mat
		, array[] int index_neg_X_sizes
		, array[,] int col_index_into_X_first_unique_neg_X_mat
		, array[] int col_index_into_X_first_unique_neg_X_sizes
		, array[,] int index_of_cZ_for_each_col_in_Z_with_neg_X_mat
		, array[] int index_of_cZ_for_each_col_in_Z_with_neg_X_sizes
	) {
		row_vector[c] out = zeros ;
		for(i_r in 1:r){
			out[
				index_pos_X_mat[
					i_r
					, 1:(index_pos_X_sizes[i_r])
				]
			] = (
				out[
					col_index_into_X_first_unique_pos_X_mat[
						i_r
						, 1:(col_index_into_X_first_unique_pos_X_sizes[i_r])
					]
				]
				+ beta[i_r]
			)[
				index_of_cZ_for_each_col_in_Z_with_pos_X_mat[
					i_r
					, 1:(index_of_cZ_for_each_col_in_Z_with_pos_X_sizes[i_r])
				]
			] ;
			out[
				index_neg_X_mat[
					i_r
					, 1:(index_neg_X_sizes[i_r])
				]
			] = (
				out[
					col_index_into_X_first_unique_neg_X_mat[
						i_r
						, 1:(col_index_into_X_first_unique_neg_X_sizes[i_r])
					]
				]
				- beta[i_r]
			)[
				index_of_cZ_for_each_col_in_Z_with_neg_X_mat[
					i_r
					, 1:(index_of_cZ_for_each_col_in_Z_with_neg_X_sizes[i_r])
				]
			] ;
		}
		return(out);
	}
}

data{
	int r ;
	int c ;
	matrix[r,c] X;
	row_vector[c] Y;

	int index_pos_X_mat_rows ;
	int index_pos_X_mat_cols ;
	array[index_pos_X_mat_rows,index_pos_X_mat_cols] int index_pos_X_mat ;
	array[index_pos_X_mat_rows] int index_pos_X_sizes ;

	int col_index_into_X_first_unique_pos_X_mat_rows ;
	int col_index_into_X_first_unique_pos_X_mat_cols ;
	array[
		col_index_into_X_first_unique_pos_X_mat_rows
		,col_index_into_X_first_unique_pos_X_mat_cols
	] int col_index_into_X_first_unique_pos_X_mat ;
	array[
		col_index_into_X_first_unique_pos_X_mat_rows
	] int col_index_into_X_first_unique_pos_X_sizes ;

	int index_of_cZ_for_each_col_in_Z_with_pos_X_mat_rows ;
	int index_of_cZ_for_each_col_in_Z_with_pos_X_mat_cols ;
	array[
		index_of_cZ_for_each_col_in_Z_with_pos_X_mat_rows
		, index_of_cZ_for_each_col_in_Z_with_pos_X_mat_cols
	] int index_of_cZ_for_each_col_in_Z_with_pos_X_mat ;
	array[
		index_of_cZ_for_each_col_in_Z_with_pos_X_mat_rows
	] int index_of_cZ_for_each_col_in_Z_with_pos_X_sizes ;


	int index_neg_X_mat_rows ;
	int index_neg_X_mat_cols ;
	array[index_neg_X_mat_rows,index_neg_X_mat_cols] int index_neg_X_mat ;
	array[index_neg_X_mat_rows] int index_neg_X_sizes ;

	int col_index_into_X_first_unique_neg_X_mat_rows ;
	int col_index_into_X_first_unique_neg_X_mat_cols ;
	array[
		col_index_into_X_first_unique_neg_X_mat_rows
		,col_index_into_X_first_unique_neg_X_mat_cols
	] int col_index_into_X_first_unique_neg_X_mat ;
	array[
		col_index_into_X_first_unique_neg_X_mat_rows
	] int col_index_into_X_first_unique_neg_X_sizes ;

	int index_of_cZ_for_each_col_in_Z_with_neg_X_mat_rows ;
	int index_of_cZ_for_each_col_in_Z_with_neg_X_mat_cols ;
	array[
		index_of_cZ_for_each_col_in_Z_with_neg_X_mat_rows
		, index_of_cZ_for_each_col_in_Z_with_neg_X_mat_cols
	] int index_of_cZ_for_each_col_in_Z_with_neg_X_mat ;
	array[
		index_of_cZ_for_each_col_in_Z_with_neg_X_mat_rows
	] int index_of_cZ_for_each_col_in_Z_with_neg_X_sizes ;

}

transformed data{
	row_vector[c] zeros = zeros_row_vector(c) ;
}

parameters{
	vector[r] beta ;
}
transformed parameters{

}

model{

	row_vector[c] Z_mike ;

	profile("mike"){
		Z_mike = mike_dot(
			r
			, c
			, beta
			, zeros
			, index_pos_X_mat
			, index_pos_X_sizes
			, col_index_into_X_first_unique_pos_X_mat
			, col_index_into_X_first_unique_pos_X_sizes
			, index_of_cZ_for_each_col_in_Z_with_pos_X_mat
			, index_of_cZ_for_each_col_in_Z_with_pos_X_sizes
			, index_neg_X_mat
			, index_neg_X_sizes
			, col_index_into_X_first_unique_neg_X_mat
			, col_index_into_X_first_unique_neg_X_sizes
			, index_of_cZ_for_each_col_in_Z_with_neg_X_mat
			, index_of_cZ_for_each_col_in_Z_with_neg_X_sizes
		) ;
	}

	beta ~ std_normal() ;
	Y ~ normal(Z_mike,1.0) ;
}

generated quantities{

}

Huh, that’s a bug in the inliner that is easy to fix. No clue if it will end up impacting the rest of your examples or not, but I’ll ping you tomorrow when I have a patch

1 Like

Oh, I just caught this. The plots above include cases where both methods are computed in the GQ block. See what you expect there @WardBrian?

Stanc patch here: Consistently treat profile blocks as blocks in optimizer by WardBrian · Pull Request #1281 · stan-dev/stanc3 · GitHub

For some reason I can no longer run the comparison script myself (getting Error evaluating the log probability at the initial value. Exception: normal_lpdf: Random variable[1] is nan, but must be not nan! (in '/tmp/RtmpT1P7cg/model-70fe564d404cd.stan', line 19, column 1 to column 27)) to see if this has any impact on the behavior besides fixing the compilation in the model block, but I would be surprised if it did

@WardBrian are there intstructions anywhere for building cmdstan using a branch of stanc?

Also, for posterity, here’s (7.5 KB) a zip file (using a .txt extension to upload, you’ll have to change it to .zip) with the automated writing of the various model configurations and sampling thereof.

If you have the ocaml toolchain installed, you can set a make/local variable STANC to the path to a local repo and it will build and use that stanc.

I’ve also kicked off a job which will build the binaries here, so once that is complete you can download the one for your system and just place it in $CMDSTAN/bin/stanc

@WardBrian So the mike-declared-and-computed-in-model-block bug blocking compilation of those models is fixed, but the performance pattern persists whereby when using -O1 the mike method only performs better than columns_dot_product in that obscure special case while without -O1 the mike method is superior in all cases (except when computed in generated quantities). Have any ideas what I can do to help track down this anomaly further?

I think diffing and inspecting the generated C++ in those situations is the only thing really to do. I’ve been meaning to look at it again but haven’t had the chance

1 Like

OK, so
here’s (1.1 MB)‘s a zip-renamed-text with two stan models’ code, hpp files and exes, compiled with

stanc_options = list('O1')
cpp_options = list(
	stan_threads=FALSE
	, STAN_CPP_OPTIMS=TRUE
	, STAN_NO_RANGE_CHECKS=TRUE
	, CXXFLAGS_OPTIM = "-march=native -mtune=native -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE"
	, "LDLIBS += -lblas -llapack -llapacke"
)

I’m not very C++ savvy at all, but it strikes me that where the fast model has this:

      Eigen::Matrix<local_scalar_t__,1,-1> Z_mike =
        Eigen::Matrix<local_scalar_t__,1,-1>::Constant(c, DUMMY_VAR__);

the slow model has:

      stan::conditional_var_value_t<local_scalar_t__,
        Eigen::Matrix<local_scalar_t__,1,-1>> Z_mike =
        stan::conditional_var_value_t<local_scalar_t__,
          Eigen::Matrix<local_scalar_t__,1,-1>>(Eigen::Matrix<double,1,-1>::Constant(c,
                                                  std::numeric_limits<double>::quiet_NaN(
                                                    )));

Similarly, where the fast model has:

        Eigen::Matrix<local_scalar_t__,1,-1> inline_mike_dot_return_sym5__;
...
          Eigen::Matrix<local_scalar_t__,1,-1> inline_mike_dot_out_sym6__;

the slow model has:

      stan::conditional_var_value_t<local_scalar_t__,
          Eigen::Matrix<local_scalar_t__,1,-1>> inline_mike_dot_return_sym5__;
...
          stan::conditional_var_value_t<local_scalar_t__,
            Eigen::Matrix<local_scalar_t__,1,-1>> inline_mike_dot_out_sym6__;

Not sure the implications of those differences, but a clue for you @WardBrian ?

1 Like

Yes, those declarations in the “slow” case are what we use to enable SoA matrices.

There is a flag to stanc called -fno-soa which can disable these for debugging. Assuming your hypothesis is correct, using that should make the differences go away.

Thanks for digging into this! It’s possible you found either a case where we shouldn’t be promoting to SoA matrices, or just that they’re slower in this case

1 Like

Oh, and @WardBrian , I remembered that when using optimization flags, the performance of the mike method (I really need a better name for it!) depended on whether it’s output and the cdp output were used in the computation of another tp (“diff”, used to validate precisely-equivalent computations):

And here’s the hpp files:
joint_noDiff_yesOpts.hpp (73.0 KB)
joint_yesDiff_yesOpts.hpp (73.6 KB)

wherein it looks like something about including the diff computation forces the faster declaration:

The SoA theory also has an easy explanation for that: the max function doesn’t support SoA so the diff there would force a demotion to AoS

@WardBrian I can confirm that using stanc_options = list('01 -fno-soa') eliminates the slowdown of the mike method!

Now, as I understand it, the thing we’re turning off with -fno-soa is the struct-of-arrays optimization, which presumably has a motivation and monolithically disabling it should lose its intended benefits, no?

If so, I wonder if there’s something about my implementation of my core idea (using indexing to reduce redundancies in the compute) that is triggering this need to disable SOA and if there might be an alternative implementation that avoids it?

Yes, a large portion of the benefit seen by using --O1 at all is probably due to the SoA optimization, especially if you have parameters which are matrices or vectors. They’re generally preferrable, except for if you’re doing a lot of indexing and assignment, like in a loop.

It seems like the compiler should probably be marking out in your code as a bad candidate, but it does not.

1 Like

In the mean time, if you’d like to prevent SoA-ing only the your function and it’s arguments, you can use a dummy call to a function which isn’t (and probably never will be) SoA compatible, like append_col.
E.g., change the declaration of out to be row_vector[c] out = append_col(zeros, []);

This should prevent out from ever being considered for SoA promotion (note: this also prevents beta from being SoA).

Whether or not this is a “good idea” will depend on the rest of the model, since depending on how beta is used, the fact that it is not SoA could be more of a slowdown than this method is a speedup. It’s worth noting that, at least for now, columns_dot_product is also not SoA compatible, so beta in this kind of model will be AoS already.

Now that we know this was the cause, it almost seems obvious in hindsight. Replacing multiplications with indexing will be faster for doubles, and probably faster for vars, but slower for var<matrix> (SoA) types because indexing and assigning is slower for them than a multiply should be

2 Likes

Sorry I’m just getting caught up to speed on this. I’ll try to take a look at the example and code this weekend. Very possible I’m missing a piece somewhere in the SoA optimization where this should get demoted.