Rows_dot_product is intended as a shorthand rather than a compute optimization, right?

So the language has rows_dot_product, which allows one to take this double-loop code (common in hierarchical models with contrast matrices):

matrix[K,N] Q ;
for(n in 1:N){
  for(k in 1:K){
    Q[k,n] = dot_product(P[n],X[k]) ;
  }
}
and get rid of the innermost loop:

matrix[K,N] Q ;
for(n in 1:N){
  Q[,n] = rows_dot_product(rep_matrix(P[n],K),X) ;
}

In my testing I haven’t actually found the rows_dot_product version to be any faster when it comes to actually doing the computations, so I wanted to double-check that it isn’t intended to achieve any speedup but is simply a slight shorthand. (Though, if so, it doesn’t even save much typing [literally identical character count if you exclude spaces & newlines] and arguably somewhat obfuscates the model.)
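For concreteness, the two formulations are algebraically identical; here's a minimal numpy sketch (the sizes and the matrices P and X are made-up stand-ins for the Stan objects) confirming both produce the same Q:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, J = 4, 3, 2            # made-up small sizes
P = rng.normal(size=(N, J))  # stand-in for the Stan matrix P
X = rng.normal(size=(K, J))  # stand-in for the Stan matrix X

# v1: double loop, Q[k,n] = dot_product(P[n], X[k])
Q1 = np.empty((K, N))
for n in range(N):
    for k in range(K):
        Q1[k, n] = P[n] @ X[k]

# v2: rows_dot_product(rep_matrix(P[n],K), X) -- row-wise dot products
# np.tile plays the role of rep_matrix here
Q2 = np.empty((K, N))
for n in range(N):
    Q2[:, n] = np.sum(np.tile(P[n], (K, 1)) * X, axis=1)

assert np.allclose(Q1, Q2)
```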

It would be interesting to know the outcome of this thread. If you make a dummy example with increasing row and column sizes, you could profile it.

Remove that rep_matrix as that will be slow. Instead make an array of matrices or something before entering the loop.

Yes, that’s certainly best practice, and it’s facilitated by the awesome new profiling. I just wanted to check whether there was existing knowledge from the devs on whether superior compute performance was indeed expected for rows_dot_product given its original design/intent, in which case I’d do real benchmarking (cf. the casual benchmarking I’ve done when simply comparing it to the two-loop version in particular applied contexts).

Hm, I’m not sure that’s possible, at least in the typical use case where P is an N×J matrix of parameters, so not available in the transformed data section for pre-computation. Or am I missing something?

If P is just a matrix then you don’t need the rep_matrix. Remove that and just call

Q[,n] = rows_dot_product(P,X) ;

Ah, I can see that my attempt to reduce things to something minimal for easy visual comparison has obfuscated important structure. Here’s a full example, showing that you can’t just use the whole P matrix, but need to get a separate matrix per level of the N that we’re looping over. The model below uses more verbose-but-meaningful names for things, so in the verbiage there the subj_coef matrix (previously P) has a row for each subject containing coefficients, and the uW matrix (previously X) represents a contrast matrix with a column for each coefficient and the rows reflect all unique combinations of the levels of each contrast. So to use rows_dot_product for a given subject, you need to first repeat that subject’s coefficient row_vector to make a matrix with as many rows as there are rows in the uW matrix.


data{

  // num_obs_total: number of trials
  int<lower=1> num_obs_total ;

  // obs: observation on each trial
  vector[num_obs_total] obs ;

  // num_subj: number of subj
  int<lower=1> num_subj ;

  // num_rows_uW: num rows in uW
  int<lower=1> num_rows_uW ;

  // num_cols_uW: num cols in uW
  int<lower=1> num_cols_uW ;

  // uW: unique entries in the within predictor matrix
  matrix[num_rows_uW,num_cols_uW] uW ;

  // index: index of each trial in flattened subject-by-condition value matrix
  int obs_index[num_obs_total] ;

}
parameters{

  // chol_corr: population-level correlations (on cholesky factor scale) amongst within-subject predictors
  cholesky_factor_corr[num_cols_uW] chol_corr ;

  // mean_coef: mean (across subj) for each coefficient
  row_vector[num_cols_uW] mean_coef ;

  // sd_coef: sd (across subj) for each coefficient
  vector<lower=0>[num_cols_uW] sd_coef ;

  // multi_normal_helper: a helper variable for implementing non-centered parameterization
  matrix[num_cols_uW,num_subj] multi_normal_helper ;

  // noise: measurement noise
  real<lower=0> noise ;

}
model{

  // Priors

  // multi_normal_helper must have normal(0,1) prior for non-centered parameterization
  to_vector(multi_normal_helper) ~ std_normal() ;

  // relatively flat prior on correlations
  chol_corr ~ lkj_corr_cholesky(2) ;

  // independent (unpooled) normal(0,1) priors on all elements of sd_coef
  sd_coef ~ std_normal() ;

  // independent (unpooled) normal(0,1) priors on all elements of mean_coef
  mean_coef ~ std_normal() ;

  // low-near-zero prior on measurement noise
  noise ~ weibull(2,1) ; // weibull(2,1) is peaked around .8

  // compute coefficients for each subject/condition
  matrix[num_subj,num_cols_uW] subj_coef = (
    rep_matrix(mean_coef,num_subj)
    + transpose(
      diag_pre_multiply(sd_coef,chol_corr)
      * multi_normal_helper
    )
  ) ;

  // Loop over subj and conditions to compute unique entries in design matrix
  matrix[num_rows_uW,num_subj] value_for_subj_cond ;
  for(this_subj in 1:num_subj){
    for(this_condition in 1:num_rows_uW){
      value_for_subj_cond[this_condition,this_subj] = dot_product(
        subj_coef[this_subj]
        , uW[this_condition]
      ) ;
    }
    // // slightly less explicit but equally fast:
    // value_for_subj_cond[,this_subj] = rows_dot_product(
    //  rep_matrix(
    //    subj_coef[this_subj]
    //    , num_rows_uW
    //  )
    //  , uW
    // ) ;
  }

  // Likelihood
  obs ~ normal(
    to_vector(value_for_subj_cond)[obs_index]
    , noise
  ) ;

}
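As an aside for readers unfamiliar with the non-centered construction of subj_coef above: it is the standard diag_pre_multiply trick, which in numpy terms looks roughly like the following (the sizes and the example correlation matrix are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
num_subj, num_cols_uW = 6, 3
mean_coef = rng.normal(size=num_cols_uW)        # row_vector mean_coef
sd_coef = np.abs(rng.normal(size=num_cols_uW))  # vector<lower=0> sd_coef
z = rng.normal(size=(num_cols_uW, num_subj))    # multi_normal_helper, iid std normal

# Cholesky factor of a made-up correlation matrix (chol_corr in the model)
corr = np.array([[1.0, 0.3, 0.1],
                 [0.3, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
chol_corr = np.linalg.cholesky(corr)

# diag_pre_multiply(sd_coef, chol_corr) * multi_normal_helper, transposed,
# then shifted by the mean: each row is one subject's coefficient vector
subj_coef = mean_coef + (np.diag(sd_coef) @ chol_corr @ z).T

assert subj_coef.shape == (num_subj, num_cols_uW)
```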


Wait: maybe instead of looping over subjects and using rows_dot_product within a subject to get the values for all conditions implied by that subject’s coefficients and the contrasts embodied by the rows of uW, I could loop over conditions. That would permit using the entire subj_coef matrix as-is, after pre-computing an array in transformed data where each entry is a matrix consisting of a row from uW repeated for num_subj rows!
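The algebra checks out: looping over subjects (repeating the subject’s coefficient row) and looping over conditions (repeating the condition’s contrast row) give the same matrix. A numpy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
num_subj, num_rows_uW, num_cols_uW = 5, 4, 3  # made-up sizes
subj_coef = rng.normal(size=(num_subj, num_cols_uW))
uW = rng.normal(size=(num_rows_uW, num_cols_uW))

# v2: loop over subjects, repeating the subject's coefficient row over uW's rows
v2 = np.empty((num_rows_uW, num_subj))
for s in range(num_subj):
    v2[:, s] = np.sum(np.tile(subj_coef[s], (num_rows_uW, 1)) * uW, axis=1)

# v3: loop over conditions, repeating the condition's contrast row over subj_coef's rows
v3 = np.empty((num_rows_uW, num_subj))
for c in range(num_rows_uW):
    v3[c, :] = np.sum(subj_coef * np.tile(uW[c], (num_subj, 1)), axis=1)

assert np.allclose(v2, v3)                 # same result either way
assert np.allclose(v2, uW @ subj_coef.T)   # both are just a matrix product
```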


Right now in Stan 2.26 this is just a shorthand. There is no custom gradient for rows_dot_product: see rows_dot_product.hpp on the develop branch of the stan-dev/math repo on GitHub.

However, with stuff coming along in 2.27 this should get A LOT faster for some cases. Someone could also write a custom gradient that would provide a speedup for all cases.

It’s a similar story with vectorized code: non-lpdf/lpmf functions do not get much speedup when used vectorized right now, but there is work in the pipeline that will improve performance significantly, very likely in 2.27.


FYI, this definitely is the fastest of the 3 variants. I’m working up a reprex to explore varying combos of num_subj and num_cols_uW, but it seems to hold across the board.

Some timing results (v1 = two-loops, v2= looping over subjects, v3 = looping over conditions):

  vers   mean     sd   min   max
* <chr> <dbl>  <dbl> <dbl> <dbl>
1 v1    0.561 0.0336 0.528 0.612
2 v2    0.867 0.0542 0.805 0.940
3 v3    0.522 0.0316 0.490 0.565

  vers   mean    sd   min   max
* <chr> <dbl> <dbl> <dbl> <dbl>
1 v1     4.10 0.245  3.69  4.32
2 v2     6.42 0.386  5.76  6.77
3 v3     3.57 0.212  3.22  3.76

  vers   mean    sd   min   max
* <chr> <dbl> <dbl> <dbl> <dbl>
1 v1     12.3 0.414  11.8  12.9
2 v2     30.2 0.999  29.2  31.6
3 v3     12.1 0.419  11.7  12.7

  vers   mean    sd   min   max
* <chr> <dbl> <dbl> <dbl> <dbl>
1 v1     8.14  1.12  7.30  10.4
2 v2    15.9   2.23 14.3   20.4
3 v3     7.98  1.10  7.25  10.2

Attached is the model that profiles the 3 variants and R code for simulating data.
hwg_compare.stan (4.2 KB)
hwg_sim_and_fit.r (9.2 KB)
helper_functions.r (4.9 KB)

So the third option does:

transformed data{
	// uW_rows_repeated_per_subj: an array where each element is a row from uW repeated for num_subj rows
	matrix[num_subj,num_cols_uW] uW_rows_repeated_per_subj[num_rows_uW] ;
	for(row in 1:num_rows_uW){
		uW_rows_repeated_per_subj[row] = rep_matrix(uW[row],num_subj) ;
	}
}
transformed parameters{
	matrix[num_rows_uW,num_subj] value_for_subj_cond3 ;
	// Loop over conditions to compute unique entries in design matrix
	for(this_condition in 1:num_rows_uW){
		value_for_subj_cond3[this_condition] = transpose(
			rows_dot_product(
				subj_coef
				, uW_rows_repeated_per_subj[this_condition]
			)
		) ;
	}
}

Which suggests a fourth variant where the entries in uW_rows_repeated_per_subj are row-bound together into a single matrix, and likewise subj_coef is broadcast to repeat its entries num_rows_uW times, yielding the one-liner:

	vector[num_rows_uW*num_subj] value_for_subj_cond4 ;
	value_for_subj_cond4 = rows_dot_product(
		subj_coef_repeated // subj_coef stacked num_rows_uW times
		, uW_rows_repeated_per_subj_as_matrix
	) ;

I expected this to perform about the same as v3 (with the hope that the coming speedups @rok_cesnovar mentioned would make v4 faster than v3), but v4 actually runs substantially slower than v3. It’s not as slow as v2, but the slowdown is consistent and shows up mostly in the forward_time. So I guess broadcasting to create a bigger matrix is (at present) more costly than a loop with no broadcasting. It’ll be interesting to see whether this difference persists once the new speedups arrive.
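To confirm that v4 computes the same quantities as v3 despite the different layout, here’s a numpy sketch (names like uW_big and coef_big are made-up stand-ins for the row-bound matrices described above; the flattening shown is numpy’s row-major order):

```python
import numpy as np

rng = np.random.default_rng(0)
num_subj, num_rows_uW, num_cols_uW = 5, 4, 3  # made-up sizes
subj_coef = rng.normal(size=(num_subj, num_cols_uW))
uW = rng.normal(size=(num_rows_uW, num_cols_uW))

# v3: loop over conditions
v3 = np.empty((num_rows_uW, num_subj))
for c in range(num_rows_uW):
    v3[c, :] = np.sum(subj_coef * uW[c], axis=1)

# v4: one big row-wise dot product over row-bound matrices
uW_big = np.repeat(uW, num_subj, axis=0)         # each uW row repeated num_subj times
coef_big = np.tile(subj_coef, (num_rows_uW, 1))  # subj_coef stacked num_rows_uW times
v4 = np.sum(coef_big * uW_big, axis=1)           # length num_rows_uW * num_subj

# flattening v3 row-major recovers v4's ordering
assert np.allclose(v4, v3.ravel())
```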

Attaching the updated scripts that include v4:
hwg_compare.stan (4.6 KB)
hwg_sim_and_fit.r (9.7 KB)
helper_functions.r (4.9 KB)