Rows_dot_product is intended as a shorthand rather than compute optimization, right?

So the language has rows_dot_product, which allows one to take this code (common in hierarchical models with contrast matrices) with a double-loop:

matrix[K,N] Q ;
for(n in 1:N){
  for(k in 1:K){
    Q[k,n] = dot_product(P[n],X[k]) ;
  }
}

and get rid of the innermost loop:

matrix[K,N] Q ;
for(n in 1:N){
  Q[,n] = rows_dot_product(rep_matrix(P[n],K),X) ;
}

In my testing I haven’t actually found the rows_dot_product version to be any faster when it comes to actually doing the computations, so I wanted to double-check that it isn’t intended to achieve any, simply be a slight shorthand. (Though, if so, it doesn’t even seem to be much less in terms of typing to actually use it [literally identical character count if you exclude spaces & newlines] and arguably somewhat obfuscates the model)

Interesting to know the outcome of this thread. If you make a dummy example with increasing row and col size you could profile it.

Remove that rep_matrix as that will be slow. Instead make an array of matrices or something before entering the loop.

Yes, certainly best practice and facilitated by the awesome new profiling. I just wanted to check if there was existing knowledge from the devs whether superior compute performance was indeed expected for rows_dot_product given its original design/intent, in which case I’d do real benchmarking (c.f. the casual benchmarking I’ve done when simply comparing to 2-loops it particular applied contexts).

Hm, I’m not sure that’s possible, at least in the typical use case where P is an N\times J matrix of parameters, so not available in the transformed data section for pre-computation. Or am I missing something?

If P is just a matrix then you don’t need the rep matrix. Remove that and just call

Q[,n] = rows_dot_product(P,X) ;

Ah, I can see that my attempt to reduce things to something minimal for easy visual comparison has obfuscated important structure. Here’s a full example, showing that you can’t just use the whole P matrix, but need to get a separate matrix per level of the N that we’re looping over. The model below uses more verbose-but-meaningful names for things, so in the verbiage there the subj_coef matrix (previously P) has a row for each subject containing coefficients, and the uW matrix (previously X) represents a contrast matrix with a column for each coefficient and the rows reflect all unique combinations of the levels of each contrast. So to use rows_dot_product for a given subject, you need to first repeat that subject’s coefficient row_vector to make a matrix with as many rows as there are rows in the uW matrix.

data{

  // num_obs_total: number of trials
  int<lower=1> num_obs_total ;

  // obs: observation on each trial
  vector[num_obs_total] obs ;

  // num_subj: number of subj
  int<lower=1> num_subj ;

  // num_rows_uW: num rows in uW
  int<lower=1> num_rows_uW ;

  // num_cols_uW: num cols in uW
  int<lower=1> num_cols_uW ;

  // uW: unique entries in the within predictor matrix
  matrix[num_rows_uW,num_cols_uW] uW ;

  // index: index of each trial in flattened subject-by-condition value matrix
  int obs_index[num_obs_total] ;

}

parameters{

  // chol_corr: population-level correlations (on cholesky factor scale) amongst within-subject predictors
  cholesky_factor_corr[num_cols_uW] chol_corr ;

  // mean_coef: mean (across subj) for each coefficient
  row_vector[num_cols_uW] mean_coef ;

  // sd_coef: sd (across subj) for each coefficient
  vector<lower=0>[num_cols_uW] sd_coef ;

  // multi_normal_helper: a helper variable for implementing non-centered parameterization
  matrix[num_cols_uW,num_subj] multi_normal_helper ;

  // noise: measurement noise
  real<lower=0> noise ;

}
model{

  ////
  // Priors
  ////

  // multi_normal_helper must have normal(0,1) prior for non-centered parameterization
  to_vector(multi_normal_helper) ~ std_normal() ;

  // relatively flat prior on correlations
  chol_corr ~ lkj_corr_cholesky(2) ;

  // independent (unpooled) normal(0,1) priors on all elements of sd_coef
  sd_coef ~ std_normal() ;

  // independent (unpooled) normal(0,1) priors on all elements of mean_coef
  mean_coef ~ std_normal() ;

  // low-near-zero prior on measurement noise
  noise ~ weibull(2,1) ; // weibull(2,1) is peaked around .8

  // compute coefficients for each subject/condition
  matrix[num_subj,num_cols_uW] subj_coef = (
    rep_matrix(mean_coef,num_subj)
    + transpose(
      diag_pre_multiply(sd_coef,chol_corr)
      * multi_normal_helper
    )
  ) ;

  // Loop over subj and conditions to compute unique entries in design matrix
  matrix[num_rows_uW,num_subj] value_for_subj_cond ;
  for(this_subj in 1:num_subj){
    for(this_condition in 1:num_rows_uW){
      value_for_subj_cond[this_condition,this_subj] = dot_product(
        subj_coef[this_subj]
        , uW[this_condition]
      ) ;
    }
    // // slightly less explicit but equally fast:
    // value_for_subj_cond[,this_subj] = rows_dot_product(
    //  rep_matrix(
    //    subj_coef[this_subj]
    //    , num_rows_uW
    //  )
    //  , uW
    // ) ;
  }

  // Likelihood
  obs ~ normal(
    to_vector(value_for_subj_cond)[obs_index]
    , noise
  ) ;

}

wait, maybe if instead of looping over subjects and using rows_dot_product within a subject to get the values for all conditions implied by that subject’s coefficients and the contrasts embodied by the rows of uW, I could loop over conditions, permitting using the entire subj_coef matrix as is and pre-computing an array in transformed data with each entry consisting of a matrix of a row from uW repeated for num_subjects rows!

1 Like

Right now in Stan 2.26 this is just a shorthand. There is no custom gradient for rows_dot_product: math/rows_dot_product.hpp at develop · stan-dev/math · GitHub

However, with stuff coming along in 2.27 this should get A LOT faster for some cases. Someone could also write a custom gradient that would provide a speedup for all cases.

Its a similar story with vectorized code: non-lpdf/lpmf functions do not have much speedup when used vectorized right now, but there is work in the pipeline that will improve performance significantly, very likely in 2.27.

2 Likes

FYI, this definitely is the fastest of the 3 variants. Working up a reprex to explore varying combos of num_subj and num_cols_uW, but it seems to be across the board.

Some timing results (v1 = two-loops, v2= looping over subjects, v3 = looping over conditions):

num_subj:10
num_cols_uW:4
  vers   mean     sd   min   max
* <chr> <dbl>  <dbl> <dbl> <dbl>
1 v1    0.561 0.0336 0.528 0.612
2 v2    0.867 0.0542 0.805 0.940
3 v3    0.522 0.0316 0.490 0.565

num_subj:100
num_cols_uW:4
  vers   mean    sd   min   max
* <chr> <dbl> <dbl> <dbl> <dbl>
1 v1     4.10 0.245  3.69  4.32
2 v2     6.42 0.386  5.76  6.77
3 v3     3.57 0.212  3.22  3.76

num_subj:10
num_cols_uW:16
  vers   mean    sd   min   max
* <chr> <dbl> <dbl> <dbl> <dbl>
1 v1     12.3 0.414  11.8  12.9
2 v2     30.2 0.999  29.2  31.6
3 v3     12.1 0.419  11.7  12.7

num_subj:100
num_cols_uW:8
  vers   mean    sd   min   max
* <chr> <dbl> <dbl> <dbl> <dbl>
1 v1     8.14  1.12  7.30  10.4
2 v2    15.9   2.23 14.3   20.4
3 v3     7.98  1.10  7.25  10.2
2 Likes

Attached is the model that profiles the 3 variants and R code for simulating data.
hwg_compare.stan (4.2 KB)
hwg_sim_and_fit.r (9.2 KB)
helper_functions.r (4.9 KB)

So the third option does:

transformed data{
	...
	//uW_rows_repeated_per_subj: an array where each element is a row from uW repeated for num_subj rows
	matrix[num_subj,num_cols_uW] uW_rows_repeated_per_subj[num_rows_uW] ;
	for(row in 1:num_rows_uW){
		uW_rows_repeated_per_subj[row] = rep_matrix(uW[row],num_subj) ;
	}
}
...
transformed parameters{
	...
	matrix[num_rows_uW,num_subj] value_for_subj_cond3 ;
	profile("v3"){
		// Loop over conditions to compute unique entries in design matrix
		for(this_condition in 1:num_rows_uW){
			value_for_subj_cond3[this_condition] = transpose(
				rows_dot_product(
					subj_coef_
					, uW_rows_repeated_per_subj[this_condition]
				)
			);
		}
	}
}

Which suggests a fourth variant where the entries in uW_rows_repeated_per_subj are row-bound together into a single matrix and ditto subj_coef_ is broadcast to repeat it’s entries num_rows_uW times, yielding the one-liner:

	vector[num_rows_uW*num_subj] value_for_subj_cond4 ;
	profile("v4"){
		value_for_subj_cond4 = rows_dot_product(
			subj_coef_[subj_index_for_uW_repeated_by_subj]
			, uW_rows_repeated_per_subj_as_matrix
		);
	}

I expected this would be about the same performance as v3 (with the hope that the coming speedups @rok_cesnovar mentioned would make v4 faster than v3), but v4 actually runs substantially slower than v3. It’s not as slow as v2, but still consistent and mostly in the forward_time. So I guess the broadcasting to create a bigger matrix is more costly (at present) than a loop with no broadcasting. It’ll be interesting to see if this difference persists once the new speedups arrive.

(attaching the updated scripts that include v4:
hwg_compare.stan (4.6 KB)
hwg_sim_and_fit.r (9.7 KB)
helper_functions.r (4.9 KB)
)