# rows_dot_product is intended as a shorthand rather than a compute optimization, right?

So the language has rows_dot_product, which allows one to take this code (common in hierarchical models with contrast matrices) with a double-loop:

matrix[K,N] Q ;
for(n in 1:N){
  for(k in 1:K){
    Q[k,n] = dot_product(P[n],X[k]) ;
  }
}


and get rid of the innermost loop:

matrix[K,N] Q ;
for(n in 1:N){
  Q[,n] = rows_dot_product(rep_matrix(P[n],K),X) ;
}


In my testing I haven’t actually found the rows_dot_product version to be any faster when it comes to actually doing the computations, so I wanted to double-check that it isn’t intended to achieve any speedup, and is simply a slight shorthand. (Though, if so, it doesn’t even save much typing [literally identical character count if you exclude spaces and newlines] and arguably somewhat obfuscates the model.)
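For readers wanting to check the algebra: the two formulations compute the same K-by-N result, which is just the matrix product of X and the transpose of P. A quick NumPy sketch (the `rows_dot_product` helper here is a hand-rolled stand-in for the Stan function; shapes are arbitrary):

```python
import numpy as np

# Illustrative shapes only; P and X mirror the Stan objects of the same names.
rng = np.random.default_rng(0)
N, K, J = 4, 3, 2
P = rng.normal(size=(N, J))  # one row of coefficients per n
X = rng.normal(size=(K, J))  # one row of predictors per k

# double-loop version: Q[k, n] = dot_product(P[n], X[k])
Q_loop = np.empty((K, N))
for n in range(N):
    for k in range(K):
        Q_loop[k, n] = P[n] @ X[k]

# stand-in for Stan's rows_dot_product: row-wise dot products of equal-shape matrices
def rows_dot_product(A, B):
    return (A * B).sum(axis=1)

# rows_dot_product version: broadcast P[n] to K rows, then one call per n
Q_rdp = np.empty((K, N))
for n in range(N):
    Q_rdp[:, n] = rows_dot_product(np.tile(P[n], (K, 1)), X)  # rep_matrix(P[n], K)

assert np.allclose(Q_loop, Q_rdp)
assert np.allclose(Q_loop, X @ P.T)  # both are just a matrix product
```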

It would be interesting to know the outcome of this thread. If you make a dummy example with increasing row and column sizes, you could profile it.

Remove that rep_matrix as that will be slow. Instead make an array of matrices or something before entering the loop.

Yes, certainly best practice, and facilitated by the awesome new profiling. I just wanted to check whether there was existing knowledge from the devs on whether superior compute performance was indeed expected from rows_dot_product given its original design/intent, in which case I’d do real benchmarking (cf. the casual benchmarking I’ve done when simply comparing it to the 2-loop version in particular applied contexts).

Hm, I’m not sure that’s possible, at least in the typical use case where P is an N×J matrix of parameters, so not available in the transformed data block for pre-computation. Or am I missing something?

If P is just a matrix then you don’t need the rep_matrix. Remove that and just call

Q[,n] = rows_dot_product(P,X) ;

Ah, I can see that my attempt to reduce things to something minimal for easy visual comparison has obfuscated important structure. Here’s a full example, showing that you can’t just use the whole P matrix, but need to get a separate matrix per level of the N that we’re looping over. The model below uses more verbose-but-meaningful names for things, so in the verbiage there the subj_coef matrix (previously P) has a row for each subject containing coefficients, and the uW matrix (previously X) represents a contrast matrix with a column for each coefficient and the rows reflect all unique combinations of the levels of each contrast. So to use rows_dot_product for a given subject, you need to first repeat that subject’s coefficient row_vector to make a matrix with as many rows as there are rows in the uW matrix.

data{

  // num_obs_total: number of trials
  int<lower=1> num_obs_total ;

  // obs: observation on each trial
  vector[num_obs_total] obs ;

  // num_subj: number of subjects
  int<lower=1> num_subj ;

  // num_rows_uW: number of rows in uW
  int<lower=1> num_rows_uW ;

  // num_cols_uW: number of columns in uW
  int<lower=1> num_cols_uW ;

  // uW: unique entries in the within predictor matrix
  matrix[num_rows_uW,num_cols_uW] uW ;

  // obs_index: index of each trial in the flattened subject-by-condition value matrix
  int<lower=1,upper=num_rows_uW*num_subj> obs_index[num_obs_total] ;

}

parameters{

  // chol_corr: population-level correlations (on Cholesky-factor scale) amongst within-subject predictors
  cholesky_factor_corr[num_cols_uW] chol_corr ;

  // mean_coef: mean (across subj) for each coefficient
  row_vector[num_cols_uW] mean_coef ;

  // sd_coef: sd (across subj) for each coefficient
  vector<lower=0>[num_cols_uW] sd_coef ;

  // multi_normal_helper: a helper variable for implementing the non-centered parameterization
  matrix[num_cols_uW,num_subj] multi_normal_helper ;

  // noise: measurement noise
  real<lower=0> noise ;

}
model{

  ////
  // Priors
  ////

  // multi_normal_helper must have a std_normal() prior for the non-centered parameterization
  to_vector(multi_normal_helper) ~ std_normal() ;

  // relatively flat prior on correlations
  chol_corr ~ lkj_corr_cholesky(2) ;

  // independent (unpooled) std_normal() priors on all elements of sd_coef
  sd_coef ~ std_normal() ;

  // independent (unpooled) std_normal() priors on all elements of mean_coef
  mean_coef ~ std_normal() ;

  // prior on measurement noise with low density near zero
  noise ~ weibull(2,1) ; // weibull(2,1) is peaked around .8

  // compute coefficients for each subject/condition
  matrix[num_subj,num_cols_uW] subj_coef = (
    rep_matrix(mean_coef,num_subj)
    + transpose(
      diag_pre_multiply(sd_coef,chol_corr)
      * multi_normal_helper
    )
  ) ;

  // Loop over subj and conditions to compute unique entries in the design matrix
  matrix[num_rows_uW,num_subj] value_for_subj_cond ;
  for(this_subj in 1:num_subj){
    for(this_condition in 1:num_rows_uW){
      value_for_subj_cond[this_condition,this_subj] = dot_product(
        subj_coef[this_subj]
        , uW[this_condition]
      ) ;
    }
    // // slightly less explicit but equally fast:
    // value_for_subj_cond[,this_subj] = rows_dot_product(
    //   rep_matrix(
    //     subj_coef[this_subj]
    //     , num_rows_uW
    //   )
    //   , uW
    // ) ;
  }

  // Likelihood
  obs ~ normal(
    to_vector(value_for_subj_cond)[obs_index]
    , noise
  ) ;

}
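For anyone puzzling over the likelihood line: Stan’s to_vector() flattens a matrix in column-major order, and obs_index then picks each trial’s cell out of that flattened vector. A small NumPy sketch of the indexing (shapes and trial indices are made up for illustration; note NumPy is 0-based where Stan is 1-based):

```python
import numpy as np

# toy subject-by-condition value matrix (num_rows_uW = 3 conditions, num_subj = 2)
num_rows_uW, num_subj = 3, 2
value_for_subj_cond = np.arange(num_rows_uW * num_subj, dtype=float).reshape(
    num_rows_uW, num_subj
)

# Stan's to_vector() is column-major, i.e. Fortran order
flat = value_for_subj_cond.flatten(order="F")

# obs_index maps each trial to a cell of the flattened matrix; Stan is 1-based,
# hence the -1 when emulating the lookup here
obs_index = np.array([1, 4, 6])  # hypothetical trial indices
mu = flat[obs_index - 1]  # to_vector(value_for_subj_cond)[obs_index]
```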


Wait: maybe instead of looping over subjects and using rows_dot_product within a subject to get the values for all conditions implied by that subject’s coefficients and the contrasts embodied by the rows of uW, I could loop over conditions. That would permit using the entire subj_coef matrix as-is and pre-computing, in transformed data, an array where each entry is a matrix consisting of one row from uW repeated for num_subj rows!
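That condition-loop idea can be checked outside Stan to confirm it gives the same answer. Here’s a NumPy sketch (names mirror the model; `rows_dot_product` is a hand-rolled stand-in for the Stan function, and the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
num_subj, num_cols_uW, num_rows_uW = 5, 3, 4
subj_coef = rng.normal(size=(num_subj, num_cols_uW))
uW = rng.normal(size=(num_rows_uW, num_cols_uW))

# stand-in for Stan's rows_dot_product
def rows_dot_product(A, B):
    return (A * B).sum(axis=1)

# "transformed data": each entry repeats one uW row num_subj times
uW_rows_repeated_per_subj = [
    np.tile(uW[r], (num_subj, 1)) for r in range(num_rows_uW)
]

# loop over conditions, using the whole subj_coef matrix each time
value = np.empty((num_rows_uW, num_subj))
for r in range(num_rows_uW):
    value[r] = rows_dot_product(subj_coef, uW_rows_repeated_per_subj[r])

assert np.allclose(value, uW @ subj_coef.T)  # same answer as the double loop
```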


Right now in Stan 2.26 this is just a shorthand. There is no custom gradient for rows_dot_product: see rows_dot_product.hpp on the develop branch of stan-dev/math on GitHub.

However, with stuff coming along in 2.27 this should get A LOT faster for some cases. Someone could also write a custom gradient that would provide a speedup for all cases.

It’s a similar story with vectorized code: non-lpdf/lpmf functions do not get much speedup when used vectorized right now, but there is work in the pipeline that will improve performance significantly, very likely in 2.27.


FYI, this is definitely the fastest of the 3 variants. I’m working up a reprex to explore varying combos of num_subj and num_cols_uW, but it seems to hold across the board.

Some timing results (v1 = two loops, v2 = looping over subjects, v3 = looping over conditions):

num_subj: 10, num_cols_uW: 4

vers   mean    sd      min    max
v1     0.561   0.0336  0.528  0.612
v2     0.867   0.0542  0.805  0.940
v3     0.522   0.0316  0.490  0.565

num_subj: 100, num_cols_uW: 4

vers   mean    sd      min    max
v1     4.10    0.245   3.69   4.32
v2     6.42    0.386   5.76   6.77
v3     3.57    0.212   3.22   3.76

num_subj: 10, num_cols_uW: 16

vers   mean    sd      min    max
v1     12.3    0.414   11.8   12.9
v2     30.2    0.999   29.2   31.6
v3     12.1    0.419   11.7   12.7

num_subj: 100, num_cols_uW: 8

vers   mean    sd      min    max
v1     8.14    1.12    7.30   10.4
v2     15.9    2.23    14.3   20.4
v3     7.98    1.10    7.25   10.2


Attached is the model that profiles the 3 variants and R code for simulating data.
hwg_compare.stan (4.2 KB)
hwg_sim_and_fit.r (9.2 KB)
helper_functions.r (4.9 KB)

So the third option does:

transformed data{
  ...
  // uW_rows_repeated_per_subj: an array where each element is a row from uW repeated for num_subj rows
  matrix[num_subj,num_cols_uW] uW_rows_repeated_per_subj[num_rows_uW] ;
  for(row in 1:num_rows_uW){
    uW_rows_repeated_per_subj[row] = rep_matrix(uW[row],num_subj) ;
  }
}
...
transformed parameters{
  ...
  matrix[num_rows_uW,num_subj] value_for_subj_cond3 ;
  profile("v3"){
    // Loop over conditions to compute unique entries in the design matrix
    for(this_condition in 1:num_rows_uW){
      value_for_subj_cond3[this_condition] = transpose(
        rows_dot_product(
          subj_coef_
          , uW_rows_repeated_per_subj[this_condition]
        )
      ) ;
    }
  }
}


Which suggests a fourth variant where the entries in uW_rows_repeated_per_subj are row-bound together into a single matrix and, likewise, subj_coef_ is broadcast to repeat its entries num_rows_uW times, yielding the one-liner:

vector[num_rows_uW*num_subj] value_for_subj_cond4 ;
profile("v4"){
  value_for_subj_cond4 = rows_dot_product(
    subj_coef_[subj_index_for_uW_repeated_by_subj]
    , uW_rows_repeated_per_subj_as_matrix
  ) ;
}


I expected this would perform about the same as v3 (with the hope that the coming speedups @rok_cesnovar mentioned would make v4 faster than v3), but v4 actually runs substantially slower than v3. It’s not as slow as v2, but the slowdown is consistent and shows up mostly in the forward_time. So I guess the broadcasting to create a bigger matrix is (at present) more costly than a loop with no broadcasting. It’ll be interesting to see whether this difference persists once the new speedups arrive.
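To see why v4 computes the same quantities as v3, here’s a NumPy sketch of the flattening (the index vector and stacked matrix are reconstructed from the description above, so treat the exact names and the element ordering as assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
num_subj, num_cols_uW, num_rows_uW = 5, 3, 4
subj_coef_ = rng.normal(size=(num_subj, num_cols_uW))
uW = rng.normal(size=(num_rows_uW, num_cols_uW))

# row-bind the per-condition repeated blocks into one big matrix
uW_rows_repeated_per_subj_as_matrix = np.vstack(
    [np.tile(uW[r], (num_subj, 1)) for r in range(num_rows_uW)]
)

# hypothetical analogue of subj_index_for_uW_repeated_by_subj: cycles through
# all subjects once per condition (0-based here, 1-based in Stan)
subj_index = np.tile(np.arange(num_subj), num_rows_uW)

# the v4 one-liner: one big rows_dot_product over the stacked matrices
value_v4 = (subj_coef_[subj_index]
            * uW_rows_repeated_per_subj_as_matrix).sum(axis=1)

# matches v3's per-condition results, concatenated condition by condition
value_v3 = (uW @ subj_coef_.T).reshape(-1)
assert np.allclose(value_v4, value_v3)
```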

(attaching the updated scripts that include v4:
hwg_compare.stan (4.6 KB)
hwg_sim_and_fit.r (9.7 KB)
helper_functions.r (4.9 KB)
)