So the third option does:
transformed data{
...
//uW_rows_repeated_per_subj: an array where each element is a row from uW repeated for num_subj rows
matrix[num_subj,num_cols_uW] uW_rows_repeated_per_subj[num_rows_uW] ;
for(row in 1:num_rows_uW){
uW_rows_repeated_per_subj[row] = rep_matrix(uW[row],num_subj) ;
}
}
...
transformed parameters{
...
matrix[num_rows_uW,num_subj] value_for_subj_cond3 ;
profile("v3"){
// Loop over conditions to compute unique entries in design matrix
for(this_condition in 1:num_rows_uW){
value_for_subj_cond3[this_condition] = transpose(
rows_dot_product(
subj_coef_
, uW_rows_repeated_per_subj[this_condition]
)
);
}
}
}
Which suggests a fourth variant where the entries in uW_rows_repeated_per_subj
are row-bound together into a single matrix and ditto subj_coef_ is broadcast to repeat it’s entries num_rows_uW
times, yielding the one-liner:
vector[num_rows_uW*num_subj] value_for_subj_cond4 ;
profile("v4"){
value_for_subj_cond4 = rows_dot_product(
subj_coef_[subj_index_for_uW_repeated_by_subj]
, uW_rows_repeated_per_subj_as_matrix
);
}
I expected this would be about the same performance as v3 (with the hope that the coming speedups @rok_cesnovar mentioned would make v4 faster than v3), but v4 actually runs substantially slower than v3. It’s not as slow as v2, but still consistent and mostly in the forward_time
. So I guess the broadcasting to create a bigger matrix is more costly (at present) than a loop with no broadcasting. It’ll be interesting to see if this difference persists once the new speedups arrive.
(attaching the updated scripts that include v4:
hwg_compare.stan (4.6 KB)
hwg_sim_and_fit.r (9.7 KB)
helper_functions.r (4.9 KB)
)