rows_dot_product is intended as a shorthand rather than a compute optimization, right?
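For anyone following along: rows_dot_product(a, b) returns a vector whose i-th element is the dot product of row i of a with row i of b, i.e. a row-wise sum of the elementwise product. A minimal pure-Python sketch of that semantics (illustrative only, not the Stan implementation):

```python
def rows_dot_product(a, b):
    # Row-wise dot product: result[i] = dot(a[i], b[i]).
    return [sum(x * y for x, y in zip(row_a, row_b))
            for row_a, row_b in zip(a, b)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(rows_dot_product(a, b))  # [17.0, 53.0]
```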

So the third option does:

transformed data{
	...
	// uW_rows_repeated_per_subj: an array of matrices, each consisting of one row of uW repeated num_subj times
	matrix[num_subj,num_cols_uW] uW_rows_repeated_per_subj[num_rows_uW] ;
	for(row in 1:num_rows_uW){
		uW_rows_repeated_per_subj[row] = rep_matrix(uW[row],num_subj) ;
	}
}
...
transformed parameters{
	...
	matrix[num_rows_uW,num_subj] value_for_subj_cond3 ;
	profile("v3"){
		// Loop over conditions to compute unique entries in design matrix
		for(this_condition in 1:num_rows_uW){
			value_for_subj_cond3[this_condition] = transpose(
				rows_dot_product(
					subj_coef_
					, uW_rows_repeated_per_subj[this_condition]
				)
			);
		}
	}
}

Which suggests a fourth variant where the entries in uW_rows_repeated_per_subj are row-bound into a single matrix and, likewise, subj_coef_ is broadcast to repeat its entries num_rows_uW times, yielding the one-liner:

	vector[num_rows_uW*num_subj] value_for_subj_cond4 ;
	profile("v4"){
		value_for_subj_cond4 = rows_dot_product(
			subj_coef_[subj_index_for_uW_repeated_by_subj]
			, uW_rows_repeated_per_subj_as_matrix
		);
	}

I expected this to perform about the same as v3 (with the hope that the coming speedups @rok_cesnovar mentioned would make v4 faster than v3), but v4 actually runs substantially slower than v3. It's not as slow as v2, but the slowdown is consistent and shows up mostly in forward_time. So I guess broadcasting to create a bigger matrix is (at present) more costly than a loop with no broadcasting. It'll be interesting to see whether this difference persists once the new speedups arrive.
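To make the claimed v3/v4 equivalence concrete, here's a small pure-Python sketch. The variable names mirror the Stan code above, but the shapes are toy values and the ordering of subj_index (each block of subjects repeated once per uW row) is my assumption about how subj_index_for_uW_repeated_by_subj would be built:

```python
def rows_dot_product(a, b):
    # Row-wise dot product, as in the Stan function of the same name.
    return [sum(x * y for x, y in zip(ra, rb)) for ra, rb in zip(a, b)]

num_subj, num_rows_uW = 3, 2
subj_coef = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one row per subject
uW = [[1.0, 0.0], [1.0, 1.0]]                     # unique design-matrix rows

# v3: loop over conditions, one rows_dot_product per uW row
v3 = [rows_dot_product(subj_coef, [uW[cond]] * num_subj)
      for cond in range(num_rows_uW)]

# v4: one big rows_dot_product over the row-bound / broadcast matrices
subj_index = [s for cond in range(num_rows_uW) for s in range(num_subj)]
uW_stacked = [uW[cond] for cond in range(num_rows_uW) for _ in range(num_subj)]
v4 = rows_dot_product([subj_coef[s] for s in subj_index], uW_stacked)

flat_v3 = [x for row in v3 for x in row]
print(flat_v3 == v4)  # True: same values, just computed in one call
```

Same arithmetic either way; the only difference is whether the repetition happens inside a loop (v3) or up front via indexing into bigger matrices (v4), which is where the extra forward_time presumably goes.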

(attaching the updated scripts that include v4:
hwg_compare.stan (4.6 KB)
hwg_sim_and_fit.r (9.7 KB)
helper_functions.r (4.9 KB)
)