In my testing I haven’t actually found the rows_dot_product version to be any faster when it comes to actually doing the computations, so I wanted to double-check that it isn’t intended to achieve any, simply be a slight shorthand. (Though, if so, it doesn’t even seem to be much less in terms of typing to actually use it [literally identical character count if you exclude spaces & newlines] and arguably somewhat obfuscates the model)

Yes, certainly best practice and facilitated by the awesome new profiling. I just wanted to check if there was existing knowledge from the devs whether superior compute performance was indeed expected for rows_dot_product given its original design/intent, in which case I’d do real benchmarking (c.f. the casual benchmarking I’ve done when simply comparing to 2-loops it particular applied contexts).

Hm, I’m not sure that’s possible, at least in the typical use case where P is an N\times J matrix of parameters, so not available in the transformed data section for pre-computation. Or am I missing something?

Ah, I can see that my attempt to reduce things to something minimal for easy visual comparison has obfuscated important structure. Here’s a full example, showing that you can’t just use the whole P matrix, but need to get a separate matrix per level of the N that we’re looping over. The model below uses more verbose-but-meaningful names for things, so in the verbiage there the subj_coef matrix (previously P) has a row for each subject containing coefficients, and the uW matrix (previously X) represents a contrast matrix with a column for each coefficient and the rows reflect all unique combinations of the levels of each contrast. So to use rows_dot_product for a given subject, you need to first repeat that subject’s coefficient row_vector to make a matrix with as many rows as there are rows in the uW matrix.

data{
// num_obs_total: number of trials
int<lower=1> num_obs_total ;
// obs: observation on each trial
vector[num_obs_total] obs ;
// num_subj: number of subj
int<lower=1> num_subj ;
// num_rows_uW: num rows in uW
int<lower=1> num_rows_uW ;
// num_cols_uW: num cols in uW
int<lower=1> num_cols_uW ;
// uW: unique entries in the within predictor matrix
matrix[num_rows_uW,num_cols_uW] uW ;
// index: index of each trial in flattened subject-by-condition value matrix
int obs_index[num_obs_total] ;
}
parameters{
// chol_corr: population-level correlations (on cholesky factor scale) amongst within-subject predictors
cholesky_factor_corr[num_cols_uW] chol_corr ;
// mean_coef: mean (across subj) for each coefficient
row_vector[num_cols_uW] mean_coef ;
// sd_coef: sd (across subj) for each coefficient
vector<lower=0>[num_cols_uW] sd_coef ;
// multi_normal_helper: a helper variable for implementing non-centered parameterization
matrix[num_cols_uW,num_subj] multi_normal_helper ;
// noise: measurement noise
real<lower=0> noise ;
}
model{
////
// Priors
////
// multi_normal_helper must have normal(0,1) prior for non-centered parameterization
to_vector(multi_normal_helper) ~ std_normal() ;
// relatively flat prior on correlations
chol_corr ~ lkj_corr_cholesky(2) ;
// independent (unpooled) normal(0,1) priors on all elements of sd_coef
sd_coef ~ std_normal() ;
// independent (unpooled) normal(0,1) priors on all elements of mean_coef
mean_coef ~ std_normal() ;
// low-near-zero prior on measurement noise
noise ~ weibull(2,1) ; // weibull(2,1) is peaked around .8
// compute coefficients for each subject/condition
matrix[num_subj,num_cols_uW] subj_coef = (
rep_matrix(mean_coef,num_subj)
+ transpose(
diag_pre_multiply(sd_coef,chol_corr)
* multi_normal_helper
)
) ;
// Loop over subj and conditions to compute unique entries in design matrix
matrix[num_rows_uW,num_subj] value_for_subj_cond ;
for(this_subj in 1:num_subj){
for(this_condition in 1:num_rows_uW){
value_for_subj_cond[this_condition,this_subj] = dot_product(
subj_coef[this_subj]
, uW[this_condition]
) ;
}
// // slightly less explicit but equally fast:
// value_for_subj_cond[,this_subj] = rows_dot_product(
// rep_matrix(
// subj_coef[this_subj]
// , num_rows_uW
// )
// , uW
// ) ;
}
// Likelihood
obs ~ normal(
to_vector(value_for_subj_cond)[obs_index]
, noise
) ;
}

wait, maybe if instead of looping over subjects and using rows_dot_product within a subject to get the values for all conditions implied by that subject’s coefficients and the contrasts embodied by the rows of uW, I could loop over conditions, permitting using the entire subj_coef matrix as is and pre-computing an array in transformed data with each entry consisting of a matrix of a row from uW repeated for num_subjects rows!

However, with stuff coming along in 2.27 this should get A LOT faster for some cases. Someone could also write a custom gradient that would provide a speedup for all cases.

Its a similar story with vectorized code: non-lpdf/lpmf functions do not have much speedup when used vectorized right now, but there is work in the pipeline that will improve performance significantly, very likely in 2.27.

FYI, this definitely is the fastest of the 3 variants. Working up a reprex to explore varying combos of num_subj and num_cols_uW, but it seems to be across the board.

transformed data{
...
//uW_rows_repeated_per_subj: an array where each element is a row from uW repeated for num_subj rows
matrix[num_subj,num_cols_uW] uW_rows_repeated_per_subj[num_rows_uW] ;
for(row in 1:num_rows_uW){
uW_rows_repeated_per_subj[row] = rep_matrix(uW[row],num_subj) ;
}
}
...
transformed parameters{
...
matrix[num_rows_uW,num_subj] value_for_subj_cond3 ;
profile("v3"){
// Loop over conditions to compute unique entries in design matrix
for(this_condition in 1:num_rows_uW){
value_for_subj_cond3[this_condition] = transpose(
rows_dot_product(
subj_coef_
, uW_rows_repeated_per_subj[this_condition]
)
);
}
}
}

Which suggests a fourth variant where the entries in uW_rows_repeated_per_subj are row-bound together into a single matrix and ditto subj_coef_ is broadcast to repeat it’s entries num_rows_uW times, yielding the one-liner:

I expected this would be about the same performance as v3 (with the hope that the coming speedups @rok_cesnovar mentioned would make v4 faster than v3), but v4 actually runs substantially slower than v3. It’s not as slow as v2, but still consistent and mostly in the forward_time. So I guess the broadcasting to create a bigger matrix is more costly (at present) than a loop with no broadcasting. It’ll be interesting to see if this difference persists once the new speedups arrive.