Sounds great. I am a bit confused about OpenMPI… reduce_sum does not use MPI at all, so the difference between runs 3 & 4 and runs 1 & 2 should just be noise? Or what did MPI actually do here?
It would be interesting to see scaling results. So, for model 1, do runs with 1, 2, 4, 6, 8, and 10 cores (or even more) and observe how much speedup you get relative to the 1-core runtime (ideally against a 1-core runtime that uses neither reduce_sum nor threading at all).
Yes, great idea. I will launch the multi-threaded test on the fastest models I obtained and add the results to the CSV file.
Btw, I got confused by your claim that reduce_sum does not use OpenMPI. As far as I understood, Stan uses MPI for within-chain parallelization, and within-chain parallelization is done either through reduce_sum or the map functions. Could you elaborate on this, please?
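For reference: reduce_sum parallelizes the partial sums with threads (via the Intel TBB backend, controlled by the number of threads you request), whereas map_rect is the facility in Stan that can additionally run over MPI. A minimal reduce_sum sketch, with illustrative function and variable names that are not from the models in the CSV:

```stan
functions {
  // Partial-sum function: reduce_sum slices y across threads and
  // calls this on each chunk [start, end].
  real partial_sum(array[] int y_slice, int start, int end,
                   vector logit_p) {
    return bernoulli_logit_lpmf(y_slice | logit_p[start:end]);
  }
}
data {
  int<lower=1> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  vector[N] logit_p;
}
model {
  logit_p ~ normal(0, 1);
  // grainsize = 1 leaves the chunking to the scheduler.
  target += reduce_sum(partial_sum, y, 1, logit_p);
}
```

None of this involves MPI: the threads share memory within one chain, which is also why the shared-vs-sliced argument distinction matters for copying overhead.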
Okay, nice. Actually I was surprised that the results using OpenMPI were only "slightly" better. But as you point out, it has to be noise.
Actually, one of the interesting things from this time comparison is that row vs column matrix indexing does not always make a difference, contrary to what I expected before launching this. In some settings column indexing is clearly better, while in other settings row indexing is either much better or slightly better. One could think the effect is masked by the memory overhead of copying a big set of parameters into the reduce_sum threads. However, the experiment partial_sum_SLICED_ARGS_logit_SHARED_ARGS_y (which passes the parameters as the sliced argument and the labels, which are data, as a shared argument) takes more time than the same model where row indexing is performed (partial_sum_SLICED_ARGS_logit_SHARED_ARGS_y_row_indexing_and_transposition). The only difference between the two is that the logit matrix is either indexed by rows and then transposed, or directly indexed by columns (which should be faster, since Stan matrices are stored column-major).
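To make the comparison concrete, here is a minimal sketch of the two indexing variants inside a partial-sum function (the names and the categorical likelihood are illustrative assumptions, not the exact models from the CSV):

```stan
functions {
  // Variant A: column indexing. Stan matrices are column-major,
  // so logit[, n] reads a contiguous column.
  real partial_sum_col(array[] int y_slice, int start, int end,
                       matrix logit) {
    real lp = 0;
    for (n in start:end)
      lp += categorical_logit_lpmf(y_slice[n - start + 1] | logit[, n]);
    return lp;
  }

  // Variant B: row indexing followed by transposition. logit_t holds
  // the same values with observations along the rows; each row access
  // strides across columns and the ' transposes it to a column vector.
  real partial_sum_row(array[] int y_slice, int start, int end,
                       matrix logit_t) {
    real lp = 0;
    for (n in start:end)
      lp += categorical_logit_lpmf(y_slice[n - start + 1] | logit_t[n]');
    return lp;
  }
}
```

Naively, variant A should win on memory locality, which is what makes the timings where variant B is faster surprising.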
Will elaborate a bit more on these claims once I upload everything to GitHub.
Indexing is a bit of a mystery in terms of what is fastest.
Have a look at the brms auto-generated code for models with random effects. You will be surprised by the many copies of the parameters that are made in order to loop over them… which is faster. I have myself once found that looping over real arrays can be faster than looping over vectors. I have not followed up on these things, but if you have the means… go for it.
The vignette in brms about within-chain parallelization is also a worthwhile read to get to know some details of reduce_sum.
brms = R package on CRAN, in case that is unfamiliar…
Thank you. I have already looked into some Stan code generated by brms, so I will take a closer look at these things.
On the other hand, the purpose of this exercise was to get familiar with Stan models and how to make them fast (I am new to Stan). So I probably won't follow up on these things in the medium term. But I will put all of this in an open-access repo for people to follow up on if they are interested.
Great. Skimming over the neural net model I see an fmax function… that is really bad for the performance of NUTS, which is based on gradients. Can you replace that with some continuous function, possibly?
Yeah, I saw that performance warning in the documentation of the fmax function. However, the implementation uses state-of-the-art activation functions commonly found in deep neural networks, which is why I coded it up with a rectified linear unit (ReLU) activation.
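If a differentiable drop-in for the ReLU is wanted, one common choice is the softplus, which Stan exposes as log1p_exp. A small sketch (the scaled variant with a sharpness parameter k is my own illustrative addition; as k grows it approaches fmax(0, x) while keeping the gradient smooth everywhere):

```stan
functions {
  // Smooth ReLU approximation: softplus_k(x) = log1p_exp(k * x) / k.
  // log1p_exp computes log(1 + exp(x)) in a numerically stable way
  // and is vectorized, so this applies elementwise to a whole layer.
  vector softplus(vector x, real k) {
    return log1p_exp(k * x) / k;
  }
}
```

Swapping fmax(0, x) for softplus(x, k) keeps the activation close to a ReLU but gives NUTS a continuous gradient through the kink at zero.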