Measuring and comparing computational performance in Stan with different compilation alternatives: does using reduce_sum bring any advantage?

Sounds great. I am a bit confused about OpenMPI… reduce_sum does not use MPI at all, so is the difference between 3&4 and 1&2 just noise? Or what did MPI do here?

It would be interesting to see scaling results. For model 1, do runs with 1, 2, 4, 6, 8, 10 cores (or even more) and observe how much speedup you get relative to the 1-core runtime (ideally vs. a 1-core runtime where you use neither reduce_sum nor threading at all).

Yes, great idea. I will launch the multi-threaded tests on the fastest models I obtained and add them to the CSV file.

Btw, I got confused about your claim that reduce_sum does not use OpenMPI. As far as I understood, Stan uses MPI for within-chain parallelization, and within-chain parallelization is done either through reduce_sum or the map functions. Could you elaborate on this, please?

MPI is only used for map_rect and nothing else.

Reduce_sum can only take advantage of threads.
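For anyone following along, here is a minimal sketch of what thread-based reduce_sum looks like (illustrative names, not one of the models from the benchmark): the outcomes y are sliced across threads and the partial log-likelihoods are accumulated, with no MPI involved anywhere; map_rect is the only construct that can be backed by MPI.

```stan
functions {
  // Partial sum over a slice of the outcomes; reduce_sum picks the
  // slice boundaries (start, end) and runs slices on separate threads.
  real partial_sum(array[] int y_slice, int start, int end, vector logits) {
    return bernoulli_logit_lpmf(y_slice | logits[start:end]);
  }
}
data {
  int<lower=1> N;
  array[N] int<lower=0, upper=1> y;
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  // grainsize = 1 lets the scheduler choose the slice size.
  target += reduce_sum(partial_sum, y, 1, alpha + beta * x);
}
```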

Okay, nice. Actually I was surprised that the results using OpenMPI were only "slightly" better. But as you point out, it has to be noise.

Actually, one of the interesting findings from this time comparison is that row vs. column matrix indexing does not always make a difference, contrary to what I expected before launching this. In some settings column indexing is clearly better, while in other settings row indexing is either much better or only slightly better. One could think the effect is masked by the memory overhead of copying a big set of parameters into the reduce_sum thread. However, the experiment partial_sum_SLICED_ARGS_logit_SHARED_ARGS_y (which passes the parameters as the sliced argument and the labels, which are data, as a shared argument) takes more time than the same model where row indexing is performed (partial_sum_SLICED_ARGS_logit_SHARED_ARGS_y_row_indexing_and_transposition). The only difference between the two is that the logit matrix is either indexed by rows and then transposed, or directly indexed by columns (which should be faster).
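To make the comparison concrete, here is a rough sketch of the two access patterns (the real partial_sum functions live in the repository; these names and the categorical likelihood are just for illustration):

```stan
functions {
  // Column access: Stan matrices are column-major, so reading whole
  // columns should in principle touch contiguous memory.
  real sum_by_columns(matrix logits, array[] int y) {
    real lp = 0;
    for (n in 1:num_elements(y)) {
      lp += categorical_logit_lpmf(y[n] | col(logits, n));
    }
    return lp;
  }

  // Row access plus transposition: logits_t is the transposed matrix,
  // so each row is read and transposed back into a column vector.
  real sum_by_rows(matrix logits_t, array[] int y) {
    real lp = 0;
    for (n in 1:num_elements(y)) {
      lp += categorical_logit_lpmf(y[n] | logits_t[n]');
    }
    return lp;
  }
}
```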

I will elaborate a bit more on these claims once I upload everything to GitHub.


Indexing is a bit of a mystery in terms of what is fastest.

Have a look at the brms auto-generated code for models with random effects. You will be surprised by the many copies of the parameters that are made in order to loop over them… which is faster. I myself once found that looping over real arrays can be faster than looping over vectors, as in the sketch below. I have not followed up on these things, but if you have the means… go for it.
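As a concrete example of that last point, these two hypothetical helpers do the same accumulation, once over an array[] real and once over a vector; timing them against each other is the kind of comparison I mean:

```stan
functions {
  // Same element-wise accumulation over two container types; only the
  // declared type of z differs.
  real loop_over_array(array[] real z, real sigma) {
    real lp = 0;
    for (i in 1:num_elements(z)) {
      lp += normal_lpdf(z[i] | 0, sigma);
    }
    return lp;
  }

  real loop_over_vector(vector z, real sigma) {
    real lp = 0;
    for (i in 1:num_elements(z)) {
      lp += normal_lpdf(z[i] | 0, sigma);
    }
    return lp;
  }
}
```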

The vignette in brms about within-chain parallelization is also a worthwhile read to get to know some details of reduce_sum.

brms = R package on CRAN, in case that is unfamiliar…

Previously this was because indexing over arrays avoided an out-of-bounds check, but now we do that check there as well, so they should be the same speed.

Thank you. I have already looked into some Stan code generated by brms, so I will take a closer look at these things.

On the other hand, the purpose of doing these checks was getting familiar with Stan models and how to make them fast (I am new to Stan). So in the medium term I probably won't follow up on these things. But I will have all of this in an open-access repo for people to follow up on if they are interested.

@wds15 I have finally pushed everything to GitHub so that it does not get lost on my computer.

I have also added a hierarchical Bayesian neural net like the one used by Radford Neal in his PhD thesis, for anyone interested: Machine.Learning.Models.pytorch/Bayesian_Neural_Net_categorical_GLM_no_partial_sum.stan at master · jmaronas/Machine.Learning.Models.pytorch · GitHub

Great. Skimming over the neural net model I see an fmax function… that is really bad for the performance of NUTS, which is based on gradients. Can you possibly replace that with some continuous function?

Yeah, I saw that performance warning in the documentation of the fmax function. However, the implementation uses state-of-the-art activation functions commonly used in deep neural nets, which is why I coded it up with a rectified linear unit (ReLU) activation.

Why is that so bad for gradients? A couple of years ago I remember coding up some CUDA kernels for computing the forward and backward passes of fmax-type functions used in the aforementioned ReLU, and I remember performance wasn't that bad. Here is the code of the implementation: CULAYERS/gpu_kernels.cu at 2597fca8c0a93102ef0e259d306736e730a798a0 · jmaronas/CULAYERS · GitHub

fmax implies a discontinuity in the gradient. You should be able to replace it with a logit-type step function whose steepness you can tune.
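For example (a sketch, with a hypothetical steepness parameter k): log1p_exp is the softplus function log(1 + exp(x)), its gradient is inv_logit(k * x), i.e. exactly a logistic step whose steepness you can tune, and as k grows it approaches the hard fmax(0, x).

```stan
functions {
  // Smooth stand-in for the ReLU fmax(0, x): softplus with a
  // steepness parameter k. Its gradient is inv_logit(k * x), a
  // logistic step, and as k grows it approaches the hard fmax.
  vector softplus_relu(vector x, real k) {
    return log1p_exp(k * x) / k;
  }
}
```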