Measuring and comparing computational performance in Stan with different compilation alternatives: does using reduce_sum bring any advantage?

Sounds great. I am a bit confused about OpenMPI… reduce_sum does not use MPI at all, so is the difference between 3&4 and 1&2 just noise? Or what did MPI do here?

It would be interesting to see scaling results. For model 1, do runs with 1, 2, 4, 6, 8, 10 cores (or even more) and observe how much speedup you get relative to the 1-core runtime (ideally vs. a 1-core runtime where you use neither reduce_sum nor threading at all).

Yes, great idea. I will launch the multi-threaded tests on the fastest models I obtained and add them to the CSV file.

Btw, I got confused about your claim that reduce_sum does not use OpenMPI. As far as I understood, Stan uses MPI for within-chain parallelization, and within-chain parallelization is done either through reduce_sum or the map functions. Could you elaborate on this, please?

MPI is only used for map_rect and nothing else.

Reduce_sum can only take advantage of threads.
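For anyone following along, here is a minimal sketch of what thread-based reduce_sum looks like (illustrative names, not one of the models from the benchmark): the outcomes y are sliced across threads and the partial log-likelihoods are accumulated, with no MPI involved anywhere; map_rect is the only construct that can be backed by MPI.

```stan
functions {
  // Partial sum over a slice of the outcomes; reduce_sum picks the
  // slice boundaries (start, end) and runs slices on separate threads.
  real partial_sum(array[] int y_slice, int start, int end, vector logits) {
    return bernoulli_logit_lpmf(y_slice | logits[start:end]);
  }
}
data {
  int<lower=1> N;
  array[N] int<lower=0, upper=1> y;
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  // grainsize = 1 lets the scheduler choose the slice size.
  target += reduce_sum(partial_sum, y, 1, alpha + beta * x);
}
```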

Okay, nice. Actually I was surprised that the results using OpenMPI were only "slightly" better. But as you point out, it has to be noise.

Actually, one of the interesting findings from this time comparison is that row vs. column matrix indexing does not always make a difference, contrary to what I expected before launching this. In some settings column indexing is clearly better, while in other settings row indexing is either much better or only slightly better. One could think the effect is masked by the memory overhead of copying a big set of parameters into the reduce_sum thread. However, the experiment partial_sum_SLICED_ARGS_logit_SHARED_ARGS_y (which passes the parameters as the sliced argument and the labels, which are data, as a shared argument) takes more time than the same model where row indexing is performed (partial_sum_SLICED_ARGS_logit_SHARED_ARGS_y_row_indexing_and_transposition). The only difference between the two is that the logit matrix is either indexed by rows and then transposed, or directly indexed by columns (which should be faster).
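To make the comparison concrete, here is a rough sketch of the two access patterns (the real partial_sum functions live in the repository; these names and the categorical likelihood are just for illustration):

```stan
functions {
  // Column access: Stan matrices are column-major, so reading whole
  // columns should in principle touch contiguous memory.
  real sum_by_columns(matrix logits, array[] int y) {
    real lp = 0;
    for (n in 1:num_elements(y)) {
      lp += categorical_logit_lpmf(y[n] | col(logits, n));
    }
    return lp;
  }

  // Row access plus transposition: logits_t is the transposed matrix,
  // so each row is read and transposed back into a column vector.
  real sum_by_rows(matrix logits_t, array[] int y) {
    real lp = 0;
    for (n in 1:num_elements(y)) {
      lp += categorical_logit_lpmf(y[n] | logits_t[n]');
    }
    return lp;
  }
}
```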

I will elaborate a bit more on these claims once I upload everything to GitHub.


Indexing is a bit of a mystery in terms of what is fastest.

Have a look at the brms auto-generated code for models with random effects. You will be surprised by the many copies of the parameters that are made in order to loop over them… which is faster. I myself once found that looping over real arrays can be faster than looping over vectors, as in the sketch below. I have not followed up on these things, but if you have the means… go for it.
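As a concrete example of that last point, these two hypothetical helpers do the same accumulation, once over an array[] real and once over a vector; timing them against each other is the kind of comparison I mean:

```stan
functions {
  // Same element-wise accumulation over two container types; only the
  // declared type of z differs.
  real loop_over_array(array[] real z, real sigma) {
    real lp = 0;
    for (i in 1:num_elements(z)) {
      lp += normal_lpdf(z[i] | 0, sigma);
    }
    return lp;
  }

  real loop_over_vector(vector z, real sigma) {
    real lp = 0;
    for (i in 1:num_elements(z)) {
      lp += normal_lpdf(z[i] | 0, sigma);
    }
    return lp;
  }
}
```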

The vignette in brms about within-chain parallelization is also a worthwhile read to get to know some details of reduce_sum.

brms = R package on CRAN, in case that is unfamiliar…

Previously this was because indexing over arrays avoided an out-of-bounds check, but now we do that check there as well, so they should be the same speed.

Thank you. I have already looked into some Stan code generated by brms, so I will take a closer look at these things.

On the other hand, the purpose of doing these checks was getting familiar with Stan models and how to make them fast (I am new to Stan). So in the medium term I probably won't follow up on these things. But I will have all of this in an open-access repo for people to follow up on if they are interested.

@wds15 I have finally pushed everything to GitHub so that it does not get lost on my computer.

I have also added a hierarchical Bayesian neural net like the one used by Radford Neal in his PhD thesis, for anyone interested: Machine.Learning.Models.pytorch/Bayesian_Neural_Net_categorical_GLM_no_partial_sum.stan at master · jmaronas/Machine.Learning.Models.pytorch · GitHub

Great. Skimming over the neural net model I see an fmax function… that is really bad for the performance of NUTS, which is based on gradients. Can you possibly replace that with some continuous function?

Yeah, I saw that performance warning in the documentation of the fmax function. However, the implementation uses state-of-the-art activation functions commonly used in deep neural nets, which is why I coded it up with a rectified linear unit (ReLU) activation.

Why is that so bad for gradients? A couple of years ago I remember coding up some CUDA kernels for computing the forward and backward passes of fmax-type functions used in the aforementioned ReLU, and I remember performance wasn't that bad. Here is the code of the implementation: CULAYERS/gpu_kernels.cu at 2597fca8c0a93102ef0e259d306736e730a798a0 · jmaronas/CULAYERS · GitHub

fmax implies a discontinuity in the gradient. You should be able to replace it with a logit-type step function whose steepness you can tune.
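For example (a sketch, with a hypothetical steepness parameter k): log1p_exp is the softplus function log(1 + exp(x)), its gradient is inv_logit(k * x), i.e. exactly a logistic step whose steepness you can tune, and as k grows it approaches the hard fmax(0, x).

```stan
functions {
  // Smooth stand-in for the ReLU fmax(0, x): softplus with a
  // steepness parameter k. Its gradient is inv_logit(k * x), a
  // logistic step, and as k grows it approaches the hard fmax.
  vector softplus_relu(vector x, real k) {
    return log1p_exp(k * x) / k;
  }
}
```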