I am not sure if this is the right category to ask in; apologies if not. In the Stan Functions Reference it is written that vectorizations for exp and log are defined elementwise and thus provide only minimal efficiency speedups. I understand from this that such vectorizations are only for the convenience of not having to write a loop, but they do not involve SIMD instructions, which is how I usually understand the term vectorization. Is my understanding correct? For example, in Matlab, writing y = exp(x) is dramatically faster than looping over each element of x. Is there a reason for not vectorizing these functions in Stan, and are there any plans to do it?
This is because Matlab has much higher loop overhead than Stan. I haven't been following Matlab for a few years, but at that time SIMD was not yet used, and the speedup came purely from the extra overhead (memory use, type checks, etc.) that happens in every iteration of a for loop but only once in the vectorized form. The term vectorization is much older than the use of SIMD instructions. In Stan, using the array/vector/matrix argument is not faster unless some computations can be done only once, e.g. the common terms in lpdfs and lpmfs. There is a thread somewhere discussing potential additional speed-ups from SIMD.
As of version 2.22, SIMD vectorisation is used for exp and log (among others) for vectors/row-vectors/matrices of doubles (i.e., variables declared in the transformed data and generated quantities blocks, or passed in as data). These vectorised versions aren't (directly) called when autodiff variables (i.e., variables declared in the parameters block) are used, since both the value and the derivative need to be calculated for each variable, but they are used under the hood as part of computing those.
Note that this vectorisation isn't yet used for arrays, e.g. when calling y = exp(x) where x is declared as an array. But that is being implemented at the moment, and should be available in the next release.
My bad for introducing that term in the first place. I wasn't aware of its tighter usage in SIMD contexts; I should've used "elementwise" for what I meant by "vectorize".
Aside from semantics, as @bgoodri outlined, the problem we have is that any SIMD speedups from SSE, AVX, etc. aren't going to be the game changers they'd be if our data were stored in contiguous memory to begin with. Instead, we have to unpack our structures into sequences of values from pointers into potentially non-contiguous memory. After that, we can call Eigen's SIMD versions, as @andrjohns pointed out. I'm not sure how much of a speedup that produces end to end when applying exp() to a sequence.
The other thing we can do, but aren’t doing yet, is cut down on virtual function calls. If vectorization were implemented with adjoint-jacobian-apply, we could reduce everything to a single chain() call for each autodiff variable.
Thank you all for your replies and for clarifying. The reason I asked about the vectorization of log() and exp() is that I have a model where I define a density that involves looping over a vector and calling pow(x, y) on each element. The model spends quite a bit of time there, so I wondered whether it would benefit from removing the loop by writing exp(y * log(x)). From what you are saying, this would clearly not improve performance, and I suspect it could in fact be slower than the loop. Thanks again.
That, or at least the version doing elementwise multiplication with .* instead of *, requires three loops in the C++ code. It is essentially equivalent to
Eigen::Matrix<var, -1, 1> log_x(N), ylog_x(N), elog_x(N);
for (int n = 0; n < N; n++) log_x[n] = log(x[n]);
for (int n = 0; n < N; n++) ylog_x[n] = y[n] * log_x[n];
for (int n = 0; n < N; n++) elog_x[n] = exp(ylog_x[n]);