rstan recommends on its wiki to turn on -march=native.
If you're worried about that, start a new one.
Nope. We should try it and see what happens. The linked page says:
On the x86-64 architecture, SSE2 is generally enabled by default, but you can enable AVX and FMA for better performance
And I thought we were doing this:
On GCC and clang you can simply pass -march=native to let the compiler enables all instruction set that are supported by your CPU.
If not, we probably should!
I got nervous about this auto thing and finally sat down with a benchmarking library and figured out how to use it, partially by watching this pretty great talk: https://www.youtube.com/watch?v=nXaxk27zwlk
So I wrote some benchmarks with Google Benchmark and the tricks in the talk, and it looks like auto is only faster with matrices that are actually vectors, for some reason. Can anyone verify or comment on this? If not, I think I will remove the replacements of MatrixXd with auto from the PR.
I'm looking at the Eigen pitfalls page and seeing this under the section on C++11 and auto:
C++11 & auto
In short: do not use the auto keywords with Eigen's expressions, unless you are 100% sure about what you are doing. In particular, do not use the auto keyword as a replacement for a Matrix<> type. Here is an example: MatrixXd A, B; auto C = A*B; for(...) { ... w = C * v; ...}
In this example, the type of C is not a MatrixXd but an abstract expression representing a matrix product and storing references to A and B. Therefore, the product of A*B will be carried out multiple times, once per iteration of the for loop. Moreover, if the coefficients of A or B change during the iteration, then C will evaluate to different values.
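The behaviour described in that quote can be imitated without Eigen. The following is only a toy sketch: `Mat`, `LazyProduct`, and `eval` are hypothetical names made up for illustration, and Eigen's real product expressions are more sophisticated than this. It shows an object that stores references to its operands and recomputes coefficients on demand, next to a copy that was evaluated once.

```cpp
#include <cstddef>
#include <vector>

// Toy row-major square matrix; a stand-in for illustration, not Eigen's MatrixXd.
struct Mat {
    std::size_t n;
    std::vector<double> a;
    explicit Mat(std::size_t n_) : n(n_), a(n_ * n_, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return a[i * n + j]; }
    double operator()(std::size_t i, std::size_t j) const { return a[i * n + j]; }
};

// Hypothetical lazy product: holds references to both operands and computes
// a coefficient only when asked, loosely mimicking an expression template.
struct LazyProduct {
    const Mat& lhs;
    const Mat& rhs;
    double operator()(std::size_t i, std::size_t j) const {
        double s = 0.0;
        for (std::size_t k = 0; k < lhs.n; ++k) s += lhs(i, k) * rhs(k, j);
        return s;
    }
};

// Eager evaluation: compute every coefficient once and store the result.
Mat eval(const LazyProduct& p) {
    Mat out(p.lhs.n);
    for (std::size_t i = 0; i < out.n; ++i)
        for (std::size_t j = 0; j < out.n; ++j)
            out(i, j) = p(i, j);
    return out;
}
```

With `LazyProduct lazy{A, B};` and `Mat eager = eval(lazy);`, changing a coefficient of `A` afterwards changes what `lazy(i, j)` returns but leaves `eager` untouched, which is the "C will evaluate to different values" effect from the quote.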
I'm not an Eigen expert, but looking at your benchmark it looks similar. Not sure if you saw this or not, so throwing it out there.
Yep, I'm benchmarking auto vs. not using auto for specific operations and resulting use-cases. In the code I was looking at in the linked PR above, we're doing a multiply and transpose-ish and then copying the data out in a single loop. I was initially surprised that auto provided a 35% speedup on our sole performance test in Jenkins (the logistic regression one), so I dug in further. Seems like that's only true when we're dealing with Eigen Vectors, and not Eigen Matrices, for some reason. So now I'm wondering if anyone can rationalize that or knows more about it before I remove some of the updates in the PR.
Cool. I'll have to watch the talk. And learn what auto _ means.
That sure does.
Just think through what the template expressions are doing. When you have matrix times matrix and Eigen leaves it as an expression, it's an $\mathcal{O}(n)$ operation to grab a member, and there's no memory locality.
When you write it out to a base Matrix type, it gets evaluated once and copied. The copy is expensive, but cheaper than matrix transpose times matrix.
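To make the arithmetic concrete, here is a small counting sketch. It is not Eigen code: `product_coeff`, `madds_lazy`, and `madds_eager` are hypothetical helpers, and the "lazy" model assumes every coefficient read recomputes an n-term dot product, as described above. It counts multiply-adds when every coefficient of an n x n product is read `passes` times.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;  // row-major n x n matrix, element (i, j) at i * n + j

// One coefficient of a * b: an n-term dot product, so O(n) work per access.
double product_coeff(const Vec& a, const Vec& b, std::size_t n,
                     std::size_t i, std::size_t j, std::size_t& madds) {
    double s = 0.0;
    for (std::size_t k = 0; k < n; ++k) {
        s += a[i * n + k] * b[k * n + j];
        ++madds;
    }
    return s;
}

// Lazy model: every read of a coefficient recomputes it.
std::size_t madds_lazy(const Vec& a, const Vec& b, std::size_t n, std::size_t passes) {
    std::size_t madds = 0;
    double sink = 0.0;
    for (std::size_t p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                sink += product_coeff(a, b, n, i, j, madds);
    (void)sink;
    return madds;
}

// Eager model: evaluate the product once into c, then every read is a plain load.
std::size_t madds_eager(const Vec& a, const Vec& b, std::size_t n, std::size_t passes) {
    std::size_t madds = 0;
    Vec c(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            c[i * n + j] = product_coeff(a, b, n, i, j, madds);
    double sink = 0.0;
    for (std::size_t p = 0; p < passes; ++p)
        for (double x : c) sink += x;
    (void)sink;
    return madds;
}
```

Under this model a single pass costs n^3 multiply-adds either way, so any difference there has to come from allocation, copying, and memory access patterns. With several passes the lazy count grows linearly in the number of passes while the eager count stays at n^3.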
So to paraphrase, you're saying that the amount of work each is doing is the same, but an allocation + copy beats the lazy version because of memory locality? Or are you saying the lazy version actually also does more work?
Just trace the arithmetic and memory locality. When two matrices are multiplied, the first matrix is indexed by row N x N times and the second matrix is indexed by column N x N times. It's best to transpose the first matrix once if it's big enough not to fit in cache with the second matrix.
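A minimal sketch of that locality argument, assuming column-major storage (Eigen's default layout) and plain `std::vector` buffers. The function names are hypothetical, and this is not how Eigen's actual product kernels are written. In the naive version the inner loop walks a row of the first matrix with stride n; after transposing the first matrix once, both operands are read contiguously.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;  // column-major n x n matrix, element (i, j) at i + j * n

// Naive product: the inner loop reads a(i, k) with stride n (poor locality for large n).
Vec multiply_naive(const Vec& a, const Vec& b, std::size_t n) {
    Vec c(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (std::size_t k = 0; k < n; ++k) s += a[i + k * n] * b[k + j * n];
            c[i + j * n] = s;
        }
    return c;
}

// Transpose once: t(j, i) = a(i, j).
Vec transpose(const Vec& a, std::size_t n) {
    Vec t(n * n);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i) t[j + i * n] = a[i + j * n];
    return t;
}

// Same product, but the left operand is transposed first so that the inner
// loop reads both operands with stride 1.
Vec multiply_transposed_lhs(const Vec& a, const Vec& b, std::size_t n) {
    const Vec at = transpose(a, n);  // at(k, i) == a(i, k)
    Vec c(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (std::size_t k = 0; k < n; ++k) s += at[k + i * n] * b[k + j * n];
            c[i + j * n] = s;
        }
    return c;
}
```

Both functions perform the same multiply-adds and return the same matrix; only the order in which memory is touched differs, so any speed difference would have to be measured on matrices large enough not to fit in cache.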
Sorry, was asking to try to make sure of some pretty basic things: There are no algorithms by which multiplying an entire matrix at once is faster than figuring out each cell answer individually, right? Unless you're saying that when we do the entire operation at once, we can do that transpose first and get that locality, whereas when we go cell-by-cell (i.e. computing an answer for each $a_{i, j}$ individually) we can't do that?
If there is just memory locality to think about and we ignore the transpose thing, I'm still not sure why the allocation + holistic multiply would beat the cell-by-cell…
Exactly.
In that case, there shouldn't be any difference. By allocation plus holistic multiply, I'm assuming you mean allocation and assignment to resolve the expression template and then just ordinary matrix multiplication. When you do that, Eigen will transpose the left-hand side once for the purposes of memory locality. It should also do that if there's an expression template, but maybe it can't figure it out at that point.
Yeah, must not be able to figure it out. I might dive into the assembly to see if I can verify that on these simple benchmarks…