# Refactoring math with more general signatures

1. Reformat the library so a lot of the general functions can take both dense and sparse matrices.
2. Add duplicate code that specializes on sparse matrices.

There has been a lot of discussion on the refactor for (1) in the past (see this Google groups post and this issue). Dan points out that Eigen coefficient access doesn't work well with `MatrixBase`, though Ben points out that using `A.coeff()` for access instead of `operator()` would be sufficient to handle the coefficient access error Dan saw. There are a few ways to generalize the code, which I'll cover below.
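
Ben's point about `.coeff()` can be illustrated with a stand-alone CRTP mock (all names below are made up for illustration; this is not real Eigen): a read-only `coeff()` interface can be forwarded to any derived expression, including lazy ones with no storage of their own, which is what makes it a better fit for generic signatures than direct element access.

``````cpp
#include <cassert>
#include <utility>
#include <vector>

// Minimal CRTP mock of Eigen's pattern (hypothetical names, not Eigen's):
// coeff() is part of the read interface of *every* expression, while direct
// element access is only natural for types that actually own storage.
template <typename Derived>
struct MatBase {
  const Derived& derived() const { return static_cast<const Derived&>(*this); }
  double coeff(int i, int j) const { return derived().coeff_impl(i, j); }
};

// A dense matrix that owns its data (column-major, like Eigen's default).
struct Dense : MatBase<Dense> {
  int rows_, cols_;
  std::vector<double> data_;
  Dense(int r, int c, std::vector<double> d)
      : rows_(r), cols_(c), data_(std::move(d)) {}
  double coeff_impl(int i, int j) const { return data_[j * rows_ + i]; }
};

// A lazy scaled view with no storage of its own; it can still be read
// through coeff(), so generic code written against coeff() just works.
struct Scaled : MatBase<Scaled> {
  const Dense& m_;
  double s_;
  Scaled(const Dense& m, double s) : m_(m), s_(s) {}
  double coeff_impl(int i, int j) const { return s_ * m_.coeff(i, j); }
};

// Generic reduction written only against the coeff() read interface.
template <typename Derived>
double sum_coeffs(const MatBase<Derived>& A, int rows, int cols) {
  double total = 0;
  for (int j = 0; j < cols; ++j)
    for (int i = 0; i < rows; ++i)
      total += A.coeff(i, j);
  return total;
}
``````

Here `sum_coeffs` accepts both the storage-backed `Dense` and the storage-less `Scaled` through the same signature, which is the shape of generalization discussed below.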

TLDR; Using `Matrix`, `MatrixBase`, `EigenBase`, or `auto` almost always gives the same assembly output, so we should just use the one that gives us the nicest looking code. If we are careful with `auto`, we can and should use it for function signatures, since it performs better on clang 6.

## Eigen with `-O3`

Below is the assembly generated for a function that computes and returns `A + B * A * B`.

We will use the following signatures:

``````cpp
// Helper for return type
template <typename Derived>
using eigen_plain_t = typename Derived::PlainObject;

// Helper for Eigen dynamic matrices
template <typename T>
using eigen_mat_dyn = Eigen::Matrix<T, -1, -1>;

// With standard matrices
template <typename T1, typename T2>
inline eigen_mat_dyn<T1> add_mat(
    const eigen_mat_dyn<T1>& A,
    const eigen_mat_dyn<T2>& B) {
  eigen_mat_dyn<T1> C = A + B * A * B;
  return C;
}

// With MatrixBase
template <class Derived1, class Derived2>
inline eigen_plain_t<Derived1> add_mat_base(
    const Eigen::MatrixBase<Derived1>& A,
    const Eigen::MatrixBase<Derived2>& B) {
  auto C = A + B * A * B;
  return C;
}

// With EigenBase
template <class Derived1, class Derived2>
inline eigen_plain_t<Derived1> add_base(
    const Eigen::EigenBase<Derived1>& A_,
    const Eigen::EigenBase<Derived2>& B_) {
  // We have to pull out the Eigen derived types
  const Derived1& A = A_.derived();
  const Derived2& B = B_.derived();
  eigen_plain_t<Derived1> C = A + B * A * B;
  return C;
}

// With auto
template <typename T1, typename T2>
inline auto add_decl_ret(const T1& A, const T2& B) {
  auto C = A + B * A * B;
  return C;
}
``````
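
As a side note on the `EigenBase` variant: `Eigen::EigenBase` exposes almost no interface, which is why the derived types have to be pulled out before doing any arithmetic. The `.derived()` trick can be sketched stand-alone (hypothetical names, no Eigen dependency; `Num` is a toy scalar wrapper standing in for a matrix type):

``````cpp
#include <cassert>

// The CRTP base stores no data and exposes almost no interface, so a generic
// function taking BaseLike<D> must recover the derived type before computing.
template <typename D>
struct BaseLike {
  const D& derived() const { return static_cast<const D&>(*this); }
};

struct Num : BaseLike<Num> {
  double v;
  explicit Num(double x) : v(x) {}
  Num operator+(const Num& o) const { return Num(v + o.v); }
  Num operator*(const Num& o) const { return Num(v * o.v); }
};

// Same shape as the EigenBase variant above: accept the base, pull out the
// derived references, then compute A + B * A * B on the derived interface.
template <typename D1, typename D2>
D1 add_base_sketch(const BaseLike<D1>& A_, const BaseLike<D2>& B_) {
  const D1& A = A_.derived();
  const D2& B = B_.derived();
  return A + B * A * B;
}
``````

The downcast in `derived()` is free at runtime: the base subobject and the derived object are the same bytes, so the compiler sees straight through it, which is consistent with the identical assembly below.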

## Eigen Funcs on GCC 4.9 and `-O3`

### TLDR; All the above calls generate the same asm

Every method uses pretty similar asm, starting with the `Eigen::CommaInitializer` used to build the inputs: size checks that branch to `.L1149`, `.L1150`, and `.L1151`, followed by a call to the `Eigen::PlainObjectBase` constructor templated on the internal `scalar_sum_op` expression.

``````## Matrix
mov     rax, QWORD PTR [rsp+48]
cmp     rax, QWORD PTR [rsp+8]
jne     .L1149
mov     rdx, QWORD PTR [rsp+16]
cmp     rdx, QWORD PTR [rsp+40]
jne     .L1150
cmp     rax, rdx
mov     QWORD PTR [rsp+240], rsp
mov     QWORD PTR [rsp+248], rbx
mov     QWORD PTR [rsp+256], rsp
mov     QWORD PTR [rsp+264], rbx
jne     .L1151
lea     rsi, [rsp+240]
lea     rdi, [rsp+64]
call    Eigen::PlainObjectBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::PlainObjectBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Matrix<double, -1, -1, 0, -1, -1> const, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const> >(Eigen::DenseBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Matrix<double, -1, -1, 0, -1, -1> const, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const> > const&)
``````
``````## EigenBase
mov     rdx, QWORD PTR [rsp+48]
cmp     rdx, QWORD PTR [rsp+8]
jne     .L1149
mov     rax, QWORD PTR [rsp+16]
cmp     rax, QWORD PTR [rsp+40]
jne     .L1150
cmp     rdx, rax
mov     QWORD PTR [rsp+288], rsp
mov     QWORD PTR [rsp+296], rbx
mov     QWORD PTR [rsp+304], rsp
mov     QWORD PTR [rsp+312], rbx
jne     .L1151
lea     rsi, [rsp+288]
lea     rdi, [rsp+96]
call    Eigen::PlainObjectBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::PlainObjectBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Matrix<double, -1, -1, 0, -1, -1> const, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const> >(Eigen::DenseBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Matrix<double, -1, -1, 0, -1, -1> const, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const> > const&)
``````
``````## MatrixBase
mov     rdx, QWORD PTR [rsp+48]
cmp     rdx, QWORD PTR [rsp+8]
jne     .L1149
mov     rax, QWORD PTR [rsp+16]
cmp     rax, QWORD PTR [rsp+40]
jne     .L1150
cmp     rdx, rax
mov     QWORD PTR [rsp+336], rsp
mov     QWORD PTR [rsp+344], rbx
mov     QWORD PTR [rsp+352], rsp
mov     QWORD PTR [rsp+360], rbx
jne     .L1151
lea     rsi, [rsp+336]
lea     rdi, [rsp+128]
call    Eigen::PlainObjectBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::PlainObjectBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Matrix<double, -1, -1, 0, -1, -1> const, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const> >(Eigen::DenseBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Matrix<double, -1, -1, 0, -1, -1> const, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const> > const&)
``````
``````## Auto
mov     rax, QWORD PTR [rsp+48]
cmp     rax, QWORD PTR [rsp+8]
jne     .L1149
mov     rdx, QWORD PTR [rsp+16]
cmp     rdx, QWORD PTR [rsp+40]
jne     .L1150
cmp     rax, rdx
mov     QWORD PTR [rsp+192], rsp
mov     QWORD PTR [rsp+200], rbx
mov     QWORD PTR [rsp+208], rsp
mov     QWORD PTR [rsp+216], rbx
jne     .L1151
lea     rsi, [rsp+192]
lea     rdi, [rsp+160]
call    Eigen::PlainObjectBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::PlainObjectBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Matrix<double, -1, -1, 0, -1, -1> const, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const> >(Eigen::DenseBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Matrix<double, -1, -1, 0, -1, -1> const, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const> > const&)
``````
``````## Product With Addition
.L1151:
mov     ecx, OFFSET FLAT:Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Matrix<double, -1, -1, 0, -1, -1> const, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const>::CwiseBinaryOp(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const&, Eigen::internal::scalar_sum_op<double, double> const&)::__PRETTY_FUNCTION__
mov     edx, 110
mov     esi, OFFSET FLAT:.LC45
mov     edi, OFFSET FLAT:.LC46
call    __assert_fail
## Product Product
.L1150:
mov     ecx, OFFSET FLAT:Eigen::Product<Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0>::Product(Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const&, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&)::__PRETTY_FUNCTION__
mov     edx, 97
mov     esi, OFFSET FLAT:.LC4
mov     edi, OFFSET FLAT:.LC5
call    __assert_fail
## Product from a temporary
.L1149:
call    Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0> const Eigen::MatrixBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::operator*<Eigen::Matrix<double, -1, -1, 0, -1, -1> >(Eigen::MatrixBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&) const [clone .part.30]
mov     rdi, QWORD PTR [rsp+160]
mov     rbx, rax
call    free
``````

So, essentially we can pick whatever we want as the signature here (excluding the instances where the Eigen docs show that `auto` can segfault), and for GCC 4.9.3 we will get the same assembly.

On a side note, using a newer version of GCC gives the same asm for everything except `Matrix`, which requires two extra moves and an unpack. The other calls all give the same asm.

## Eigen Funcs on Clang 6 and `-O3`

### TLDR: All the asm is the same except for `auto`, which is actually inlined and a bit more optimized

This is way more complicated, but essentially everything is also the same; clang 6 just loves to duplicate code. The only interesting thing below is that the `auto` version is actually inlined while the others are forced to make a call.

``````## Matrix
Eigen::Matrix<double, -1, -1, 0, -1, -1> add_mat<double, double>(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&): # @Eigen::Matrix<double, -1, -1, 0, -1, -1> add_mat<double, double>(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&)
push    r14
push    rbx
sub     rsp, 56
mov     rax, rsi
mov     rbx, rdi
mov     rsi, qword ptr [rdx + 16]
cmp     rsi, qword ptr [rax + 8]
## .LBB*_10 are the same product
jne     .LBB1_10
mov     rcx, qword ptr [rax + 16]
cmp     rcx, qword ptr [rdx + 8]
## .LBB*_11 are all the same product
jne     .LBB1_11
mov     qword ptr [rsp + 16], rax
mov     qword ptr [rsp + 24], rdx
mov     qword ptr [rsp + 32], rax
mov     qword ptr [rsp + 40], rdx
cmp     rsi, rcx
## .LBB*_12 are all the same addition
jne     .LBB1_12
xorps   xmm0, xmm0
movups  xmmword ptr [rbx], xmm0
mov     qword ptr [rbx + 16], 0
test    rsi, rsi
## .LBB*_7 are all the same product
je      .LBB1_7
movabs  rax, 9223372036854775807
xor     edx, edx
idiv    rsi
cmp     rax, rsi
jl      .LBB1_5
``````
``````## EigenBase
Eigen::Matrix<double, -1, -1, 0, -1, -1>::PlainObject add_base<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1> >(Eigen::EigenBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&, Eigen::EigenBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&): # @Eigen::Matrix<double, -1, -1, 0, -1, -1>::PlainObject add_base<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1> >(Eigen::EigenBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&, Eigen::EigenBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&)
push    r14
push    rbx
sub     rsp, 56
mov     rax, rsi
mov     rbx, rdi
mov     rsi, qword ptr [rdx + 16]
cmp     rsi, qword ptr [rax + 8]
## .LBB*_10 are the same product
jne     .LBB2_10
mov     rcx, qword ptr [rax + 16]
cmp     rcx, qword ptr [rdx + 8]
## .LBB*_11 are all the same product
jne     .LBB2_11
mov     qword ptr [rsp + 16], rax
mov     qword ptr [rsp + 24], rdx
mov     qword ptr [rsp + 32], rax
mov     qword ptr [rsp + 40], rdx
cmp     rsi, rcx
## .LBB*_12 are all the same addition
jne     .LBB2_12
xorps   xmm0, xmm0
movups  xmmword ptr [rbx], xmm0
mov     qword ptr [rbx + 16], 0
test    rsi, rsi
## .LBB*_7 are all the same product
je      .LBB2_7
movabs  rax, 9223372036854775807
xor     edx, edx
idiv    rsi
cmp     rax, rsi
jl      .LBB2_5
``````
``````## MatrixBase
Eigen::Matrix<double, -1, -1, 0, -1, -1>::PlainObject add_mat_base<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1> >(Eigen::MatrixBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&, Eigen::MatrixBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&): # @Eigen::Matrix<double, -1, -1, 0, -1, -1>::PlainObject add_mat_base<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1> >(Eigen::MatrixBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&, Eigen::MatrixBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&)
push    r14
push    rbx
sub     rsp, 56
mov     rax, rsi
mov     rbx, rdi
mov     rsi, qword ptr [rdx + 16]
cmp     rsi, qword ptr [rax + 8]
## .LBB*_10 are the same product
jne     .LBB3_10
mov     rcx, qword ptr [rax + 16]
cmp     rcx, qword ptr [rdx + 8]
## .LBB*_11 are all the same product
jne     .LBB3_11
mov     qword ptr [rsp + 16], rax
mov     qword ptr [rsp + 24], rdx
mov     qword ptr [rsp + 32], rax
mov     qword ptr [rsp + 40], rdx
cmp     rsi, rcx
## .LBB*_12 are all the same addition
jne     .LBB3_12
xorps   xmm0, xmm0
movups  xmmword ptr [rbx], xmm0
mov     qword ptr [rbx + 16], 0
test    rsi, rsi
## .LBB*_7 are all the same product
je      .LBB3_7
movabs  rax, 9223372036854775807
xor     edx, edx
idiv    rsi
cmp     rax, rsi
jl      .LBB3_5
``````
``````## Auto is the only one that's inlined in main!?!?
mov     rsi, qword ptr [rsp + 32]
cmp     rsi, qword ptr [rsp + 56]
# Product
jne     .LBB0_76
mov     rax, qword ptr [rsp + 64]
cmp     rax, qword ptr [rsp + 24]
# Product
jne     .LBB0_77
cmp     rsi, rax
jne     .LBB0_78
mov     qword ptr [rsp + 184], rbx
mov     qword ptr [rsp + 192], r14
mov     qword ptr [rsp + 200], rbx
mov     qword ptr [rsp + 208], r14
xorps   xmm0, xmm0
movaps  xmmword ptr [rsp + 80], xmm0
mov     qword ptr [rsp + 96], 0
test    rsi, rsi
# Product
je      .LBB0_46
``````

## Conclusions

I think auto and generic templates are a good idea for a lot of the functions in the math library where the functions are simple, i.e. using

``````cpp
template <typename T1, typename T2, typename = enable_if_all_eigen<T1, T2>>
auto add(const T1& A, const T2& B) {
  return A + B;
}
``````

is fine. For the simpler things like addition, subtraction, and multiplication this will let the same function work for sparse and dense matrices, giving us the same performance on gcc and slightly better performance on clang. Though we should probably not use `auto` for variable declarations.
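
`enable_if_all_eigen` isn't spelled out above; here is one way such a guard could look (a sketch only: `FakeEigenBase` and `FakeMatrix` are stand-ins, not real Eigen types, and this assumes C++17 for the fold expression):

``````cpp
#include <cassert>
#include <type_traits>

// Stand-in for Eigen's CRTP base; in the real library this role is played
// by Eigen::EigenBase<Derived>.
template <typename Derived>
struct FakeEigenBase {};

struct FakeMatrix : FakeEigenBase<FakeMatrix> {
  double v;
};

// True iff T (after stripping cv/ref) derives from the CRTP base.
template <typename T>
struct is_eigen
    : std::is_base_of<FakeEigenBase<std::decay_t<T>>, std::decay_t<T>> {};

template <typename... Ts>
using enable_if_all_eigen = std::enable_if_t<(is_eigen<Ts>::value && ...)>;

// The guarded generic add: only participates in overload resolution when
// every argument is an "Eigen" type, so it cannot hijack unrelated calls.
template <typename T1, typename T2, typename = enable_if_all_eigen<T1, T2>>
double add(const T1& A, const T2& B) {
  return A.v + B.v;
}
``````

The SFINAE guard is what makes fully generic `const T1&` signatures safe to expose: without it, `add` would match any pair of types and produce confusing errors deep inside the body instead of a clean overload-resolution failure.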

The `auto` code looks really nice. Maybe make a WIP PR branch and replace a bunch of them and lets see if there is more there?

Just a general question: what is the benefit of using `EigenBase` compared to `MatrixBase`? I understand that by using `MatrixBase` we can use the same function with `Eigen::MatrixXd` and `Eigen::Map`, etc.
But what do we gain with `EigenBase`? I am an idiot, it's probably Eigen::Vector…

See the PR here!

> But what do we gain with `EigenBase`? I am an idiot, it's probably Eigen::Vector…

So in the Eigen docs here they say to use `EigenBase` to avoid making temporaries, but the results above show that compilers treat `EigenBase`, `MatrixBase`, and `Matrix` argument types the same.**

** In gcc 9.1, using `EigenBase` or `MatrixBase` does avoid one move and one unpack operation relative to using `Matrix`.

Great, let's discuss this at tomorrow's Math meeting. This seems like a refactor that could potentially benefit everyone, not just sparse matrices.

I’m all for this! We couldn’t really do this when we were C++03 because of the combinatorial explosion of function signatures, but we can now with `auto`!!!

We still have to be careful. I’m not certain that we’re going to get the correct results back out of Eigen. Unfortunately, all the different views / types do have different implementations. This caught us out before, but hopefully we can make some progress this time around. This would be so awesome for speed too if we’re able to propagate the expression templates forward.

Yes, let’s talk about it tomorrow!

We should be moving from `operator()` and `operator[]` (just vectors) to `.coeff`. It doesn't check ranges, but we do that everywhere ourselves, so the checks are (largely? totally?) redundant.

Cool! Presumably it's not the same code with `Matrix` and `MatrixBase` if you pass an expression template; I'd have thought `Matrix` would `.eval()` and the base would not.

It can be problematic, as they note in the Eigen doc, if you wind up saving something unevaluated rather than copied, since that can imply a lot more evaluation downstream.
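
The caution about saving unevaluated expressions (and about `auto` variable declarations) comes from how expression templates work. A toy mock (not Eigen, but real Eigen expressions behave analogously) shows that the expression object stores references to its operands and defers evaluation:

``````cpp
#include <cassert>

// Toy expression template: Sum stores references to its operands and
// evaluates lazily, which is exactly what makes `auto C = a + b;` risky
// as a variable declaration.
struct Vec {
  double v;
};

struct Sum {
  const Vec& a;
  const Vec& b;
  double eval() const { return a.v + b.v; }  // deferred evaluation
};

Sum operator+(const Vec& a, const Vec& b) { return Sum{a, b}; }
``````

Because a `Sum` holds references rather than values, it observes later mutations of its operands, and if an operand had been a temporary it would be left dangling. That is the segfault scenario the Eigen docs warn about.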

`add` can be a bit more general in that `T1` or `T2` can be scalars.

@stevebronder did you hit any problems with this? If you don't have the time for it I could tackle it. Regardless of sparse matrix support, I think this should improve performance of existing functionality.

No problems, but I did lose track of this. Help would def be appreciated! I have a PR I need to clean up in the stan repo that uses a bunch of this stuff.