Does the transpose operator add to the AD tape?

If I declare a vector
vector[N] v;
and compute M * v for a given matrix M, does that put less on the AD tape than declaring
row_vector[N] rv;
and computing M * rv'?

That is, does transposing the row_vector add to the AD tape? And if it does, is it much less efficient than declaring a vector in the first place? (I can refactor my program to use vectors everywhere if so.)

Also, is there a central place to ask questions like this, or a tag I can apply or search for? By "questions like this" I mean generic speedup questions about how Stan computes forward and backward AD.

Thanks!

No, it should not add to the AD tape.


Can confirm:

transformed data {
    int N = 50;
}
parameters {
    matrix[N, N] M;
    row_vector[N] rv;
    vector[N] v;
}
transformed parameters {
   vector[N] res1;
   vector[N] res2;
   profile("no-transpose") {
       res1 = M * v;
   }
   profile("transpose") {
       res2 = M * rv';
   }
}
model {
   // just some prior setting
   for(i in 1:N) {
       M[i, ] ~ std_normal();
   }
   rv ~ std_normal();
}

Produces:

name          total_time  forward_time  reverse_time  chain_stack  no_chain_stack  autodiff_calls  no_autodiff_calls
no-transpose  0.834101    0.424807      0.409294      70189        3509450         70189           1001
transpose     0.821333    0.401799      0.419534      70189        3509450         70189           1001

So the transpose puts no additional items on the chain stack or the no_chain stack. Additional items there would have meant additional AD work.


Oh, this is clever. I should have thought to do this! Thanks, @rok_cesnovar!


Here the target log prob doesn’t depend on res1 or res2 at all, so why are there any autodiff calls for computing them?

Because every function or operator applied to parameters (stan::math::var in C++) is placed on the tape, whether or not the target depends on its result, and the reverse pass just iterates over the tape calling chain() for each entry.

But not if the functions that take parameters as input are in generated quantities, right? So every transformation that is not needed for the target log prob computation should go in GQ.

Yeah, nothing is added to the AD tape or AD stacks in generated quantities.

Obviously, if res1 or res2 were used in an actual model, they should be in the GQ block. Putting them in transformed parameters here is just the easiest way to check how AD performs with different expressions without the hassle of generating a fake data set that works with the sampler. It is the simplest way of running a few thousand gradient evaluations without caring about how the sampler struggles.
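
For anyone who wants to see what that looks like, here is a minimal sketch of the same toy program with res1 and res2 moved to generated quantities. It is not from the posts above: passing N in as data and the prior on v are assumptions added only so the sketch stands on its own.

data {
  int<lower=1> N;             // assumption: N passed in as data
}
parameters {
  matrix[N, N] M;
  row_vector[N] rv;
  vector[N] v;
}
model {
  // just some prior setting, as in the profiling example
  to_vector(M) ~ std_normal();
  rv ~ std_normal();
  v ~ std_normal();           // assumption: prior on v, not in the original example
}
generated quantities {
  // Computed once per draw and not part of the gradient evaluation,
  // so neither product adds anything to the AD tape.
  vector[N] res1 = M * v;
  vector[N] res2 = M * rv';
}

The trade-off is that quantities defined here cannot be used in the model block, so this only applies to transformations the target does not depend on.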
