Does the transpose operator add to the AD tape?

mathDR · December 20, 2021, 3:00pm

If I declare a vector
vector[N] v;
and compute M * v for a given matrix M, does that put fewer entries on the AD tape than declaring
row_vector[N] rv;
and computing M * rv'?

That is, does transposing the row_vector add to the AD tape? And if so, is it much less efficient than just declaring a vector in the first place? (I can refactor my program to use vectors everywhere if so…)

Also, is there a centralized place to ask these types of questions, or a tag I can apply or search for? By "these types of questions" I mean generic speedup questions about how Stan computes forward and backward AD.

Thanks!

stevebronder · December 20, 2021, 4:29pm

No, it should not add to the AD tape.

rok_cesnovar · December 20, 2021, 5:30pm

Can confirm:

transformed data {
    int N = 50;
}
parameters {
    matrix[N, N] M;
    row_vector[N] rv;
    vector[N] v;
}
transformed parameters {
   vector[N] res1;
   vector[N] res2;
   profile("no-transpose") {
       res1 = M * v;
   }
   profile("transpose") {
       res2 = M * rv';
   }
}
model {
   // just some prior setting
   for(i in 1:N) {
       M[i, ] ~ std_normal();
   }
   rv ~ std_normal();
}

Produces:

name	total_time	forward_time	reverse_time	chain_stack	no_chain_stack	autodiff_calls	no_autodiff_calls
no-transpose	0.834101	0.424807	0.409294	70189	3509450	70189	1001
transpose	0.821333	0.401799	0.419534	70189	3509450	70189	1001

So the transpose adds no items to the chain stack or the no_chain stack; extra items there would have meant additional AD work.

mathDR · December 21, 2021, 4:12pm

Oh, this is clever. I should have thought to do this! Thanks @rok_cesnovar!

jtimonen · December 22, 2021, 3:11pm

Here the target log prob doesn’t depend on res1 or res2 at all, so why are there any autodiff calls for computing them?

rok_cesnovar · December 22, 2021, 3:20pm

Because every function or operator applied to parameters (stan::math::var in C++) is placed on the tape, and the reverse pass just iterates over the tape, calling chain() for each entry.
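
To make that concrete, here is a minimal Python sketch of a toy tape. It is not Stan Math's implementation: the Var class, the global tape list, and the grad function are hypothetical names made up for illustration, and the only assumption carried over from the reply above is that every operation on a parameter is recorded and that the reverse pass calls chain() on every recorded entry.

```python
# Toy reverse-mode AD tape (illustrative only, NOT Stan Math's implementation).
tape = []  # hypothetical global tape: nodes in creation order


class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.adj = 0.0            # adjoint, filled in by the reverse pass
        self.parents = parents    # tuples of (parent Var, local partial derivative)
        self.chain_calls = 0      # counter so we can see whether chain() ran
        tape.append(self)         # every Var goes on the tape, used or not

    def chain(self):
        # Propagate this node's adjoint to its parents.
        self.chain_calls += 1
        for parent, partial in self.parents:
            parent.adj += partial * self.adj

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))


def grad(target):
    # Reverse pass: walk the whole tape backwards, calling chain() on each node.
    target.adj = 1.0
    for node in reversed(tape):
        node.chain()


a = Var(2.0)
b = Var(3.0)
unused = a * b   # plays the role of res1: never feeds into target
target = a + b   # the only quantity we differentiate
grad(target)

print(len(tape))           # number of nodes recorded on the tape
print(unused.chain_calls)  # chain() still ran for the unused node
print(a.adj, b.adj)        # gradients of target with respect to a and b
```

In this toy, the unused product still sits on the tape and still gets a chain() call; its adjoint is zero, so it contributes nothing to the gradient, but the work is done anyway.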

jtimonen · December 22, 2021, 3:27pm

But not if the functions that take parameters as input are in generated quantities, right? So every transformation that is not needed for computing the target log prob should go in GQ.

rok_cesnovar · December 22, 2021, 4:28pm

Yeah, nothing is added to the AD tape or AD stacks in generated quantities.

Obviously, if res1 or res2 were used like this in an actual model, they should be in the GQ block. Putting them in transformed parameters here is just the easiest way to check how AD performs with different expressions without the hassle of generating a fake data set that works with the sampler. It is the simplest way of running a few thousand gradient evaluations without caring about sampler struggles.
