If I declare a vector, vector[N] v;, and compute M * v for a given matrix M, does that produce a shorter AD tape than declaring row_vector[N] rv; and computing M * rv'?
That is, does transposing the row_vector add to the AD tape? And if so, is this much less efficient than just starting from a vector in the first place? (I can refactor my program to use vectors everywhere if necessary.)
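For concreteness, here is a minimal sketch of the two formulations being compared (the names and blocks are illustrative, not from any particular model):

```stan
data {
  int<lower=1> N;
  matrix[N, N] M;
}
parameters {
  vector[N] v;       // variant 1: column vector, usable directly
  row_vector[N] rv;  // variant 2: row vector, must be transposed first
}
model {
  vector[N] res1 = M * v;    // no transpose needed
  vector[N] res2 = M * rv';  // does rv' add anything to the AD tape?
  // ...
}
```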
Also, is there a centralized place to ask these types of questions, or a tag I can apply or search for? By "these types of questions" I mean generic speedup questions about how Stan computes forward- and reverse-mode AD.
Yes, because all functions and operators whose inputs are parameters (stan::math::var in C++) are placed on the tape, and the reverse pass just iterates over the tape calling chain() on each entry.
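As an illustration of that rule (a sketch, assuming the usual Stan block semantics): the same expression costs tape space only when its arguments involve parameters. Operations on data are computed with plain doubles and never touch the tape.

```stan
data {
  int<lower=1> N;
  matrix[N, N] M;
  vector[N] x;
}
transformed data {
  vector[N] y = M * x;  // doubles only: computed once, nothing on the AD tape
}
parameters {
  vector[N] v;
}
model {
  vector[N] z = M * v;  // involves a parameter (var), so a matrix-vector
                        // multiply node goes onto the tape at every
                        // log-prob/gradient evaluation
  // ...
}
```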
But that doesn't happen if the functions taking parameters as input are called in generated quantities, right? So every transformation that is not needed to compute the target log probability should go in generated quantities.
Yeah, nothing is added to the AD tape or AD stacks in generated quantities.
Obviously, if res1 or res2 were used in an actual model, they should go in the generated quantities block. Putting them in the model is just the easiest way to check how AD performs with different expressions: it lets you run a few thousand gradient evaluations without having to hassle with generating a fake data set that works with the sampler, or caring whether the sampler struggles.
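A sketch of that benchmarking trick (assuming res1 and res2 name the two expressions under test; the priors and sizes here are arbitrary): fix the inputs in transformed data so no data file is needed, and keep the test expressions connected to target so their gradients are actually evaluated.

```stan
transformed data {
  int N = 100;
  matrix[N, N] M;
  for (i in 1:N)
    for (j in 1:N)
      M[i, j] = normal_rng(0, 1);  // arbitrary fixed matrix, no input data needed
}
parameters {
  vector[N] v;
  row_vector[N] rv;
}
transformed parameters {
  vector[N] res1 = M * v;    // expression under test, variant 1
  vector[N] res2 = M * rv';  // expression under test, variant 2
}
model {
  target += std_normal_lpdf(v);
  target += std_normal_lpdf(rv);
  // keep res1/res2 in the log-prob graph so their gradients are computed
  target += sum(res1) + sum(res2);
}
```

Running the sampler for a few hundred iterations then gives thousands of gradient evaluations of each variant, and the per-gradient timing can be compared across versions of the transformed parameters block.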