Gradient Bottleneck: Vectorizing Summations

shira · April 28, 2020, 5:20pm

for (n in 1:N)
  total += foo(n,...);

has to create intermediate representations for each of the N summands, whereas

{
  vector[N] summands;
  for (n in 1:N)
    summands[n] = foo(n,...);
  total = sum(summands);
}

gives more efficient algorithmic/automatic differentiation.

But is this true even if foo() is not adding to the log probability accumulator?

The Stan Functions Reference says:

each sampling statement ( ~ ) and log probability increment statement ( increment_log_prob ) adding to the accumulator…

(though I think increment_log_prob has been deprecated and is now target +=)

The Stan Functions Reference also says:

vectorized log probability functions are faster than their equivalent form defined with loops. This isn’t because loops are slow in Stan, but because more efficient automatic differentiation can be used. The efficiency comes from the fact that a vectorized log probability function only introduces one new node into the expression graph, thus reducing the number of virtual function calls required to compute gradients in C++, as well as from allowing caching of repeated computations.

bbbales2 · April 28, 2020, 7:10pm

The same logic should hold. I’m not sure about the practical performance difference you’d see.

We just put out a new parallel thing in 2.23 you might try: Reduce Sum: A Minimal Example

shira · April 28, 2020, 7:13pm

thanks!

what is the “same logic”?

is the second code block more efficient than the first, even if foo() does not add anything to the log probability?

bbbales2 · April 28, 2020, 7:18pm

It should be less virtual function calls, yeah, but I’m not convinced it would matter much.

target and total are both vars, there’s just extra guardrails on what you can do with target.

shira · April 28, 2020, 7:20pm

oh is total also the log probability accumulator (like target)? I thought total was just an arbitrary variable in the example.

bbbales2 · April 28, 2020, 7:27pm

Nono, total isn’t the target density – what I meant to say was the same logic applies.

target is just a special variable that you can only +=.

total would be the same kind of variable but it’s not restricted.

shira · April 28, 2020, 7:29pm

Hmm, then why would the second code block be more efficient than the first code block, if nothing is happening with the log probability (and hence, algorithmic/automatic differentiation)?

bbbales2 · April 28, 2020, 7:33pm

We’re still autodiffing through total.

Imagine if the code looked like this:

real total = 0.0;
for (n in 1:N)
  total += foo(n,...);
target += total;

shira · April 28, 2020, 8:07pm

thanks again !

To try to put it all together and see how this quote from The Stan Functions Reference applies:

The efficiency comes from the fact that a vectorized log probability function only introduces one new node into the expression graph, thus reducing the number of virtual function calls required to compute gradients in C++

continual incrementation is slow because of many new nodes introduced into the expression graph ? even though the target += is outside of the loop?

real total = 0.0;
for (n in 1:N)
  total += foo(n,...);
target += total;

vectorized summation is fast, because of only one new node introduced into the expression graph ?

vector[N] summands;
for (n in 1:N)
  summands[n] = foo(n,...);
total = sum(summands);
target += total;

vectorized log probability function is fast, because of only one new node introduced into the expression graph ?

vector[N] x_beta_ll;
for (n in 1:N)
  x_beta_ll[n] = x[n] * beta[ll[n]];
target += bernoulli_logit_lpmf( y | x_beta_ll);      // i.e. y ~ bernoulli_logit(x_beta_ll);

bbbales2 · April 28, 2020, 8:11pm

I would say the difference between 1 and 2 would be small, but yeah, the direction is correct.

#3 yes, but most of the speed comes from saving in shared computations or in doing fancy gradients.

The fastest lpdfs are the glm ones, for instance: https://mc-stan.org/docs/2_23/functions-reference/bernoulli-logit-glm.html

shira · April 28, 2020, 8:15pm

is the difference between 1 and 2 due to differences in the number of nodes in the expression graph for computing gradients ?

bbbales2 · April 28, 2020, 8:19pm

Yes, that is the difference. Is there a particular section of a model you’re wanting to speed up?

shira · April 28, 2020, 8:42pm

I was trying to understand where it helps to pull things out of loops.

I’m still confused why 1 is slow even though the target += is outside of the loop.

And you said in 3 the “speed comes from saving in shared computations or in doing fancy gradients”, but this is true also for 2 ?

bbbales2 · April 28, 2020, 9:37pm

A sum is so simple that there aren’t really any opportunities to get fancy.

Behind the scenes the operation:

total += x

And:

target += x

are the same. target is just a specially named variable for the purposes of the language.

total would just be an intermediate quantity. Everything that goes into calculations that go into target += statements get autodiffed – not just things that touch target.

Read section 1.1 here: https://arxiv.org/pdf/1509.07164.pdf. I think that has enough information on how the autodiff works.

shira · April 29, 2020, 12:44pm

@shira: And you said in 3 the “speed comes from saving in shared computations or in doing fancy gradients”, but this is true also for 2 ?

@bbbales2: sum is so simple that there aren’t really any opportunities to get fancy.

Thanks again ! That makes sense, though The Stan User’s Guide gives Vectorizing Summations as the first example under Gradient Bottleneck.

bbbales2 · April 29, 2020, 1:02pm

Aaah, I see. Simple operations benefit the least from this sorta thing, but they are also the easiest to explain, so I guess that’s why it’s first.

shira · April 29, 2020, 1:25pm

got it ! I didn’t understand the connection from gradient to the summation (which at first didn’t seem directly connected to gradients) and whatever was contributing to speed-up. Thanks again for helping to clarify the User’s Guide. :)

Topic		Replies	Views
Vectorization of target Modeling	1	307	August 10, 2020
How to vectorize a for loop in Stan? Modeling	2	352	April 6, 2023
Efficient code to sum two dependent random variables Modeling techniques	14	1033	July 19, 2023
Accumulated log probability with reduce_sum Modeling	2	289	November 13, 2020
Element wise power function for array or matrix Modeling	6	3017	November 8, 2018

Gradient Bottleneck: Vectorizing Summations

Related topics