Gradient Bottleneck: Vectorizing Summations

The Stan User’s Guide says that

for (n in 1:N)
  total += foo(n,...);

has to create intermediate representations for each of the N summands, whereas

{
  vector[N] summands;
  for (n in 1:N)
    summands[n] = foo(n,...);
  total = sum(summands);
}

gives more efficient algorithmic/automatic differentiation.

But is this true even if foo() is not adding to the log probability accumulator?

The Stan Functions Reference says:

each sampling statement ( ~ ) and log probability increment statement ( increment_log_prob ) adding to the accumulator…

(though I think increment_log_prob has been deprecated and is now target +=)

The Stan Functions Reference also says:

vectorized log probability functions are faster than their equivalent form defined with loops. This isn’t because loops are slow in Stan, but because more efficient automatic differentiation can be used. The efficiency comes from the fact that a vectorized log probability function only introduces one new node into the expression graph, thus reducing the number of virtual function calls required to compute gradients in C++, as well as from allowing caching of repeated computations.

The same logic should hold. I’m not sure about the practical performance difference you’d see.

We just put out a new parallel thing in 2.23 you might try: https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html

thanks!

what is the “same logic”?

is the second code block more efficient than the first, even if foo() does not add anything to the log probability?

It should be less virtual function calls, yeah, but I’m not convinced it would matter much.

target and total are both vars, there’s just extra guardrails on what you can do with target.

oh is total also the log probability accumulator (like target)? I thought total was just an arbitrary variable in the example.

Nono, total isn’t the target density – what I meant to say was the same logic applies.

target is just a special variable that you can only +=.

total would be the same kind of variable but it’s not restricted.

Hmm, then why would the second code block be more efficient than the first code block, if nothing is happening with the log probability (and hence, algorithmic/automatic differentiation)?

We’re still autodiffing through total.

Imagine if the code looked like this:

real total = 0.0;
for (n in 1:N)
  total += foo(n,...);
target += total;

thanks again !

To try to put it all together and see how this quote from The Stan Functions Reference applies:

The efficiency comes from the fact that a vectorized log probability function only introduces one new node into the expression graph, thus reducing the number of virtual function calls required to compute gradients in C++

  1. continual incrementation is slow because of many new nodes introduced into the expression graph ? even though the target += is outside of the loop?
real total = 0.0;
for (n in 1:N)
  total += foo(n,...);
target += total;
  1. vectorized summation is fast, because of only one new node introduced into the expression graph ?
vector[N] summands;
for (n in 1:N)
  summands[n] = foo(n,...);
total = sum(summands);
target += total;
  1. vectorized log probability function is fast, because of only one new node introduced into the expression graph ?
vector[N] x_beta_ll;
for (n in 1:N)
  x_beta_ll[n] = x[n] * beta[ll[n]];
target += bernoulli_logit_lpmf( y | x_beta_ll);      // i.e. y ~ bernoulli_logit(x_beta_ll);

I would say the difference between 1 and 2 would be small, but yeah, the direction is correct.

#3 yes, but most of the speed comes from saving in shared computations or in doing fancy gradients.

The fastest lpdfs are the glm ones, for instance: https://mc-stan.org/docs/2_23/functions-reference/bernoulli-logit-glm.html

is the difference between 1 and 2 due to differences in the number of nodes in the expression graph for computing gradients ?

Yes, that is the difference. Is there a particular section of a model you’re wanting to speed up?

I was trying to understand where it helps to pull things out of loops.

I’m still confused why 1 is slow even though the target += is outside of the loop.

And you said in 3 the “speed comes from saving in shared computations or in doing fancy gradients”, but this is true also for 2 ?

A sum is so simple that there aren’t really any opportunities to get fancy.

Behind the scenes the operation:

total += x

And:

target += x

are the same. target is just a specially named variable for the purposes of the language.

total would just be an intermediate quantity. Everything that goes into calculations that go into target += statements get autodiffed – not just things that touch target.

Read section 1.1 here: https://arxiv.org/pdf/1509.07164.pdf. I think that has enough information on how the autodiff works.

@shira: And you said in 3 the “speed comes from saving in shared computations or in doing fancy gradients”, but this is true also for 2 ?

@bbbales2: sum is so simple that there aren’t really any opportunities to get fancy.

Thanks again ! That makes sense, though The Stan User’s Guide gives Vectorizing Summations as the first example under Gradient Bottleneck.

1 Like

Aaah, I see. Simple operations benefit the least from this sorta thing, but they are also the easiest to explain, so I guess that’s why it’s first.

1 Like

got it ! I didn’t understand the connection from gradient to the summation (which at first didn’t seem directly connected to gradients) and whatever was contributing to speed-up. Thanks again for helping to clarify the User’s Guide. :)