Much smaller benefits when using vectorized beta_binomial_lpmf compared to normal_lpdf

[Edit: removed use of custom function beta_binomial2_lpmf, results remain unchanged]

I am trying to speed up a hierarchical beta-binomial regression through vectorization, but see much smaller performance improvements compared to when I use vectorization to speed up a gaussian model.

Focusing on the relevant code sections from the model block, here is the original version of the gaussian model (from a slightly modified brms-generated model):

vector[N] mu = temp_Intercept + Xc * b;
vector[N] sigma = temp_sigma_Intercept + rep_vector(0, N);
for (n in 1:N) { 
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    sigma[n] += r_2_sigma_1[J_2[n]] * Z_2_sigma_1[n];
    sigma[n] = exp(sigma[n]); 
target += normal_lpdf(Y | mu, sigma); 

and here is the vectorized version of the gaussian model, where N_1 is the umber of groups and grpsz, sidx, and eidx are vectors with group sizes, and the first and last row indices for a group:

for (j in 1:N_1) {
    vector[grpsz[j]] mu = temp_Intercept +
                          Xc[sidx[j]:eidx[j],] * b + r_1_1[j] + 
                          r_1_2[j] * Z_1_2[sidx[j]:eidx[j]];
    real sigma = exp(temp_sigma_Intercept + r_2_sigma_1[j]);
    target += normal_lpdf(Y[sidx[j]:eidx[j]] | mu, sigma);

For a hierarchical gaussian regression with varying intercept, slope and error variance, (5 groups of 100 observations) the vectorized model samples 1.7 times faster then the original model (I ran each model 10 times with identical data and seeds). The speed gain grows as the number of observations per group increases.

Following the pattern above I also modified this original version of a beta-binomial regression (from brms):

vector[N] mu = temp_Intercept + Xc * b;
vector[N] phi = temp_phi_Intercept + rep_vector(0, N);
for (n in 1:N) { 
   mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
   phi[n] += r_2_phi_1[J_2[n]] * Z_2_phi_1[n];
   phi[n] = exp(phi[n]); 
   mu[n] = inv_logit(mu[n]); 
  target += beta_binomial_lpmf(Y[n] | mu[n]*phi[n], (1-mu[n])*phi[n], trials[n]);

to this vectorized version of a beta-binomial regression:

for (j in 1:N_1) {
  vector[grpsz[j]] mu = inv_logit(
                        temp_Intercept + 
                        Xc[sidx[j]:eidx[j]] * b + 
                        r_1_1[j] + r_1_2[j] * Z_1_2[sidx[j]:eidx[j]]);
  real phi = exp(temp_phi_Intercept + r_2_phi_1[j]);
  target += beta_binomial_lpmf(Y[sidx[j]:eidx[j]] | T , mu * phi, (1 - mu) * phi);

Differently than for the gaussian model, I don’t see any speed up for the vectorized compared to the original beta-binomial model.

Am I doing something wrong here, or maybe vecorization in stan for beta_binomial_lpmf is not implemented as efficiently as for normal_lpdf?

Thanks in advance!

PS: I tested this on R with Rtools 3.4.0, rstan 2.17.3, and brms 2.4.0. Here is the relevant section of the sesson info:

R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Here is my guess:

The biggest advantage of vectorizing in Stan is that Stan knows how to take care of fixed elements in the vectorised probability distributions. In your gaussian example, the sigma is constant (in each group) and Stan can keep the auto-diff simpler than in the non-vectorized version. In the beta-binomial example, the parameters are mu * phi and (1 - mu) * phi which are vectors from Stan’s point of view. So there is no room for auto-diff optimisation. If I am correct, you will see the speed up when you use the beta_binomial2 parameterisation because now phi is fixed (within each group) and Stan can optimise.

What you wrote makes sense to me.

Also good that you mentioned beta_binomial2_lpmf: This is a custom function, not a standard function from stan. I’ll edit the original code to remove the use of the custom function, so as to not confuse others.

For reference, here is the custom function

real beta_binomial2_lpmf(int y, real mu, real phi, int T) {
  return beta_binomial_lpmf(y | T, mu * phi, (1 - mu) * phi);

That’s half the equation. The other half is that it builds a smaller expression graph and requires fewer virtual function calls.

That was what I meant by “Stan knows how to take care of …” but I could not remember “expression graph”. I am glad it still made sense at a very high level.

In case there’s any remaining confusion, there are a couple different things going on here.

  1. Even if there’s no reduction in total number of floating point operations, vectorizing will reduce the size of the expression graph by including only a single link from the result to the operands in the expression graph. A sequence of += operations on target() will instead introduce an extra edge and node for each element of the vectorized call. This big savings is that each extra edge is an addition and memory chasing to call a virtual function. There’s only a single virtual function call with a vectorized form.

  2. If the call is something like normal_lpdf(y | mu, sigma) and y is a vector and sigma is a scalar, then we can save computation by only computing -log(sigma) once. If we were even more clever, we’d multiply rather than adding for each element, but at least the addition is at the double level.

1 Like

Thanks for the explanation Bob.
Regarding your first point: Does this mean that we should see faster sampling for the vectorized version of the beta-binomial model I described above, because it replaces a sequence (tens of thousands) of += calls with much fewer (a few hand sfull) calls?

1 Like

Yes. But it’s not as big a speedup as not having to calculate log(sigma) each time when log(sigma) is by far the most costly evaluation in a normal density.