Much smaller benefits when using vectorized beta_binomial_lpmf compared to normal_lpdf

Guido_Biele · August 9, 2018, 9:18am

[Edit: removed use of custom function beta_binomial2_lpmf, results remain unchanged]

I am trying to speed up a hierarchical beta-binomial regression through vectorization, but see much smaller performance improvements compared to when I use vectorization to speed up a gaussian model.

Focusing on the relevant code sections from the model block, here is the original version of the gaussian model (from a slightly modified brms-generated model):

vector[N] mu = temp_Intercept + Xc * b;
vector[N] sigma = temp_sigma_Intercept + rep_vector(0, N);
for (n in 1:N) { 
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    sigma[n] += r_2_sigma_1[J_2[n]] * Z_2_sigma_1[n];
    sigma[n] = exp(sigma[n]); 
}
target += normal_lpdf(Y | mu, sigma);

and here is the vectorized version of the gaussian model, where N_1 is the umber of groups and grpsz, sidx, and eidx are vectors with group sizes, and the first and last row indices for a group:

for (j in 1:N_1) {
    vector[grpsz[j]] mu = temp_Intercept +
                          Xc[sidx[j]:eidx[j],] * b + r_1_1[j] + 
                          r_1_2[j] * Z_1_2[sidx[j]:eidx[j]];
    real sigma = exp(temp_sigma_Intercept + r_2_sigma_1[j]);
    target += normal_lpdf(Y[sidx[j]:eidx[j]] | mu, sigma);
  }

For a hierarchical gaussian regression with varying intercept, slope and error variance, (5 groups of 100 observations) the vectorized model samples 1.7 times faster then the original model (I ran each model 10 times with identical data and seeds). The speed gain grows as the number of observations per group increases.

Following the pattern above I also modified this original version of a beta-binomial regression (from brms):

vector[N] mu = temp_Intercept + Xc * b;
vector[N] phi = temp_phi_Intercept + rep_vector(0, N);
for (n in 1:N) { 
   mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
   phi[n] += r_2_phi_1[J_2[n]] * Z_2_phi_1[n];
   phi[n] = exp(phi[n]); 
   mu[n] = inv_logit(mu[n]); 
  target += beta_binomial_lpmf(Y[n] | mu[n]*phi[n], (1-mu[n])*phi[n], trials[n]);
}

to this vectorized version of a beta-binomial regression:

for (j in 1:N_1) {
  vector[grpsz[j]] mu = inv_logit(
                        temp_Intercept + 
                        Xc[sidx[j]:eidx[j]] * b + 
                        r_1_1[j] + r_1_2[j] * Z_1_2[sidx[j]:eidx[j]]);
  real phi = exp(temp_phi_Intercept + r_2_phi_1[j]);
  target += beta_binomial_lpmf(Y[sidx[j]:eidx[j]] | T , mu * phi, (1 - mu) * phi);
}

Differently than for the gaussian model, I don’t see any speed up for the vectorized compared to the original beta-binomial model.

Am I doing something wrong here, or maybe vecorization in stan for beta_binomial_lpmf is not implemented as efficiently as for normal_lpdf?

Thanks in advance!

PS: I tested this on R with Rtools 3.4.0, rstan 2.17.3, and brms 2.4.0. Here is the relevant section of the sesson info:

R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

stijn · August 9, 2018, 9:51am

Here is my guess:

The biggest advantage of vectorizing in Stan is that Stan knows how to take care of fixed elements in the vectorised probability distributions. In your gaussian example, the sigma is constant (in each group) and Stan can keep the auto-diff simpler than in the non-vectorized version. In the beta-binomial example, the parameters are mu * phi and (1 - mu) * phi which are vectors from Stan’s point of view. So there is no room for auto-diff optimisation. If I am correct, you will see the speed up when you use the beta_binomial2 parameterisation because now phi is fixed (within each group) and Stan can optimise.

Guido_Biele · August 9, 2018, 10:03am

What you wrote makes sense to me.

Also good that you mentioned beta_binomial2_lpmf: This is a custom function, not a standard function from stan. I’ll edit the original code to remove the use of the custom function, so as to not confuse others.

For reference, here is the custom function

real beta_binomial2_lpmf(int y, real mu, real phi, int T) {
  return beta_binomial_lpmf(y | T, mu * phi, (1 - mu) * phi);
}

Bob_Carpenter · August 20, 2018, 8:40am

That’s half the equation. The other half is that it builds a smaller expression graph and requires fewer virtual function calls.

stijn · August 20, 2018, 9:58am

That was what I meant by “Stan knows how to take care of …” but I could not remember “expression graph”. I am glad it still made sense at a very high level.

Bob_Carpenter · August 22, 2018, 9:25am

In case there’s any remaining confusion, there are a couple different things going on here.

Even if there’s no reduction in total number of floating point operations, vectorizing will reduce the size of the expression graph by including only a single link from the result to the operands in the expression graph. A sequence of += operations on target() will instead introduce an extra edge and node for each element of the vectorized call. This big savings is that each extra edge is an addition and memory chasing to call a virtual function. There’s only a single virtual function call with a vectorized form.
If the call is something like normal_lpdf(y | mu, sigma) and y is a vector and sigma is a scalar, then we can save computation by only computing -log(sigma) once. If we were even more clever, we’d multiply rather than adding for each element, but at least the addition is at the double level.

Guido_Biele · August 22, 2018, 10:45am

Thanks for the explanation Bob.
Regarding your first point: Does this mean that we should see faster sampling for the vectorized version of the beta-binomial model I described above, because it replaces a sequence (tens of thousands) of += calls with much fewer (a few hand sfull) calls?

Bob_Carpenter · August 22, 2018, 11:41am

Yes. But it’s not as big a speedup as not having to calculate log(sigma) each time when log(sigma) is by far the most costly evaluation in a normal density.

Topic		Replies	Views
Vectorized use of multi_normal_lpdf Modeling performance	3	459	February 5, 2023
Vectorized Implementations of Custom Distributions Modeling	2	358	May 12, 2021
Try to find a better way to speed up a hierarchical logistic model and reduce_sum only helped a little Modeling specification , performance , paralellization	5	610	May 18, 2021
Beta binomial truncated takes 1000x more time than the non truncated version Modeling fitting-issues	10	763	May 28, 2021
Correct vectorization in hierarchical model Modeling techniques , specification	9	547	April 14, 2020

Much smaller benefits when using vectorized beta_binomial_lpmf compared to normal_lpdf

Related topics