Much smaller benefits when using vectorized beta_binomial_lpmf compared to normal_lpdf



[Edit: removed use of custom function beta_binomial2_lpmf, results remain unchanged]

I am trying to speed up a hierarchical beta-binomial regression through vectorization, but see much smaller performance improvements compared to when I use vectorization to speed up a gaussian model.

Focusing on the relevant code sections from the model block, here is the original version of the gaussian model (from a slightly modified brms-generated model):

vector[N] mu = temp_Intercept + Xc * b;
vector[N] sigma = temp_sigma_Intercept + rep_vector(0, N);
for (n in 1:N) { 
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    sigma[n] += r_2_sigma_1[J_2[n]] * Z_2_sigma_1[n];
    sigma[n] = exp(sigma[n]); 
target += normal_lpdf(Y | mu, sigma); 

and here is the vectorized version of the gaussian model, where N_1 is the umber of groups and grpsz, sidx, and eidx are vectors with group sizes, and the first and last row indices for a group:

for (j in 1:N_1) {
    vector[grpsz[j]] mu = temp_Intercept +
                          Xc[sidx[j]:eidx[j],] * b + r_1_1[j] + 
                          r_1_2[j] * Z_1_2[sidx[j]:eidx[j]];
    real sigma = exp(temp_sigma_Intercept + r_2_sigma_1[j]);
    target += normal_lpdf(Y[sidx[j]:eidx[j]] | mu, sigma);

For a hierarchical gaussian regression with varying intercept, slope and error variance, (5 groups of 100 observations) the vectorized model samples 1.7 times faster then the original model (I ran each model 10 times with identical data and seeds). The speed gain grows as the number of observations per group increases.

Following the pattern above I also modified this original version of a beta-binomial regression (from brms):

vector[N] mu = temp_Intercept + Xc * b;
vector[N] phi = temp_phi_Intercept + rep_vector(0, N);
for (n in 1:N) { 
   mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
   phi[n] += r_2_phi_1[J_2[n]] * Z_2_phi_1[n];
   phi[n] = exp(phi[n]); 
   mu[n] = inv_logit(mu[n]); 
  target += beta_binomial_lpmf(Y[n] | mu[n]*phi[n], (1-mu[n])*phi[n], trials[n]);

to this vectorized version of a beta-binomial regression:

for (j in 1:N_1) {
  vector[grpsz[j]] mu = inv_logit(
                        temp_Intercept + 
                        Xc[sidx[j]:eidx[j]] * b + 
                        r_1_1[j] + r_1_2[j] * Z_1_2[sidx[j]:eidx[j]]);
  real phi = exp(temp_phi_Intercept + r_2_phi_1[j]);
  target += beta_binomial_lpmf(Y[sidx[j]:eidx[j]] | T , mu * phi, (1 - mu) * phi);

Differently than for the gaussian model, I don’t see any speed up for the vectorized compared to the original beta-binomial model.

Am I doing something wrong here, or maybe vecorization in stan for beta_binomial_lpmf is not implemented as efficiently as for normal_lpdf?

Thanks in advance!

PS: I tested this on R with Rtools 3.4.0, rstan 2.17.3, and brms 2.4.0. Here is the relevant section of the sesson info:

R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)


Here is my guess:

The biggest advantage of vectorizing in Stan is that Stan knows how to take care of fixed elements in the vectorised probability distributions. In your gaussian example, the sigma is constant (in each group) and Stan can keep the auto-diff simpler than in the non-vectorized version. In the beta-binomial example, the parameters are mu * phi and (1 - mu) * phi which are vectors from Stan’s point of view. So there is no room for auto-diff optimisation. If I am correct, you will see the speed up when you use the beta_binomial2 parameterisation because now phi is fixed (within each group) and Stan can optimise.


What you wrote makes sense to me.

Also good that you mentioned beta_binomial2_lpmf: This is a custom function, not a standard function from stan. I’ll edit the original code to remove the use of the custom function, so as to not confuse others.

For reference, here is the custom function

real beta_binomial2_lpmf(int y, real mu, real phi, int T) {
  return beta_binomial_lpmf(y | T, mu * phi, (1 - mu) * phi);