Successfully implemented model, wondering if it can be optimized

I just implemented the following mixture model:

[image: the mixture model specification]

(small typo in the image: there's a logit link for the mu)

From the manual, I know QR decomposition on the linear predictors can make the code more efficient. I’d appreciate it if someone could take a look at my model specification and tell me whether there’s additional room for improvement.
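
For reference, the QR reparameterization from the manual looks roughly like this (just a sketch with generic names x, K, and theta for a design matrix, its number of columns, and the rotated coefficients; it's not my actual model):

transformed data {
  matrix[N, K] Q_ast = qr_thin_Q(x) * sqrt(N - 1);   // thin QR factor, rescaled as in the manual
  matrix[K, K] R_ast = qr_thin_R(x) / sqrt(N - 1);
  matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
  vector[K] theta;                                   // coefficients on the Q_ast scale
}
model {
  // use Q_ast * theta wherever x * beta would appear in the linear predictor
}
generated quantities {
  vector[K] beta = R_ast_inverse * theta;            // recover coefficients on the original scale
}

Here's my current model: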

data {
  int<lower=0> N;       // number of data points
  int<lower=1> I;       // number of individual covariates
  int<lower=1> G;       // number of group covariates
  
  int<lower=1> A[N];    // number of attempts
  int<lower=0> y[N];    // number of successes
  
  vector[I] xi[N];      // individual covariates
  vector[G] xg[N];      // group covariates
}
transformed data {
  real nil[N];
  real z1[N];
  real z3[N];
  
  nil = rep_array(0.5,N);
  for (n in 1:N) {
z1[n] = (1-2*min([nil[n],1.0*y[n]]));
z3[n] = (1-2*min([nil[n],1.0*(A[n]-y[n])]));
  }
}
parameters {
  real<lower=0> phi;  // dispersion betabin
  
  vector[I] beta;
  matrix[2,G] gamma;
}
transformed parameters {
  real<lower=0> rho;
  
  rho = 1/phi;
}
model {
  real mu;
  vector[3] pz;
  
  phi ~ normal(0, 10);
  beta ~ normal(0, 3);
  gamma[1] ~ normal(0, 3);
  gamma[2] ~ normal(0, 3);

  for (n in 1:N) {
mu = inv_logit(beta'*xi[n]);
pz = softmax([gamma[1]*xg[n],0,gamma[2]*xg[n]]');

target +=   log(
            z1[n]*pz[1] + z3[n]*pz[3] +
            pz[2]*exp(beta_binomial_lpmf(y[n]|A[n],mu*rho,(1-mu)*rho)));
  }
}
  • Don’t ever multiply by 1.0 — it doesn’t do anything
  • No tabs, only spaces (otherwise it won’t display correctly, as you see in the original post)

I reformatted and rolled up a bit of the code so I could read it.

data {
  int<lower=0> N;       // number of data points
  int<lower=1> I;       // number of individual covariates
  int<lower=1> G;       // number of group covariates
  int<lower=1> A[N];    // number of attempts
  int<lower=0> y[N];    // number of successes
  vector[I] xi[N];      // individual covariates
  vector[G] xg[N];      // group covariates
}
transformed data {
  real nil[N] = rep_array(0.5, N);
  real z1[N];
  real z3[N];
  for (n in 1:N) {
    z1[n] = 1 - 2 * min([ nil[n], y[n] ]);
    z3[n] = 1 - 2 * min([ nil[n], A[n] - y[n] ]);
  }
}
parameters {
  real<lower = 0> phi;
  vector[I] beta;
  matrix[2, G] gamma;
}
transformed parameters {
  real<lower = 0> rho = inv(phi);
}
model {
  phi ~ normal(0, 10);
  beta ~ normal(0, 3);
  gamma[1] ~ normal(0, 3);
  gamma[2] ~ normal(0, 3);
  for (n in 1:N) {
    real mu = inv_logit(beta' * xi[n]);
    vector[3] pz = softmax([gamma[1] * xg[n], 0, gamma[2] * xg[n] ]');
    target += log(pz[1] * z1[n]
                  + pz[2] * exp(beta_binomial_lpmf(y[n] | A[n], mu * rho, (1 - mu) * rho))
                  + pz[3] * z3[n]);
  }
}

The main thing you want to do for efficiency is vectorize operations to reduce the burden on autodiff. For example, make xi an N x I matrix and calculate everything you need up front with vectorized operations.

vector[I] beta;                                  // parameters, as before
matrix[N, I] xi;                                 // data: xi as a matrix rather than an array of vectors
vector[N] inv_logit_mu = inv_logit(xi * beta);
vector[N] y_alpha = rho * inv_logit_mu;
vector[N] y_beta = (1 - inv_logit_mu) * rho;
vector[N] z_pz_1 = to_vector(z1) .* pz1;         // with pz[1] likewise computed for all n as a vector pz1

and then use inv_logit_mu[n] in the loop. This is faster because the matrix multiply can be done with better memory locality and less copying than repeated extraction, transposition, and vector multiplication.
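
For instance, with those precomputed vectors, the loop body might look something like this (a rough sketch, reusing y_alpha and y_beta from above):

for (n in 1:N) {
  vector[3] pz = softmax([gamma[1] * xg[n], 0, gamma[2] * xg[n]]');
  target += log(pz[1] * z1[n]
                + pz[2] * exp(beta_binomial_lpmf(y[n] | A[n], y_alpha[n], y_beta[n]))
                + pz[3] * z3[n]);
}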

You might also want to put that last target increment on the log-sum-exp scale, where it'd be

target
  += log_sum_exp({ log(pz[1]) + log(z1[n]),
                   log(pz[2]) + beta_binomial_lpmf(y[n] | A[n], mu * rho, (1 - mu) * rho),
                   log(pz[3]) + log(z3[n]) });

That might be more stable if exponentiating the lpmf would otherwise underflow to zero or lose a lot of precision.

Then, there’s a more efficient and stable built-in function, log_softmax, that you can use to calculate the log(pz) terms.
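
Putting the two together, the body of the loop might look roughly like this (just a sketch):

for (n in 1:N) {
  real mu = inv_logit(beta' * xi[n]);
  vector[3] log_pz = log_softmax([gamma[1] * xg[n], 0, gamma[2] * xg[n]]');
  target += log_sum_exp({ log_pz[1] + log(z1[n]),
                          log_pz[2] + beta_binomial_lpmf(y[n] | A[n], mu * rho, (1 - mu) * rho),
                          log_pz[3] + log(z3[n]) });
}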

Finally, I’m not sure why you want to put a half normal prior on the precision: because of the inversion, that’s a zero-avoiding prior. Not that there’s anything intrinsically wrong with this, but we like to put priors on the scale because they’re easy to understand, and we use priors with support at zero because we often work in hierarchical settings and want to know whether the data’s consistent with complete pooling.
