Manually manipulating log likelihoods for duplicate observations

I have a long likelihood calculation for a couple of integer features, tx and x, that have a lot of duplicate observations. For example, I may have 1000 rows where tx=0 and x=0. I found that I can speed up the calculation a lot by computing the log likelihood once for each feature pattern (e.g., tx=0, x=0) and multiplying it by the number of observations with that pattern. This also requires shrinking the vector parameters p and theta by the same amount, so that they have one entry per feature pattern.
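Written out as a rough sketch, the speed-up relies on the identity below. The symbols \ell (the per-observation log likelihood), K (the number of distinct patterns), n_k (the count for pattern k, i.e. n_custs), and the tilde notation for pattern values are introduced here for illustration only. It assumes that all observations sharing a pattern also share a single value of theta and p.

```latex
\sum_{i=1}^{N} \ell(tx_i, x_i \mid \theta_i, p_i)
  = \sum_{k=1}^{K} n_k \, \ell(\tilde{tx}_k, \tilde{x}_k \mid \tilde{\theta}_k, \tilde{p}_k),
\qquad \sum_{k=1}^{K} n_k = N
```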

As an illustration, I converted something like the following model block that takes data with a large N:

model {
  p ~ beta(alpha, beta);       // declared as: vector<lower=0, upper=1>[N] p;
  theta ~ beta(gamma, delta);  // declared as: vector<lower=0, upper=1>[N] theta;
  
  for (n in 1:N) {  // where N is big
    real ll_lse;
    ll_lse = long_log_likelihood(tx[n], x[n], theta[n], p[n]);
    target += ll_lse;
  }
}

to (line ll_lse *= n_custs[n] added):

model {
  p ~ beta(alpha, beta);
  theta ~ beta(gamma, delta);
  
  // where now, N is small and sum(n_custs) == previous N
  for (n in 1:N) {
    real ll_lse;
    ll_lse = long_log_likelihood(tx[n], x[n], theta[n], p[n]);
    ll_lse *= n_custs[n];
    target += ll_lse;
  }
}

Although I verified that this block increments target by the same total amount in both models, I’m worried that I may not be fully accounting for the data reduction in the sampling statements for p and theta.

I tried changing p ~ beta(alpha, beta); to target += beta_lpdf(p | alpha, beta) .* n_custs;, but beta_lpdf returns a single real (the sum of the log densities over all elements of p) rather than a vector, so it cannot be weighted elementwise.

I can un-vectorize it in a for-loop:

for (n in 1:N) {
    target += beta_lpdf(p[n] | alpha, beta) * n_custs[n];
    target += beta_lpdf(theta[n] | gamma, delta) * n_custs[n];
}

but since I can’t print out the ‘target’ variable, I’m not sure if these programs are equivalent. Is there another factor I need to take into account for it to run on the reduced data?

You can always save and print a temporary variable so that’s one way to verify.

The current value of the accumulated log density is available as target().
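As a minimal sketch of that check, the program below accumulates the weighted prior terms from the question in a local variable, computes the same quantity a second way with the beta log density written out elementwise, and prints both next to target(). It assumes the hyperparameters alpha, beta, gamma, and delta are scalar data and that n_custs is an integer array. The data block, the locals w, lp_loop, and lp_vec, and the newer array syntax are illustrative choices, not taken from the original model, and the long_log_likelihood terms are left out.

```stan
data {
  int<lower=1> N;                  // number of distinct feature patterns
  array[N] int<lower=1> n_custs;   // observations sharing each pattern
  // Assumption: the hyperparameters are fixed scalars supplied as data.
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0> gamma;
  real<lower=0> delta;
}
parameters {
  vector<lower=0, upper=1>[N] p;
  vector<lower=0, upper=1>[N] theta;
}
model {
  vector[N] w = to_vector(n_custs);  // weights as a vector for dot_product
  real lp_loop = 0;
  real lp_vec;

  // Loop form from the question, accumulated in a local variable.
  for (n in 1:N) {
    lp_loop += n_custs[n] * beta_lpdf(p[n] | alpha, beta);
    lp_loop += n_custs[n] * beta_lpdf(theta[n] | gamma, delta);
  }

  // Same quantity with the beta log density written out elementwise.
  lp_vec = dot_product(w, (alpha - 1) * log(p) + (beta - 1) * log1m(p)
                          - lbeta(alpha, beta))
           + dot_product(w, (gamma - 1) * log(theta) + (delta - 1) * log1m(theta)
                            - lbeta(gamma, delta));

  print("target before: ", target());
  target += lp_loop;
  print("target after: ", target(),
        "  loop: ", lp_loop, "  vectorized: ", lp_vec);
}
```

The print statements run on every evaluation of the log density, so a short run is enough for this kind of check. When comparing totals, keep in mind that statements written with ~ may drop constant terms while target += with an _lpdf function keeps them, so both versions being compared should use the same form.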
