A more efficient implementation of the sum of independent binomials?

Our research group is attempting to estimate the abundance of wild fish passing a fixed location based (in part) on the number of fish captured each day. The number of fish counted per day depends on two things: the collection efficiency of the trap and the effective sample rate (a known, fixed percentage per day). Of these fish, some will be wild and some will be from a hatchery. For the purpose of this question, assume that the hatchery fish are indistinguishable from the wild fish at the time of sampling (they all look alike).

However, we do have information on the number of hatchery fish released, and we have a working model that allows us to estimate the overall survival and daily arrival probabilities (these sum to 1) for each batch of hatchery fish that is released. This same model also gives us a lot of information on daily collection efficiency. We're now following the practice of expanding the model until it fits both datasets, so we are adding a model for the sample count data to our working model.

We can express the number of fish in the sample count as the sum of independent binomials with one of those binomials describing the wild fish contribution.

For example, when no hatchery fish are available (i.e. prior to any releases), we have sample_count[t] ~ binomial(M | omega[t] * collection_probability[t] * sample_rate[t]). This is the data augmentation trick, where M is an arbitrarily large number and omega[t] is a parameter to be estimated that allows us to get the daily abundance of wild fish; collection_probability[t] is a parameter that our existing model helps inform but which we expect to gain additional information from the sample counts; sample_rate[t] is a fixed percentage that we treat as data.

As hatchery fish become available with each release, we have sample_count[t] ~ binomial(M | omega[t] * collection_probability[t] * sample_rate[t]) + binomial(N_hatchery_release[1] | S[1] * alpha[1,t] * collection_probability[t] * sample_rate[t]) + … + binomial(N_hatchery_release[G] | S[G] * alpha[G, t] * collection_probability[t] * sample_rate[t]). Here, N_hatchery_release[g] is the number of fish in a given hatchery release, S[g] is the probability of survival for group g, and alpha[g, t] is the probability that a fish in group g arrives on day t. Again, our existing model gives us some pretty good information on this. (For the curious, a small subset of fish have electronic tags, which allows us to get the time of passage for tagged fish that are collected. We've expanded the Cormack-Jolly-Seber model to account for travel time to estimate survival, collection, and arrival probabilities.)

Obviously, with Stan we need to marginalize out this sum of discrete parameters. Here is an algorithm to do that. I've coded this up in Stan (code is copied below). The problem is that when tacked onto my existing model, the runtime is drastically increased. It's running now, so I don't know exactly how long it will take, but the gradient evaluation went from on the order of a second or so to nine seconds. So a roughly 8 - 9x increase in runtime for a model that is already taking anywhere between a day and three days to run. Any suggestions to speed this up? I'm not sure I see an obvious way to vectorize.

Here is the code snippet for the portion of the code that implements the sum of binomials. For reference, the total number of days I have sample counts for in a given year is around 200. Sample count numbers can easily be on the order of 1000 or so, but several days will be zeros; the average count is 120 - 150 for a given year. That's a lot of for loops to run through!

EDIT: My code snippet was obviously wrong. What I want to wind up with as the second-to-last step is a vector that holds the log probability that all of the hatchery groups together contribute i to the sum. That is, if X fish are detected on day t, we want to wind up with a vector of length X + 1 where element i holds the log probability that sum(H_1 + H_2 + … + H_G) = i - 1. Then we can match this to a vector that holds the log probability of the wild fish contribution, which is a vector of length X + 1 where element i holds the log probability that W = X - (i - 1). I have changed the code snippet to correct this; the updated version is below.
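In symbols, writing C_t for the sample count on day t and X for its observed value, the quantity being computed is

$$
\Pr(C_t = X) \;=\; \sum_{i=0}^{X} \Pr(H_1 + \cdots + H_G = i)\,\Pr(W = X - i),
$$

with the hatchery term built up one group at a time by the discrete convolution

$$
\Pr(H_1 + \cdots + H_g = i) \;=\; \sum_{j=0}^{i} \Pr(H_1 + \cdots + H_{g-1} = j)\,\Pr(H_g = i - j).
$$

The code below implements this: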

for (t in 1:(first_hatch_rls_day - 1))
  n_unmarked[t] ~ binomial(M, p[t] * omega[t] * sample_rate[t]);

for (t in first_hatch_rls_day:T){
  // log_prob_h[i + 1] = log Pr(hatchery groups contribute i fish on day t)
  // log_prob_w[i + 1] = log Pr(wild fish contribute n_unmarked[t] - i fish on day t)
  vector[n_unmarked[t] + 1] log_prob_h;
  vector[n_unmarked[t] + 1] log_prob_w;
  
  if (hatch_grps_at_large[t] == 1){
  
    for (i in 0:n_unmarked[t])
      log_prob_h[i + 1] = binomial_lpmf(i | hatch_N_rlsd[1], 
                                        S_1[1] * alpha_1[1, t] * 
                                        p[t] * sample_rate[t]);
     
  } else {
    
    vector[n_unmarked[t] + 1] log_prob_y;
    vector[n_unmarked[t] + 1] log_prob_z;
    
    for (i in 0:n_unmarked[t]) {
      log_prob_y[i + 1] = binomial_lpmf(i | hatch_N_rlsd[1], 
                                      S_1[1] * alpha_1[1, t] * 
                                      p[t] * sample_rate[t]);
      log_prob_z[i + 1] = binomial_lpmf(n_unmarked[t] - i | hatch_N_rlsd[2], 
                                      S_1[2] * alpha_1[2, t] * 
                                      p[t] * sample_rate[t]);
    }
    
    for (i in 0:n_unmarked[t])
      log_prob_h[i + 1] = log_sum_exp(log_prob_y[1:(i + 1)] + 
                        log_prob_z[(n_unmarked[t] + 1 - i):(n_unmarked[t] + 1)]); //probability hatchery sum = i
  
    if (hatch_grps_at_large[t] > 2){
      int where_are_we;
      
      where_are_we = 3;
      
      while (where_are_we <= hatch_grps_at_large[t]){
        log_prob_y = log_prob_h;
        for (i in 0:n_unmarked[t])
          log_prob_z[i + 1] = binomial_lpmf(n_unmarked[t] - i | hatch_N_rlsd[where_are_we], 
                                            S_1[where_are_we] * 
                                            alpha_1[where_are_we, t] * 
                                            p[t] * sample_rate[t]);
                                          
        for (i in 0:n_unmarked[t]){
          log_prob_h[i + 1] = log_sum_exp(log_prob_y[1:(i + 1)] + 
                            log_prob_z[(n_unmarked[t] + 1 - i):(n_unmarked[t] + 1)]); //probability hatchery sum = i
       
        }
        
        where_are_we = where_are_we + 1;
      }      
    }
  }
  
  // wild-fish contribution: log Pr(W = n_unmarked[t] - i)
  for (i in 0:n_unmarked[t])
    log_prob_w[i + 1] = binomial_lpmf(n_unmarked[t] - i | M, 
                                      p[t] * omega[t] * sample_rate[t]);
                                      
  // marginalize over all ways the day-t count can split between hatchery and wild fish
  target += log_sum_exp(log_prob_h + log_prob_w);
}

While the change in efficiency does matter, you might want to look into the sampling parameters (e.g., what do you get for stepsize? what do you get for acceptance rates?), because unless your data is huge I don't think this model should be taking 1-3 days.

There are a bunch of things you can do to speed up this code. The basic principle is to avoid recomputation when results can be cached and re-used.

There are calculations like S_1[1] * alpha_1[1, t] inside the loops over i. These should be moved up to the t loop level, calculated once, and reused in each of the i loops. There are multiple cases of this.

The declarations of log_prob_y and log_prob_z should also move up to the t loop level, as they don't depend on i. That'll save a lot of reallocations and assignments.
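For instance, a rough sketch of what I mean (untested, reusing your variable names; the rest of the convolution stays as you have it):

for (t in first_hatch_rls_day:T) {
  vector[n_unmarked[t] + 1] log_prob_y;  // moved up from the else branch
  vector[n_unmarked[t] + 1] log_prob_z;  // moved up from the else branch
  real base_rate = p[t] * sample_rate[t];         // computed once per day
  real q1 = S_1[1] * alpha_1[1, t] * base_rate;   // computed once per day

  for (i in 0:n_unmarked[t])
    log_prob_y[i + 1] = binomial_lpmf(i | hatch_N_rlsd[1], q1);

  // ... rest of the convolution as before, with S_1[g] * alpha_1[g, t] * base_rate
  //     computed once per group rather than once per i
}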

Then you can start applying algebra.

In this loop:

for (i in 0:n_unmarked[t])
  log_prob_h[i + 1]
    = log_sum_exp(log_prob_y[1 : i + 1]
                   + log_prob_z[n_unmarked[t] + 1 - i
                                : n_unmarked[t] + 1]);

the log-sum-exp calls are doing a lot of redundant calculation, because log-sum-exp is associative and commutative and the code is continually looking at slices. The first principle is this:

log_sum_exp(a + b) = log_sum_exp(log_sum_exp(a), log_sum_exp(b))

That lets us tear apart the log_prob_y and log_prob_z terms. Then

log_sum_exp(a[1 : n + 1]) = log_sum_exp(log_sum_exp(a[1 : n]), a[n + 1])

will allow a running calculation of the term log_prob_y[1:(i + 1)] that will turn a quadratic algorithm into a linear algorithm.
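As a standalone sketch of that running accumulation (for a fixed vector a of size n; adapting it to your convolution, where the log_prob_z slice also shifts with i, takes a bit more bookkeeping):

// prefix_lse[i] holds log_sum_exp(a[1:i]), built in a single pass
vector[n] prefix_lse;
real acc = a[1];
prefix_lse[1] = acc;
for (i in 2:n) {
  acc = log_sum_exp(acc, a[i]);
  prefix_lse[i] = acc;
}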

Then, if you can vectorize some of the log_prob calculations, that'd be even better, but that'll take more work.

You might be able to use transformed data to reorganize so that you don’t need conditionals and can streamline some of the computation. That’s a drastic measure, but 9 days would warrant it.

If the collection probability is small compared to the number of fish passing, you could approximate binomial with Poisson, which is faster to compute for large numbers.
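A rough sketch of how that could look with the names from the snippet above (untested, and only an approximation when the per-fish capture probabilities are small and the release numbers large; since a sum of independent Poissons is again Poisson, the convolution over groups collapses to a single rate):

for (t in first_hatch_rls_day:T) {
  // expected wild contribution under the Poisson approximation
  real lambda = M * omega[t] * p[t] * sample_rate[t];
  // add the expected contribution of each hatchery group at large
  for (g in 1:hatch_grps_at_large[t])
    lambda += hatch_N_rlsd[g] * S_1[g] * alpha_1[g, t] * p[t] * sample_rate[t];
  n_unmarked[t] ~ poisson(lambda);
}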

Hi Bob,

Thanks for your suggestions. I will move redundant calculations to the highest level wherever it seems obvious. I will also try the recursive algorithm for binomial probabilities outlined in the paper I linked above, which might be faster.

I'm confused by your comment that log-sum-exp is associative and commutative. I feel like I'm missing something there. On the original scale, the order of operations would be to do multiplication before summation. On the log scale, doesn't this mean doing the sum of my vectors before the log-sum-exp?

You said that:

log_sum_exp(a + b) = log_sum_exp(log_sum_exp(a), log_sum_exp(b))

Ignoring the vectors for the moment and imagining that a and b are reals, the right-hand side of this equation reduces to:

log_sum_exp(log_sum_exp(a), log_sum_exp(b)) = log_sum_exp(a, b)

Are you saying that:

log_sum_exp(a + b) = log_sum_exp(a, b)

?

That expression is saying:

log(exp(a + b)) = log(exp(a) + exp(b))

But the left-hand side is multiplication (exp(a + b) = exp(a) * exp(b)) and the right-hand side is addition. These two can't be equal, right?

I’m not trying to be obnoxious, just trying to understand this concept because I feel like this part of my code is where things really slow down.

There is definitely a natural pattern here, and I think my code is sort of a boneheaded way to go about calculating the log probability.

I'm sort of spitballing here, but on my whiteboard I've sketched out the two elements I need to match up for each day t: the observed count in the sample tank (call it M) and a matrix holding the log binomial probabilities of each possible count for each group of fish. That matrix is size (M + 1) by (G + 1), where G is the number of hatchery groups at large (the extra column is the wild fish). What I want to do is turn this matrix into a vector that describes all the possible combinations of binomial draws, one per group, that sum to M. Then I can log_sum_exp that vector to get the log probability of the observed count. For example, the first G + 1 elements of this vector would each be the sum of a single element from the last row of the matrix and all the elements of the first row for the remaining columns (i.e. the entire observed count is attributed to a single group). I'm not sure I know any trick to do this, but if I take this approach my code would only need to populate the matrix of log probabilities once per t, and I could probably figure out a way to vectorize that step with a custom function.
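For the matrix-population step, something like this sketch (inside the day-t loop) is what I have in mind. It's illustrative only; lp[i + 1, g] would hold the log probability that hatchery group g contributes i fish on day t, and a last column for the wild fish could be filled the same way using M and omega[t]:

matrix[n_unmarked[t] + 1, hatch_grps_at_large[t]] lp;
for (g in 1:hatch_grps_at_large[t]) {
  real q = S_1[g] * alpha_1[g, t] * p[t] * sample_rate[t];  // computed once per group
  for (i in 0:n_unmarked[t])
    lp[i + 1, g] = binomial_lpmf(i | hatch_N_rlsd[g], q);
}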

Thanks @avehtari. It’s a good suggestion, but for some days we’re getting fairly high estimates of collection probability. The number passing on a day for any given group can actually be pretty small as well.


The existing model (a multistate model that extends Cormack-Jolly-Seber to account for time of passage) that I'm tacking this onto is what takes 1-3 days. For that I compute the log-likelihood of the observed capture history for each individually tagged fish. My data is huge (for the fish world). In some years we've got a half-million or more individually tagged fish. I've done a lot of work vectorizing this code and it's as fast as I can make it. I know I'm being vague, but that's because this is something I've been working on for a while with the intent of publishing a paper. I will share the code of that model once it gets a bit closer to publishing.

It’s a cutthroat world out there.


Not anywhere I’ve worked. I’ve always found it hard to get people to read things I write and then if they do, take them seriously; theft prevention has always been the least of my worries.


I think Krzysztof was referring to this kind of cutthroat: https://en.wikipedia.org/wiki/Cutthroat_trout


Same genus, different species. In this case we’re looking at Oncorhynchus tshawytscha, not clarkii.


I tend to agree with you. It’s not really theft prevention in this case though. The agency I work for has an internal policy review process and I’m limited in what I can share publicly prior to going through that.

In any case, Aki's suggestion was a very good one. Poisson was the way to go here. Although every time I use the Poisson to model counts of fish, I feel like I'm just making a bad pun that only mathematicians would get.


Somebody understands me! :P

I was definitely talking about the fish! :)


You could use the negative binomial if that would make you feel more comfortable.

We’re not supposed to talk about this joke, it’s like fight club.

Sorry I misunderstood—we had to go through patent review at Bell Labs, but they did it in parallel to submitting and didn’t review our emails (it was the 90s), so nothing really got held up. But I’ve never had an actual content editor (other than the odd NDA limiting what I could say).