Our research group is attempting to estimate the abundance of wild fish passing a fixed location based (in part) on the number of fish captured each day. The number of fish counted per day depends on two things: the collection efficiency of the trap and the effective sample rate (a known fixed percentage per day). Of these fish, some will be wild and some will be from a hatchery. For the purpose of this question, assume that the hatchery fish are indistinguishable from the wild fish at the time of sampling (they all look alike).
However, we do have information on the number of hatchery fish released, and we have a working model that allows us to estimate the overall survival and daily arrival probabilities (the arrival probabilities sum to 1 across days) for each batch of hatchery fish that is released. This same model also gives us a lot of information on daily collection efficiency. We're now following the practice of expanding the model until it fits both datasets, so we are adding a model for the sample count data to our working model.
We can express the number of fish in the sample count as the sum of independent binomials with one of those binomials describing the wild fish contribution.
For example, when no hatchery fish are available (i.e. prior to any releases), we have sample_count[t] ~ binomial(M, omega[t] * collection_probability[t] * sample_rate[t]). This is the data augmentation trick, where M is an arbitrarily large number and omega[t] is a parameter to be estimated that allows us to get the daily abundance of wild fish; collection_probability[t] is a parameter that our existing model helps inform but which we expect to gain additional information from the sample counts; sample_rate[t] is a fixed percentage that we treat as data.
As hatchery fish become available with each release, we have sample_count[t] ~ binomial(M, omega[t] * collection_probability[t] * sample_rate[t]) + binomial(N_hatchery_release[1], S[1] * alpha[1, t] * collection_probability[t] * sample_rate[t]) + … + binomial(N_hatchery_release[G], S[G] * alpha[G, t] * collection_probability[t] * sample_rate[t]), where N_hatchery_release[g] is the number of fish in a given hatchery release, S[g] is the probability of survival for group g, and alpha[g, t] is the probability that a hatchery fish in group g arrives on day t. Again, our existing model gives us some pretty good information on these. (For the curious, a small subset of fish have electronic tags, which lets us get the time of passage for tagged fish that are collected. We've expanded the Cormack-Jolly-Seber model to account for travel time so we can estimate survival, collection, and arrival probabilities.)
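To write that out compactly (using n_t, p_t, r_t, and N_g as my shorthand here for sample_count[t], collection_probability[t], sample_rate[t], and N_hatchery_release[g]):

$$ n_t = W_t + \sum_{g=1}^{G} H_{g,t}, \qquad W_t \sim \mathrm{Binomial}(M,\ \omega_t\, p_t\, r_t), \qquad H_{g,t} \sim \mathrm{Binomial}(N_g,\ S_g\, \alpha_{g,t}\, p_t\, r_t), $$

with all terms independent; before the first release the sum over g is empty and only the wild term remains.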
Obviously, with Stan we need to marginalize out this sum of discrete parameters. Here is an algorithm to do that. I've coded this up in Stan (code is copied below). The problem is that, when tacked onto my existing model, the runtime increases drastically. It's running now, so I don't know exactly how long it will take, but the gradient evaluation went from on the order of a second or so to nine seconds. So that's roughly an 8-9x increase in runtime for a model that already takes anywhere between one and three days to run. Any suggestions to speed this up? I'm not sure I see an obvious way to vectorize.
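Concretely, for a day with observed count X, the marginalization is

$$ \Pr(\text{sample count} = X) = \sum_{i=0}^{X} \Pr\Big(\sum_{g} H_{g} = i\Big)\, \Pr(W = X - i), $$

where W and the H_g are the latent wild and per-group hatchery contributions to that day's count, and the distribution of the hatchery sum is built up by repeated pairwise convolution, $\Pr(H_1 + H_2 = i) = \sum_{j=0}^{i} \Pr(H_1 = j)\,\Pr(H_2 = i - j)$, with only sums up to X ever needed.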
Here is the code snippet for the portion of the model that implements the sum of binomials. For reference, the total number of days I have sample counts for in a given year is around 200. Daily sample counts can easily be on the order of 1000 or so, several days will be zeros, and the average daily count in a given year is around 120-150. That's a lot of for loops to run through!
EDIT: My code snippet was obviously wrong. What I want to wind up with, as the second-to-last step, is a vector that holds the log probability that the hatchery groups together contribute i fish to the sum. That is, if X fish are detected on day t, we want a vector of length X + 1 where element i holds the log probability that sum(H_1 + H_2 + … + H_G) = i - 1. Then we can match this to a vector that holds the log probability of the wild fish contribution, also of length X + 1, where element i holds the log probability that W = X - (i - 1). I have changed the code snippet below to correct this:
// Days before the first hatchery release: only wild fish can be in the sample.
for (t in 1:(first_hatch_rls_day - 1))
  n_unmarked[t] ~ binomial(M, p[t] * omega[t] * sample_rate[t]);

// Days with hatchery fish at large: marginalize over how the observed count
// splits between the wild contribution and the hatchery groups.
for (t in first_hatch_rls_day:T) {
  vector[n_unmarked[t] + 1] log_prob_h;  // element i + 1: log Pr(total hatchery contribution = i)
  vector[n_unmarked[t] + 1] log_prob_w;  // element i + 1: log Pr(wild contribution = n_unmarked[t] - i)

  if (hatch_grps_at_large[t] == 1) {
    // Single hatchery group: its binomial pmf is the hatchery distribution.
    for (i in 0:n_unmarked[t])
      log_prob_h[i + 1] = binomial_lpmf(i | hatch_N_rlsd[1],
                                        S_1[1] * alpha_1[1, t] *
                                        p[t] * sample_rate[t]);
  } else {
    // Two or more groups: build the distribution of the hatchery sum by
    // repeated pairwise convolution on the log scale.
    vector[n_unmarked[t] + 1] log_prob_y;
    vector[n_unmarked[t] + 1] log_prob_z;
    for (i in 0:n_unmarked[t]) {
      log_prob_y[i + 1] = binomial_lpmf(i | hatch_N_rlsd[1],
                                        S_1[1] * alpha_1[1, t] *
                                        p[t] * sample_rate[t]);
      // Stored in reverse order so the convolution below can use contiguous slices.
      log_prob_z[i + 1] = binomial_lpmf(n_unmarked[t] - i | hatch_N_rlsd[2],
                                        S_1[2] * alpha_1[2, t] *
                                        p[t] * sample_rate[t]);
    }
    for (i in 0:n_unmarked[t])
      log_prob_h[i + 1] = log_sum_exp(log_prob_y[1:(i + 1)] +
                                      log_prob_z[(n_unmarked[t] + 1 - i):(n_unmarked[t] + 1)]); // log Pr(hatchery sum = i)
    if (hatch_grps_at_large[t] > 2) {
      // Fold in the remaining groups one at a time.
      int where_are_we;
      where_are_we = 3;
      while (where_are_we <= hatch_grps_at_large[t]) {
        log_prob_y = log_prob_h;  // distribution of the sum of the groups folded in so far
        for (i in 0:n_unmarked[t])
          log_prob_z[i + 1] = binomial_lpmf(n_unmarked[t] - i | hatch_N_rlsd[where_are_we],
                                            S_1[where_are_we] *
                                            alpha_1[where_are_we, t] *
                                            p[t] * sample_rate[t]);
        for (i in 0:n_unmarked[t])
          log_prob_h[i + 1] = log_sum_exp(log_prob_y[1:(i + 1)] +
                                          log_prob_z[(n_unmarked[t] + 1 - i):(n_unmarked[t] + 1)]); // log Pr(hatchery sum = i)
        where_are_we = where_are_we + 1;
      }
    }
  }
  // Wild-fish contribution: element i + 1 holds log Pr(W = n_unmarked[t] - i).
  for (i in 0:n_unmarked[t])
    log_prob_w[i + 1] = binomial_lpmf(n_unmarked[t] - i | M,
                                      p[t] * omega[t] * sample_rate[t]);
  // Marginalize over the split: Pr(count) = sum_i Pr(hatchery sum = i) * Pr(W = count - i).
  target += log_sum_exp(log_prob_h + log_prob_w);
}
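For what it's worth, the pairwise convolution step could be pulled out into a function along these lines. This is just a sketch of the same operation for readability, not code from my model, and it assumes a Stan version that has reverse() for vectors:

functions {
  // Given two length-(X + 1) vectors where element i holds log Pr(count = i - 1),
  // return the length-(X + 1) vector whose element i holds
  // log Pr(sum of the two counts = i - 1), for sums 0..X.
  vector log_pmf_convolve(vector lp_a, vector lp_b) {
    int n = rows(lp_a);  // n = X + 1
    vector[n] lp_sum;
    for (i in 1:n)
      lp_sum[i] = log_sum_exp(lp_a[1:i] + reverse(lp_b[1:i]));
    return lp_sum;
  }
}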