Dear Stanimals,
Apologies if this has been asked before, but I could not find an answer to my question in past topics. So here goes.
FYI: I’m using rstan v2.16.2 on a Mac with Rstudio + R v3.4.3.
I have a dataset with 15,000 records of overdispersed count data (parasite counts in 15,000 individual blood samples) from 173 communities, with unequal number of observations per community. My first goal is to efficiently fit a negative binomial distribution with unknown mean mu and shape phi to each of the 173 communities, i.e. resulting in a 173 estimates of mu and 173 estimates of phi. My question is about the approach to programming the calculation of the log-posterior of the count data, for which I see two approaches:
Approach 1) put the 15,000 individual count records in a one-dimensional array, ordered by community, and use sampling statements + vectorisation where possible in the model block:
int pos; // position of dynamic index
pos = 1;
for(i in 1:N_vill) {
segment(count, pos, N_obs_per_vill[i]) ~ neg_binomial_2_log(count_mu_log[i], count_phi[i]);
pos = pos + N_obs_per_vill[i];
}
// count is a one-dimensional data array (length 15K) of counts ordered by community
// N_vill is an data int with value 173
// N_obs_per_vill is a one-dimensional data array (length 173) of the number of count records per community
// count_mu_log and count_phi are parameter vectors (each length 173) to be estimated
Pros:
- easy to read
- sampling statements drop all constants from the log-posterior
Cons: - not very fast still (at least, I’m hoping approach 2 is faster, if I get it to work)
Approach 2): for each of the 173 communities, aggregate the count data into an array of unique count values a second array with frequencies of each unique count value in the original records, and a third array with the village number that each set of counts was observed in. This reduces the 15,000 records to a mere ~2,000 records. Then use the NB-PMF to calculate the log-posterior for each unique count value in each village and multiply by the frequency of that count, and add to the log-posterior:
target += count_frequency * neg_binomial_2_log_lpmf(count | count_mu_log[count_vill], count_k[count_vill]);
// count is a one-dimensional data int array (length 2K) of uniquely occuring counts in each community
// count_frequency is a one-dimensional data int array (length 2K) with the frequency of each unique count
// count_vill is a one-dimensional data int array (length 2K) with the village number that the count (one or more records) were observed in (values ranging from 1 to 173)
// count_mu_log and count_phi are parameter vectors (each length 173) to be estimated
Pros:
- fewer records to analyse, so hopefully faster
Cons: - the PDF and PMF functions calculate the full posterior density/mass and do not drop constants
- the code doesn’t work because Stan’s PMF and PDF functions return a single real, even when the input to the function is a vector. this makes weighting of the individual log-posterior masses/densities impossible, unless I manually cycle through each aggregated record. That, however, would break all the nice vectorisation that I had in mind.
I’m patiently taking approach 1) now, and happy to work with this. But I guess count models could in general benefit from an efficiency gain, much like a logistic regression model is more efficient when using a binomial PMF for aggregated records of number of positives and number of observed instead a bernoulli PMF for individual zeroes and ones. Stanimals, do you think that the second approach above would provide an efficiency gain, if the code worked, e.g. with updated version of the PMF function or even a sampling statement that allows an optional argument for frequency weights?
Best,
Luc