# Efficient count data regression with large datasets

Dear Stanimals,

Apologies if this has been asked before, but I could not find an answer to my question in past topics. So here goes.

FYI: I’m using rstan v2.16.2 on a Mac with RStudio + R v3.4.3.

I have a dataset with 15,000 records of overdispersed count data (parasite counts in 15,000 individual blood samples) from 173 communities, with unequal numbers of observations per community. My first goal is to efficiently fit a negative binomial distribution with unknown mean mu and shape phi to each of the 173 communities, i.e. resulting in 173 estimates of mu and 173 estimates of phi. My question is about how to program the calculation of the log-posterior of the count data, for which I see two approaches:

Approach 1) put the 15,000 individual count records in a one-dimensional array, ordered by community, and use sampling statements + vectorisation where possible in the model block:

```stan
int pos; // position of dynamic index
pos = 1;
for (i in 1:N_vill) {
  segment(count, pos, N_obs_per_vill[i]) ~ neg_binomial_2_log(count_mu_log[i], count_phi[i]);
  pos = pos + N_obs_per_vill[i];
}
// count is a one-dimensional data array (length 15K) of counts ordered by community
// N_vill is a data int with value 173
// N_obs_per_vill is a one-dimensional data array (length 173) of the number of count records per community
// count_mu_log and count_phi are parameter vectors (each length 173) to be estimated
```

Pros:

• sampling statements drop all constants from the log-posterior

Cons:

• still not very fast (at least, I’m hoping approach 2 is faster, if I get it to work)

Approach 2): for each of the 173 communities, aggregate the count data into an array of unique count values, a second array with the frequency of each unique count value in the original records, and a third array with the village number in which each set of counts was observed. This reduces the 15,000 records to a mere ~2,000 records. Then use the NB-PMF to calculate the log-posterior for each unique count value in each village, multiply by the frequency of that count, and add to the log-posterior:

```stan
target += count_frequency * neg_binomial_2_log_lpmf(count | count_mu_log[count_vill], count_phi[count_vill]);
// count is a one-dimensional data int array (length 2K) of uniquely occurring counts in each community
// count_frequency is a one-dimensional data int array (length 2K) with the frequency of each unique count
// count_vill is a one-dimensional data int array (length 2K) with the village number that the count (one or more records) was observed in (values ranging from 1 to 173)
// count_mu_log and count_phi are parameter vectors (each length 173) to be estimated
```

Pros:

• fewer records to analyse, so hopefully faster

Cons:

• the _lpdf and _lpmf functions compute the full log density/mass and do not drop constants
• the code doesn’t work because Stan’s _lpmf and _lpdf functions return a single real, even when the input to the function is a vector. This makes weighting the individual log masses/densities impossible, unless I manually cycle through each aggregated record. That, however, would break all the nice vectorisation I had in mind.

I’m patiently taking approach 1) for now, and happy to work with it. But I guess count models could in general benefit from an efficiency gain, much like a logistic regression model is more efficient when using a binomial PMF for aggregated records of numbers of positives and numbers of observations instead of a Bernoulli PMF for individual zeroes and ones. Stanimals, do you think the second approach above would provide an efficiency gain if the code worked, e.g. with an updated version of the PMF function or even a sampling statement that allows an optional argument for frequency weights?
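For reference, the Bernoulli-to-binomial aggregation mentioned above looks like this in a model block (a minimal sketch; `y`, `n_pos`, `N`, and `p` are assumed names, with `n_pos = sum(y)` precomputed as data):

```stan
model {
  // N individual 0/1 records sharing one probability p:
  // y ~ bernoulli(p);

  // Equivalent up to a constant: the number of successes is a
  // sufficient statistic, so one aggregated term suffices:
  n_pos ~ binomial(N, p);
}
```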

Best,

Luc

It certainly depends on how many redundant counts you have in your data. If, say, all counts were the same, then 2) would of course increase efficiency a lot compared to 1).

I currently don’t see that the vectorization of 2) you have in mind will improve efficiency by a lot, really. This is because in Stan vectorization is only beneficial if some terms of the PDF / PMF are shared across observations and can thus be reused. You can write your own custom PMF for this purpose, but since you only have 2,000 observations (of which only some share the same parameters) and a simple model, I don’t think it’s worth the time for now. If you wanted to go with 2), I would thus suggest writing a loop over those 2,000 counts, each time multiplying by the frequency. This is what is usually done when weighting observations.
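For concreteness, the weighted loop suggested here could look like the following in the model block (a sketch using the data arrays from approach 2, plus an assumed data int `N_agg` holding the number of aggregated records, ~2,000):

```stan
for (j in 1:N_agg) {
  // weight each unique count's log-mass by its frequency
  target += count_frequency[j]
            * neg_binomial_2_log_lpmf(count[j] | count_mu_log[count_vill[j]],
                                                 count_phi[count_vill[j]]);
}
```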


Thanks Paul for your thoughts. Will work with approach one for now.

With regard to vectorisation in approach 2: for each of the 173 villages there are on average about 12 unique counts (2,000/173), so I figured it would be more efficient to vectorise over those 12 values, which share the mu and phi parameters of the PMF. What would you say is the minimum number of observations worth vectorising over if they share common PMF parameters?

There is no general rule. It all depends on the complexity of the computations you can re-use and how much of a speed increase matters to you.

In your case, my intuition says that it is probably not worth going down this path unless a few seconds more or less spent during model fitting matter to you.


You might try INLA.

Yes, I echo what @paul.buerkner said - this will likely speed things up a lot. The NBD log-likelihood function is not that complicated, so you could easily write your own that is vectorized and drops the constants.
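A sketch of such a custom function, assuming eta = log(mu): the NB2 log-mass is lgamma(y + phi) - lgamma(phi) - lgamma(y + 1) + y*(log(mu) - log(mu + phi)) + phi*(log(phi) - log(mu + phi)), and only the lgamma(y + 1) term is constant in the parameters and can be dropped. The function name and signature below are hypothetical:

```stan
functions {
  // Weighted NB2 log-kernel: drops lgamma(y + 1), which does not
  // depend on the parameters, and folds in frequency weights w.
  real nb2_log_weighted(int[] y, vector w, vector eta, vector phi) {
    real lp = 0;
    for (j in 1:num_elements(y)) {
      real log_mu_phi = log_sum_exp(eta[j], log(phi[j])); // log(mu + phi)
      lp += w[j] * (lgamma(y[j] + phi[j]) - lgamma(phi[j])
                    + y[j] * (eta[j] - log_mu_phi)
                    + phi[j] * (log(phi[j]) - log_mu_phi));
    }
    return lp;
  }
}
```

In the model block this could then be called once over all aggregated records, e.g. `target += nb2_log_weighted(count, to_vector(count_frequency), count_mu_log[count_vill], count_phi[count_vill]);`.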


Aki has requested a version of the log probability functions that allows vector returns for vector inputs. The problem there is that it makes it impossible to share all the gradient information: you need to calculate the whole Jacobian rather than just a gradient, as when you return the sum.

Yeah, that’s a serious issue for the model and transformed parameters blocks, but I guess it would have the small benefit of avoiding loops in generated quantities when doing pointwise log-likelihood computations.