Scaling of the overdispersion in negative binomial models

The negbinomial family allows a rate term to be specified in the model formula. However, for a negative binomial distribution, the rate should scale the shape (\phi) parameter as well as the mean.

For a Poisson distribution, the sum of N draws from a distribution with mean \mu has the same distribution as a single draw from a distribution with mean N\mu. For a negative binomial distribution with mean \mu and shape \phi, the sum of N draws has the same distribution as a single draw from a distribution with mean N\mu and shape N\phi.
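
A quick simulation sketch of this property, using base R's rnbinom with its mean and size arguments (the sample sizes and parameter values here are just for illustration):

set.seed(1)
N <- 10; mu <- 4.5; phi <- 3
# Sum of N draws from NB(mu, phi)...
summed <- replicate(1e5, sum(rnbinom(N, mu = mu, size = phi)))
# ...compared with single draws from NB(N * mu, N * phi)
direct <- rnbinom(1e5, mu = N * mu, size = N * phi)
c(mean(summed), mean(direct))  # both close to N * mu = 45
c(var(summed), var(direct))    # both close to N * mu + (N * mu)^2 / (N * phi) = 112.5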

Not accounting for this correctly is a frequent error in applying negative binomial models to data with unequal sample sizes.

In the code below, the issue is illustrated for a simple dataset. Samples are drawn from a negative binomial, and the parameters of the distribution are estimated by fitting a negative binomial to the full dataset, and to a dataset that has been aggregated (by summing in groups of ten). Note that we are using an integer N to scale the mean, but this could be a real number (for example, from counts taken over different durations).

library(brms)
library(data.table)
set.seed(5)
#Sample from a negative binomial with known parameters
r = 3
p = 0.4

#The distribution should have this mean and shape:
mu = r*(1-p)/p  #4.5
phi = r         #3

data = data.table(N=1, y=rnbinom(1000, size=r, prob=p), group=rep(seq(100), 10))

# Now fit a model to the full dataset, and to an aggregated dataset (aggregated into groups of 10):
b_1 = brm(y | rate(N)  ~ 1, family=negbinomial, data=data, save_model='nb_test.stan')
b_10 = brm(y | rate(N)  ~ 1, family=negbinomial, data=data[, .(.N, y=sum(y)), group])

#A little helper function to check that a value is within the 95% credible interval
between <- function(fit, value, parameter) {
  (value > posterior_summary(fit)[parameter, 'Q2.5']) &
    (value < posterior_summary(fit)[parameter, 'Q97.5'])
}

#The means from the full and aggregated datasets are as expected
stopifnot(between(b_1, log(mu), 'b_Intercept'))
stopifnot(between(b_10, log(mu), 'b_Intercept'))

# The shape from the full dataset is as expected:
stopifnot(between(b_1, phi, 'shape'))

# But the shape from the aggregated dataset gives an error
stopifnot(between(b_10, phi, 'shape'))
#Error: between(b_10, phi, "shape") is not TRUE
# In fact, `posterior_summary(b_10)['shape', 'Estimate']` is around 60, roughly
# ten times higher than it should be.

In the Stan model generated by brms, the contribution to the negative binomial log-likelihood is
target += neg_binomial_2_log_lpmf(Y | mu, shape);
however, I believe that it should be
target += neg_binomial_2_log_lpmf(Y | mu, exp(log_denom) * shape);

In the meantime, I have been handling this with a custom family. Putting this here in case it is helpful to anyone else who has this issue, or is working with negative binomial models applied to data with varying sample sizes.

negbinomial_rate <- custom_family(
    "negbinomial_rate",
    dpars = c("mu", "phi"),
    links = c("log", "identity"),
    lb = c(0, 0),
    type = "int",
    vars = "vreal1[n]"
)

stan_funs <- "
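    // V is the exposure passed in via vreal(N); it scales both the mean and the shape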
    real negbinomial_rate_lpmf(int y, real mu, real phi, real V) {
        return neg_binomial_2_lpmf(y | mu * V, phi * V);
    }
    int negbinomial_rate_rng(real mu, real phi, real V) {
        return neg_binomial_2_rng(mu * V, phi * V);
    }
"
stanvars <- stanvar(scode = stan_funs, block = "functions")
b_custom = brm(y | vreal(N) ~ 1, family=negbinomial_rate, 
      data=data[, .(.N, y=sum(y)), group], stanvars=stanvars)

#This now passes the test without any trouble!
stopifnot(between(b_custom, log(mu), 'b_Intercept'))
stopifnot(between(b_custom, phi, 'phi'))
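
If you need posterior predictions from the custom family before defining the usual post-processing methods, they can be generated by hand from the posterior draws. A minimal sketch, assuming the draws have columns named b_Intercept and phi, and predicting for a hypothetical exposure N_new (a name I have made up):

post <- as.data.frame(b_custom)  # posterior draws of b_Intercept and phi
N_new <- 10                      # exposure (number of units of effort) to predict for
y_rep <- rnbinom(nrow(post),
                 mu   = exp(post$b_Intercept) * N_new,
                 size = post$phi * N_new)
quantile(y_rep, c(0.025, 0.5, 0.975))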

  • Operating System: Ubuntu 18.04
  • brms Version: 2.13.0

Thank you! Can you point to some reference where this is explained in more detail just for me to check?

You should definitely check! Unfortunately, I don’t have a handy reference; it came out of me trying to puzzle out the various parameterisations of the negative binomial. I have some notes that give more detail, which I will polish up a bit and share; that should make it clearer.

Thank you! That would help!

Here are my notes on the parameterisation of the Stan negative binomial models, including some discussion of how negative binomial distributions are combined:
negative-binomial.pdf (from https://github.com/dragonfly-science/negative-binomial). The motivation for this was figuring out (and checking) how the different parameterisations relate to one another.

The short story is that, in the size and probability parameterisation, the sum of draws from two distributions \mathrm{NB}(r, p) and \mathrm{NB}(s, p) is a draw from the distribution \mathrm{NB}(r+s, p). There is a derivation of this here: https://math.stackexchange.com/questions/1054048/negative-binomial-distribution-sum-of-two-random-variables
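
As a quick empirical check of that identity (a sketch, with arbitrary parameter values):

set.seed(2)
r <- 3; s <- 5; p <- 0.4
x <- rnbinom(1e5, size = r, prob = p) + rnbinom(1e5, size = s, prob = p)
y <- rnbinom(1e5, size = r + s, prob = p)
round(rbind(sum_of_draws = c(mean = mean(x), var = var(x)),
            single_draw  = c(mean = mean(y), var = var(y))), 2)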

The parameterisation used in Stan has \mu = r(1-p)/p and \phi = r, so the sum of draws from \mathrm{NB}(\mu_r, \phi_r) and \mathrm{NB}(\mu_s, \phi_s) (sharing the same p) has \mu = (r+s)(1-p)/p and \phi = r+s. If you were adding together N draws with the same \mu and \phi, the summed distribution would have parameters N\mu and N\phi. This is more of a sketch than a proof; let me know if you would like me to present it more clearly!
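
For reference, here is a small pair of helper functions (the names are mine) for moving between the two parameterisations:

size_prob_to_mu_phi <- function(r, p) c(mu = r * (1 - p) / p, phi = r)
mu_phi_to_size_prob <- function(mu, phi) c(size = phi, prob = phi / (phi + mu))
size_prob_to_mu_phi(3, 0.4)   # mu = 4.5, phi = 3
mu_phi_to_size_prob(4.5, 3)   # size = 3, prob = 0.4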


I am aware that there may be other points of view here. In my situation, I want to treat my aggregated data as arising from independent draws from an underlying negative binomial, but other points of view may be valid. In particular, I can see how the view that the negative binomial is a Poisson-Gamma mixture, \mathrm{NB}(\mu, \phi) = \mathrm{Poisson}(\mu \, \mathrm{Gamma}(\phi, \phi)), could lead to the conclusion that the rate should only be applied to the mean, \mu. So maybe it isn’t something to be changed in the existing brms, but rather something for users to be aware of, so that they understand the assumption they are making. Hopefully other negative-binomial users join in the discussion.
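
To make the mixture view concrete, here is a small simulation sketch: multiplying the Poisson mean by a Gamma(\phi, \phi) heterogeneity term reproduces \mathrm{NB}(\mu, \phi), and under this view scaling only \mu leaves \phi untouched.

set.seed(3)
mu <- 4.5; phi <- 3
het <- rgamma(1e5, shape = phi, rate = phi)  # mean-1 heterogeneity
mix <- rpois(1e5, lambda = mu * het)         # Poisson-Gamma mixture
nb  <- rnbinom(1e5, mu = mu, size = phi)     # direct NB(mu, phi)
c(var(mix), var(nb), mu + mu^2 / phi)        # all approximately 11.25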


To clarify, here is the use case I have in mind for the “rate” addition term. Suppose we model a rate Y / T, where T is known: Y is a count and T is usually some measure of the time during which Y was observed.

We can then define a regression model so that E(Y / T) = exp(eta), which can be transformed to E(Y) = exp(eta + log(T)) independently of the distributional assumption on Y, which can be Poisson, negbinomial, or something else, as long as we use a mean-parameterization for that distribution.
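
As a tiny numerical check of that identity (the values are arbitrary; T_obs is just an illustrative exposure):

eta <- 1.2; T_obs <- 7
exp(eta) * T_obs         # E(Y / T) * T
exp(eta + log(T_obs))    # the same quantity, written in the offset form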

It sounds to me as if you have something else in mind when you use “rate”, or am I mistaken?


This is what I am after as well (my main application is to seabird bycatch in fisheries, where I am looking at captures over varying amounts of fishing effort). I am making the assumption that the process is independent, so, for example, I can treat counts of seabirds caught in a seven-hour period as the sum of counts over seven one-hour periods, and still get the same estimates for the parameters of the distribution. This means that the concept of a rate makes sense: the number of birds caught per hour has a distribution with parameters I can estimate, and I get the same estimate of those parameters (with more or less uncertainty) no matter how long I carry out my observations for. In the current implementation, as I watch for longer I get an ever-increasing estimate of \phi, and so the distribution of birds caught per hour isn’t well defined. The issue is exacerbated when I have a whole lot of observations across different time periods, and I want to bring them into the same regression.

Thank you for your insights and also for the clear derivation of the situation. You make a compelling argument, I think. Before I actually go ahead and make that change, are there any arguments in favor of just adjusting the mean? That is, is there a derivation rooted in natural processes described as Y / T where we would like to use the pair N\mu and \phi instead of N\mu and N\phi as the parameters of the negative binomial distribution?


Good question! I realise that I haven’t given much thought to other points of view. The negative binomial can be viewed as a mixture distribution (with some unobserved process causing the mean of the Poisson to vary in a Gamma-distributed way). From this point of view, maybe it seems natural to use the pair (N\mu, \phi), but you won’t get a stable estimate of \phi from taking counts over different intervals, and so I am still not sure that a rate is well defined.

However, if this is what you wanted, you can achieve it by using an offset term, which makes it explicit that you are only scaling the mean. When I try an offset term on the toy dataset, I get results that are the same as from `b_10` (from above, which uses the current rate formulation), as you would expect:

> b_off = brm(y  ~ 1 + offset(log(N)), family=negbinomial, data=data[, .(.N, y=sum(y)), group])
> b_off
 Family: negbinomial 
  Links: mu = log; shape = identity 
Formula: y ~ 1 + offset(log(N)) 
   Data: data[, .(.N, y = sum(y)), group] (Number of observations: 100) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     1.49      0.02     1.45     1.54 1.00     3427     2877

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape    41.13     11.74    23.88    70.71 1.00     3306     1979

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

> b_10
 Family: negbinomial 
  Links: mu = log; shape = identity 
Formula: y | rate(N) ~ 1 
   Data: data[, .(.N, y = sum(y)), group] (Number of observations: 100) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     1.49      0.02     1.45     1.53 1.00     3203     2500

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape    40.55     11.14    23.78    67.60 1.00     3865     2427

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

(As an aside, however, note that I get different answers from these two representations of the offset:

brm(y ~ 1 + offset(log(N)), ...

and:
brm(y | offset(N) ~ 1, ...

In the second case, the mean isn’t scaled as I would expect; it seems that the offset is not applied.)

At the moment, I have a perfectly good workaround, thanks to the magic of custom families, so if you wanted to see if anyone else had opinions on this scaling issue before making a decision, that would be fine by me! It may be that there are similar potential issues in other distributions where the rate parameter is used. I haven’t investigated beyond the negative binomial.


Thank you. In fact, rate has only been a convenience feature for adding an offset term, so that users do not have to work with offsets directly. But I see now that it can have a more general use, as we can adjust the negative binomial distribution in a more principled manner using rate. I will make this change accordingly.

As a side note, y | offset(N) won’t work (in fact it should return an error), because there is no offset addition term. I will check why it does not return an error in the first place.


Food for thought here. For comparison, I’ll mention how IMHO most people would think about this problem in analysis of gene expression:

  • We believe that the actual counts of RNA molecules in a population of homogeneous cells are NB distributed.
  • When you extract the RNA from cells, you randomly lose some molecules (roughly binomial sampling).
  • There is a PCR amplification step that is messy and probably impossible to model well; everybody looks away and ignores this in modelling. It mostly multiplies each RNA molecule many times.
  • In the sequencing machine, a random subset of the amplified RNA is sequenced (roughly binomial sampling).

This means that rate in this context would roughly correspond to the total number of reads you got from your sequencing run (in practice, people are a bit more clever about this, but that’s the gist).

But if you have the original number of molecules distributed as Y_{raw} \sim NB_2(\mu, \phi) and then observe Y_{obs} \sim Binomial(Y_{raw}, \theta), it follows that Y_{obs} \sim NB_2(\theta \mu, \phi), so the dispersion parameter doesn’t scale with the rate.
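
A quick simulation sketch of that thinning property (parameter values are arbitrary):

set.seed(4)
mu <- 50; phi <- 3; theta <- 0.2
y_raw <- rnbinom(1e5, mu = mu, size = phi)        # original molecule counts
y_obs <- rbinom(1e5, size = y_raw, prob = theta)  # binomial subsampling
c(mean(y_obs), theta * mu)                        # means agree (10)
c(var(y_obs), theta * mu + (theta * mu)^2 / phi)  # variance matches NB2(theta * mu, phi), ~43.3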

Not sure if a similar situation happens in other neg. binom uses.

But this post brought my attention to a problem in this approach: if I look at molecule counts in individual cells (which are mostly NB_2 distributed) and treat the whole sample as a sum of the molecules in the cells, then the overdispersion parameter should scale with the number of cells. So if I can’t guarantee that all samples have roughly the same number of cells, then when comparing samples I have a mix of two behaviors: differences in cell counts (which change the overdispersion) and differences in sample preparation + sequencing (which don’t). It is likely hard or impossible to disentangle those two… Will need to think about this some more!


The change is now implemented in the GitHub version of brms. @edwardabraham would you like to double-check the generated Stan code?

Thanks so much! I will test it out when I get a chance (likely this weekend).

Thanks for this example. It would be interesting to test things out by tinkering with some toy data generated in a known way.

I have thought a bit more about how to present this to users; here are my few cents (feel free to disagree, and please double-check my reasoning).

The overall use case for rate in count models (not sure how this applies to continuous log-linear models):

We are observing discrete events and want to account for known varying effort expended to observe those events (time spent observing, total molecules sequenced, number of detectors, …)

This varying effort might happen in at least two ways:

  1. For each unit of effort, the number of events is drawn independently from the underlying distribution. The observed total count is the sum of events across all units of effort.
  2. There is some (usually big) number of unobserved events, drawn once from the underlying distribution and fixed for each observation. The amount of effort determines the proportion of the unobserved events we get to observe. This means that if I divided one observation into two, each with half the effort (e.g. two sequencing runs from the same prepared sample), the observed counts might not be independent.

For a Poisson response, 1) and 2) lead to exactly the same model. For the negative binomial, 1) means that the shape should be scaled as @edwardabraham proposes, while 2) means that the shape should not be scaled.
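
A small simulation sketch of the difference for a negative binomial response (the doubling of effort and the thinning probability are illustrative choices of mine); both mechanisms give the same mean, but only the first increases the shape:

set.seed(6)
mu <- 4.5; phi <- 3; effort <- 2
# (1) independent draws per unit of effort: shape scales with effort
y1 <- replicate(1e5, sum(rnbinom(effort, mu = mu, size = phi)))
# (2) thinning a larger latent NB count: shape does not scale
latent <- rnbinom(1e5, mu = 10 * effort * mu, size = phi)
y2 <- rbinom(1e5, size = latent, prob = 0.1)
c(mean(y1), mean(y2))                                       # both around effort * mu = 9
c(var(y1), effort * mu + (effort * mu)^2 / (effort * phi))  # ~22.5, matches NB2(N * mu, N * phi)
c(var(y2), effort * mu + (effort * mu)^2 / phi)             # ~36, matches NB2(N * mu, phi)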

On a practical level, I think 2) might be rarer, and it is probably OK to make users handle it via an offset, but having at least a hint of this distinction in the docs might IMHO be useful. The only other example of 2) that is not sequencing that I can think of is something like having a varying number of detectors, each with an independent chance to observe an event, and then counting events noticed by multiple detectors only once.

Hope that’s clear.

Indeed, the rate term in brms was made with (1) in mind and before your comments I wasn’t aware of the situations that lead to (2). Thank you for bringing this to my attention.

With regard to (2), is the proportion of observed / unobserved events assumed to be known or rather modeled as a parameter?


In sequencing, you know the relative proportion (i.e. you get a vector of multipliers with mean 1). I am not sure there are other actual use cases that behave like sequencing, though, and I can’t speak to their needs.

Also, sequencing data are usually a bit too big for brms, and ML/MAP methods are used almost exclusively: a small experiment would have 6 samples x 5K genes, giving you 30K rows in the data and ~15K parameters (mean + between-group difference + overdispersion per gene, with the overdispersion usually being quite constrained by some hierarchical structure). This might be tractable with a GPU / cluster, or it might be OK to split into multiple parts (as the hierarchical parameters get well constrained with far fewer genes), but maybe it will never become a common use case for brms.

I’m going to have to sit down and think about (2) … may take me a little while ☺️

Confirming that my test above now passes—the parameters of b_10 are now as expected :party_parrot: and the contribution to the likelihood is what I would expect:

target += neg_binomial_2_log_lpmf(Y | mu + log_denom, shape * denom);

One slight puzzle: the documentation in brmsfamily (and as specified in families.R) says it is negbinomial(link = "log", link_shape = "log"), but it looks from the likelihood contribution that the shape parameter has an identity link function, and it is not log transformed in the summary of the model fit.

Good to see you have also fixed the geometric distribution, and removed the rate parameter from other families, so I have no more homework to do there :) Thanks once again!