The `negbinomial` family allows a rate term to be specified in the formula. However, for a negative binomial distribution, the rate should scale the shape parameter \phi as well as the mean.
For a Poisson distribution, the sum of N draws from a distribution with mean \mu has the same distribution as a single draw from a distribution with mean N\mu. For a negative binomial distribution with mean \mu and shape \phi, the sum of N draws has the same distribution as a single draw from a distribution with mean N\mu and shape N\phi.
Not accounting for this is a common error when applying negative binomial models to data with unequal sample sizes.
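This convolution property is easy to verify empirically; a quick sketch (the constants here are illustrative, not part of the report):

```r
# Sums of N iid NB(mu, phi) draws should match single draws from NB(N*mu, N*phi).
# Note that rnbinom's `size` is the shape parameter in the mu/shape parameterisation.
set.seed(1)
mu <- 4.5; phi <- 3; N <- 10
sums   <- colSums(matrix(rnbinom(1e5 * N, mu = mu, size = phi), nrow = N))
direct <- rnbinom(1e5, mu = N * mu, size = N * phi)
c(mean(sums), mean(direct))  # both close to N*mu = 45
c(var(sums),  var(direct))   # both close to N*mu + (N*mu)^2/(N*phi) = 112.5
```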
The code below illustrates the issue on a simple dataset. Samples are drawn from a negative binomial, and the parameters of the distribution are estimated by fitting a negative binomial both to the full dataset and to an aggregated dataset (summed in groups of ten). Note that an integer N is used to scale the mean here, but it could equally be a real number (for example, from counts taken over different durations).
```r
library(brms)
library(data.table)

set.seed(5)

# Sample from a negative binomial with known parameters
r <- 3
p <- 0.4
# The distribution has this mean and shape:
mu <- r * (1 - p) / p  # 4.5
phi <- r               # 3
data <- data.table(N = 1, y = rnbinom(1000, size = r, prob = p), group = rep(seq(100), 10))

# Fit a model to the full dataset, and to an aggregated dataset (summed in groups of 10):
b_1 <- brm(y | rate(N) ~ 1, family = negbinomial, data = data, save_model = 'nb_test.stan')
b_10 <- brm(y | rate(N) ~ 1, family = negbinomial, data = data[, .(.N, y = sum(y)), group])
# A little helper to check that a value lies within the 95% credible interval
between <- function(fit, value, parameter) {
  (value > posterior_summary(fit)[parameter, 'Q2.5']) &
    (value < posterior_summary(fit)[parameter, 'Q97.5'])
}
# The means from the full and aggregated datasets are as expected:
stopifnot(between(b_1, log(mu), 'b_Intercept'))
stopifnot(between(b_10, log(mu), 'b_Intercept'))
# The shape from the full dataset is as expected:
stopifnot(between(b_1, phi, 'shape'))
# But the shape from the aggregated dataset fails the check:
stopifnot(between(b_10, phi, 'shape'))
# Error: between(b_10, phi, "shape") is not TRUE
# In fact `posterior_summary(b_10)['shape', 'Estimate']` is around 60, an
# order of magnitude larger than the true value of 3: the aggregated outcome
# is genuinely NB(N*mu, N*phi), so a model that scales only the mean by the
# rate term absorbs the factor of N into the shape estimate.
```
In the Stan model generated by brms, the contribution to the negative binomial log-likelihood is

```stan
target += neg_binomial_2_log_lpmf(Y | mu, shape);
```

however, I believe that it should be

```stan
target += neg_binomial_2_log_lpmf(Y | mu, exp(log_denom) * shape);
```
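To see numerically why the scaled shape is the right one, here is a small R sketch (not the generated Stan code; `dnbinom`'s `size` corresponds to Stan's `shape`). The average log score of aggregated counts is higher under the shape-scaled likelihood:

```r
# Score aggregated counts under both versions of the likelihood.
# N = 10 observations per group, per-observation mu = 4.5 and shape phi = 3.
set.seed(2)
N <- 10; mu <- 4.5; phi <- 3
y_agg <- rnbinom(1e5, mu = N * mu, size = N * phi)  # true aggregated distribution
# Current behaviour: only the mean is scaled by the rate term
mean(dnbinom(y_agg, mu = N * mu, size = phi, log = TRUE))
# Proposed fix: the shape is scaled as well, giving a higher average log score
mean(dnbinom(y_agg, mu = N * mu, size = N * phi, log = TRUE))
```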
In the meantime, I have been handling this with a custom family. Posting it here in case it is helpful to anyone else who runs into this issue, or who is working with negative binomial models on data with varying sample sizes.
```r
negbinomial_rate <- custom_family(
  "negbinomial_rate",
  dpars = c("mu", "phi"),
  links = c("log", "identity"),
  lb = c(0, 0),
  type = "int",
  vars = "vreal1[n]"
)

stan_funs <- "
real negbinomial_rate_lpmf(int y, real mu, real phi, real V) {
  return neg_binomial_2_lpmf(y | mu * V, phi * V);
}
int negbinomial_rate_rng(real mu, real phi, real V) {
  return neg_binomial_2_rng(mu * V, phi * V);
}
"
stanvars <- stanvar(scode = stan_funs, block = "functions")

b_custom <- brm(y | vreal(N) ~ 1, family = negbinomial_rate,
                data = data[, .(.N, y = sum(y)), group], stanvars = stanvars)
# This now passes the tests without any trouble:
stopifnot(between(b_custom, log(mu), 'b_Intercept'))
stopifnot(between(b_custom, phi, 'phi'))
```
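For post-processing (`posterior_predict()`, `loo()`, etc.) the custom family also needs R-side methods. A minimal sketch following the pattern from the brms custom-families vignette (note the second argument was named `draws` rather than `prep` in older brms versions, so adjust to your version):

```r
# Hypothetical helpers so that log_lik() and posterior_predict() work with
# the custom family; dnbinom/rnbinom's `size` is the NB shape parameter.
log_lik_negbinomial_rate <- function(i, prep) {
  mu  <- brms::get_dpar(prep, "mu", i = i)
  phi <- brms::get_dpar(prep, "phi", i = i)
  V   <- prep$data$vreal1[i]
  dnbinom(prep$data$Y[i], mu = mu * V, size = phi * V, log = TRUE)
}

posterior_predict_negbinomial_rate <- function(i, prep, ...) {
  mu  <- brms::get_dpar(prep, "mu", i = i)
  phi <- brms::get_dpar(prep, "phi", i = i)
  V   <- prep$data$vreal1[i]
  rnbinom(length(mu), mu = mu * V, size = phi * V)
}
```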
- Operating System: Ubuntu 18.04
- brms Version: 2.13.0