Parameter estimation for the sum of negative binomial distributions


I am very new to Stan, and I would really appreciate some guidance and advice on modelling a problem. I have count data from three underlying sources. I assume each source can be modelled as a negative binomial distribution NB(r, p), where r is the number of failures until the experiment is stopped and p is the probability of a failure. So, my model is:

y \sim NB(r_1, p_1) + NB(r_2 p_2) + NB(r_3, p_3)

and I’d like to estimate the parameters of the NB distributions given the data y. But I thought this model is too complicated to solve even with Stan (please correct me if I’m mistaken). So, I make a simplifying assumption that p_1 = p_2 = p_3 = p, and so my model is now:

y \sim NB(r_1 + r_2 + r_3, p)

I also have some additional information I can leverage. For some of the y, the data only comes from either one or two sources, such that:

y_X \sim NB(r_1, p)
y_Y \sim NB(r_1 + r_2, p)
y_Z \sim NB(r_1 + r_2 + r_3, p)

and I’m given y_X, y_Y, and y_Z. I also know that r_1 < r_2 << r_3. Hopefully, this description is clear. I’ve constructed the model below:

data {
  int<lower=0> NX;
  int<lower=0> yX[NX];
  int<lower=0> NY;
  int<lower=0> yY[NY];
  int<lower=0> NZ;
  int<lower=0> yZ[NZ];
parameters {
  positive_ordered[3] r;
  real<lower=0.0, upper=1.0> p;
model {
  real muX;
  real muZ;
  real muY;

  real sigmaSqX;
  real sigmaSqY;
  real sigmaSqZ;
  real phiX;
  real phiY;
  real phiZ;
  muX = ((r[1] * (1 - p)) / p); 
  muY = (((r[1] + r[2]) * (1 - p)) / p);
  muZ = ((((r[1] + r[2]) + r[3]) * (1 - p)) / p); 

  sigmaSqX = (((r[1] * (1 - p)) / p) ^ 2); 
  sigmaSqY = ((((r[1] + r[2]) * (1 - p)) / p) ^ 2);
  sigmaSqZ = (((((r[1] + r[2]) + r[3]) * (1 - p)) / p) ^ 2); 
  phiX = (muX ^ (2 / (sigmaSqX - muX))); 
  phiZ = (muZ ^ (2 / (sigmaSqZ - muZ)));
  phiY = (muY ^ (2 / (sigmaSqY - muY)));

  p ~ beta(0.001,10);
  r[1] ~ gamma(0.01,0.01);
  r[2] ~ gamma(0.01,0.01);
  r[3] ~ gamma(5,10);

  yX ~ neg_binomial_2(muX, phiX);
  yY ~ neg_binomial_2(muY, phiY);
  yZ ~ neg_binomial_2(muZ, phiZ);

Everything seems to work fine, and I’m getting results that look right. I’m wondering though if my Stan implementation can be improved.

Thank you for any feedback, advice or suggestions,

1 Like

Since you have measurements that correspond to different parts of the sum, then you might be able to estimate p1, p2, and p3 separately.

Otherwise assuming the r/p → mu/phi transformations are alright, then this seems fine.

Is the goal of the gamma(0.01, 0.01) priors to make a spike near zero? I was doing draws from rgamma(5, 0.01, rate = 0.01) in R (which I think corresponds to the gamma in Stan and the numbers look really small. But this is the number of failures parameter so shouldn’t this at least be 1?

1 Like

Thanks for responding. As I see it, the other problem with estimating p_1, p_2, and p_3 individually, is that getting the individual r_i and p_i is also going to be complicated because y_Y and y_Z are sums of NB distributions. And what I really am after is getting the individual parameters. As far as I know the distribution of the sum of NB(r_1, p_1) + NB(r_2, p_2) does not have a nice closed-form solution. It’d be nice if I could do something like:

yY ~ neg_binomial_2(mu1, phi1) + neg_binomial_2(mu2, phi2)

but of course, I can’t right?

Ya, you’re right, I want a spike near zero and the number of failures is small. I guess I’m kind of doing this upside down, but I’m modelling my count numbers as failures, so a failure is really a “success” in my model. I guess it’s not completely intuitive, so let me think about it some more. :)

Thanks again for the feedback. If you have any other suggestions or thoughts, please let me know, I really appreciate it.

1 Like

Oooh yeah good point. I don’t know of a way around that.

I don’t have any other suggestions off hand. If you’re able to generate data from your model and fit it back and that looks vaguely good, that’s a super solid step 1.

You might look here Bayesian Workflow « Statistical Modeling, Causal Inference, and Social Science for ideas on next steps based on where you are with your modeling.


Thanks so much for the advice and feedback. Everything seems to be working fine, so I guess so far so good. I’ll have a look at the article you sent too, thanks again!

Check this out. @martinmodrak can give more details.


Thanks @maxbiostat. I’ll look through the blog by @martinmodrak. I have very limited experience with saddlepoint approximations though and my impression was that they’re not very good and don’t have full support. I’ve seen some implementations where they try to improve the approximations by adopting second-order cumulants, for example. But then it starts getting complicated. Again, please correct me if I’m mistaken. In my case, the endpoints are important and I don’t have a large amount of data, so I figured the approximation of assuming common p would be more appropriate.

Happy to hear anyone’s input on the matter though, especially if @martinmodrak has suggestions since he seems to have thought about this problem a fair amount. Thanks again.

You’ll note in the blog that I actually struggle to find a scenario where the saddlepoint approximation is better than approximating with a single negative binomial, even in the case of unequal p. I would guess that the approximation in the blog (i.e. find NB parameters to make the distribution have the same mean and variance as the sum) would perform better than assuming common p. There might be some benefit in treating the p_i hierarchically, i.e. assume a priori that they are similar. If the fitted p are actually the same, the approximation should reduce to exactly the case you use, so you shouldn’t be losing anything.

Best of luck with the model


First, a big thank you to @martinmodrak for your insight. I read your blogpost and tried to follow your lead for my model. Here’s the code for reference, and hopefully it’ll help anyone else working on a similar problem:

functions {
  /* Using the method of moments approximations for computing mu and phi overall.
   * This approach follows that in Martin Modrak's blog (
  real negBinomialSumMoments_lpmf(int[] sumY, vector mu, vector phi) {
    real muOverall = sum(mu);
    real phiOverall = square(muOverall) / sum(square(mu) ./ phi);
    return neg_binomial_2_lpmf(sumY | muOverall, phiOverall);

data {
  // These are the input data sizes, followed by the actual observed data.
  int<lower=0> NX;
  int<lower=0> NY;
  int<lower=0> NZ;
  int yX[NX];
  int yY[NY];
  int yZ[NZ];

parameters {
  positive_ordered[3] mu;
  vector<lower=0>[3] cv; // coefficient of variation

transformed parameters {
  vector[3] phi;
  vector[3] variances;
  phi[1] = inv(square(cv[1]));
  phi[2] = inv(square(cv[2]));
  phi[3] = inv(square(cv[3]));
  variances = mu + square(mu) ./ phi;

model {
  cv[1] ~ exponential(1);
  cv[2] ~ exponential(1);
  cv[3] ~ exponential(1);
  mu[1] ~ gamma(0.01, 0.01);
  mu[2] ~ gamma(0.01, 0.01);
  mu[3] ~ gamma(5, 10);
  yX ~ neg_binomial_2(mu[1], phi[1]);
  yY ~ negBinomialSumMoments(mu[:2], phi[:2]);
  yZ ~ negBinomialSumMoments(mu, phi);

I used 20,000 iterations and 5 chains. The results are not really what I hoped for though. The main issue is that the means are pretty much identical and it seems that the \phi_i capture the differences among the three NB distributions. My assumption is that the mean of y_X should be small, y_Y moderate, and y_Z highest. Hence, the gamma priors I chose. It’s very possible my implementation isn’t what you had in mind, but please let me know if you (@martinmodrak) or anyone else, has any suggestions or corrections.

I appreciate all the help as always.

1 Like

I don’t think those priors do this. gamma(0.01, 0.01) has mean 1 while gamma(5, 10) has mean 0.5, and only 1% prior probability for values > 1.16. Those would interact with the positive_ordered constraint - I am not completely sure what happens as you assume order, and give priors that “fight” the ordering, but the overall effect could plausibly be a quite narrow joint prior.

When I simulate data that are not in conflict with the prior, e.g.

options(mc.cores = 4)
m <- cmdstan_model("nb_sum.stan") # This is the model you shared

N <- 50
mu <- c(0.1, 0.8, 1.0)
cv <- c(1.5, 2.1, 0.3)
phi <- 1 / cv^2

yX <- rnbinom(N, mu = mu[1], size = phi[1])
yY <- rnbinom(N, mu = mu[1], size = phi[1]) + 
  rnbinom(N, mu = mu[2], size = phi[2])
yZ <- rnbinom(N, mu = mu[1], size = phi[1]) + 
  rnbinom(N, mu = mu[2], size = phi[2]) + 
  rnbinom(N, mu = mu[3], size = phi[3])

data <- list(NX = N, NY = N, NZ = N, yX = yX, yY = yY, yZ = yZ)

fit m$sample(data = data)

I get very reasonable inferences that cluster close to the true values from simulated data:

  variable          mean   median           sd    mad        q5      q95  rhat ess_bulk ess_tail
   <chr>            <dbl>    <dbl>        <dbl>  <dbl>     <dbl>    <dbl> <dbl>    <dbl>    <dbl>
 2 mu[1]            0.173    0.160       0.0747 0.0633    0.0773    0.314  1.00    1098.     882.
 3 mu[2]            0.644    0.640       0.164  0.167     0.373     0.922  1.00    1224.     912.
 4 mu[3]            0.977    0.955       0.206  0.203     0.681     1.36   1.00    2759.    2252.
 5 cv[1]            1.34     1.20        0.960  0.986     0.0975    3.13   1.00    1310.    1213.
 6 cv[2]            2.50     2.47        0.567  0.506     1.65      3.45   1.00    1176.     752.
 7 cv[3]            0.852    0.817       0.566  0.666     0.0565    1.80   1.00    1386.    1031.

So I suspect you are dealing with a prior vs. data conflict.

Why did you need so many iterations? Were there other issues?

I don’t think that’s correct transformation between coefficient of variation (using the Wikipedia definition) and the phi parameter of NB2. Note that for NB2, you have (unless I am messing up big - which is possible)

\mathrm{CV} = \frac{\sqrt{\mu \left( 1 + \frac{\mu}{\phi} \right) }}{\mu} = \\ = \sqrt{\frac{1}{\mu} + \frac{1}{\phi}}

So the transform to \phi has to depend on \mu.

Best of luck with the model!


@martinmodrak thank you again, especially for taking the time to run the code and simulate data, etc. I had concerns about the priors, and you confirmed my thoughts. It turns out that I assumed the wrong parameterization for gamma, and I should have double-checked, so thanks for pointing out that problem. I’ve changed the priors and tried some other combinations:

mu[1] ~ gamma(0.01, 0.01);
mu[2] ~ gamma(0.01, 0.01);
mu[3] ~ gamma(0.1, 0.01);

I should have clarified that I ran up to 20,000 iterations to make sure there wasn’t a problem with warmup. But 4,000 iterations worked fine too.

I absolutely agree with you, and this issue troubled me because like you, my algebra included a \mu term too. I followed the suggestion from this post: Prior distributions on shape parameters « Statistical Modeling, Causal Inference, and Social Science

The post doesn’t define what is the shape parameter so it’s possible I could’ve misinterpreted. I just assumed that the coefficient of variation is dominated by \phi and the \frac{1}{\mu} term doesn’t matter much. I tested this hypothesis, and changed my \phi computation to:

vector[3] phi = inv(square(cv) - inv(mu));

but it didn’t seem to change my results.

My conclusion is that for my case at least, I don’t have enough data to get away from the prior. Thanks again for the discussion and the well wishes. Happy to hear any other comments as well. This discussion has been extremely enlightening to me, I really appreciate it.


There seems to be something missing…

The method suggested in the post is a good way to put a prior on the shape parameter, it just doesn’t correspond to a prior on coefficient of variation of the NB, that’s all :-). I thought putting priors on CV was important for your case.

EDIT: I missed that it puts a prior on CV of the latent gamma, so that’s obviously a misunderstanding on my side as I interpreted it as CV of the negative binomial.

That is obviously a possible outcome, but I would expect that at least for mu[1] you should get reasonable inferences even without too much data. How big are your NX, NY, NZ?

You are welcome, glad to have helped.

@martinmodrak Sorry, I realized I accidentally deleted the code when I quoted your post. I edited the post.

Yeah the prior for CV is important.

There are ~3000, ~5000, and ~12000 points for N_X, N_Y, and N_Z, respectively. I don’t have much experience with this problem, so I just assumed my N_i are low.

That’s weird - I did get relatively narrow posterior intervals already with all Ns = 50. 3000 observations should basically completely determine the mean of a negative binomial… Can you share (part of) your data and/or posterior summaries from the model?

Sure, I sent you a message. I can share the data privately.

UPDATE After talking to @martinmodrak some more, it seems that his solution of using the method of moments approximation is likely the best approach for my problem. I do need to work on the priors though. Please ignore my conclusion above, as it seems that I do indeed have enough data to get away from the prior.