Generating unrealistically high draws out of in posterior in zero-inflated negative binomial model

I am using brms to model a count process using a zero-inflated negative binomial model. The response variable in the data is overdispersed, and I am generally getting good mean predictions. But when I use posterior_predict, there are some extreme values that are unrealistically high, by an order of magnitude. If I understand the negative binomial distribution correctly, the number of trials needs to be capped, even as the probability of success increases.

Consider the following problem:

  • We are interested in the number of dollars a public speaker makes in his first 500 speaking engagements over a maximum of two years.
  • Better public speakers not only command more money per speaking engagement; they also get booked for more speaking engagements.
  • But they are limited to 500 engagements over a couple years, and even for the best speakers, there are some natural scheduling limitations to whether they will be able to book the maximum number of engagements.
  • Of course, there are some realistic limitations to how much any organization would offer someone for a speaking engagement.

I wanted to model this as a negative-binomial problem, as I am ostensibly just interested in the total count of dollars. Again, I am mostly having an issue with the variance of the posterior draws.

Does anyone have suggestions for the best way to go about handling this? Would I be best off using an asymmetrical prior? Upper bounds on the priors? Is my only option to model the underlying processes in a different fashion?

I have used both student_t and normal priors, and the latter does rein in the most extreme values. I have also tried setting priors on the shape variable.

Thank you for your help!

Informative priors would likely help, as you say. Both on the regression coefficients, which are modeled on the log-scale by default for neg-binomial models, as well as on the shape parameter. The latter has a somewhat suboptimal parameterization in the sense then larger values imply less overdispersion (neg-binomial converges to possion for shape going to infinity).

Without more details about the exact model you are trying to fit and on the variables you are using, it is hard to give you more specific advice.

As a start, can you show me the brm() call of your model and the pp_check() plot of you fitted model?

Can you also explain, what you mean by:

If I understand the negative binomial distribution correctly, the number of trials needs to be capped, even as the probability of success increases.

Based on the description of your problem, it appears that maybe the better way to estimates the total count of dollars is to separately model (a) number of appearances and (b) payment per appearances and then to calculate the total count as the product of (a) and (b).

Then you could model the number of appearances with the beta binomial distribution, which allows to explicitly put an upper limit on the number of appearances.

In pseudo code:
X1 are predictors for number of appearances, b1 are regression weights
expected_prop_appear = inv_logit(X1*b1)
y_number_appearances ~ beta_binomial(max_number_appearances, expected_prop_appear * phi, (1-expected_prop_appear) * phi)
phi is an overdispersion parameter

X2 are predictors for pay per appearance, could be the same as for number of appearances.
expected_pay_per_appearance = exp(X2*b2)
y_pay_per_appearance ~ neg_binomial_2(expected_pay_per_appearance,phi2)
phi2 is an overdispersion parameter

And finally
expected_total_pay = expected_prop_appear * max_number_appearances * expected_pay_per_appearance
y_total_pay ~ neg_binomial_2(expected_total_pay, phi3)

I am not sure if such a 2 step model can be implemented in brms,

1 Like

Thank you both! This community (and, Paul, the brms package) is incredible.

First of all, to be clear, I have been using brm_multiple, as I am using multiple imputation to impute the performance_rating_1 and performance_rating_2 when critic_grade is not missing. The months_since_becoming_public_speaker variable was an attempt to handle censored data. I have also put priors on individual coefficients, but this was my last run as I moved away from student_t priors to trying normal ones.

brm_multiple(bf(dollars ~ demographic_type_three_levels + z_performance_measure_1 + z_performance_measure_2 + mo(critic_grade) + mo(months_since_becoming_public_speaker),
zi ~ elo_rating_pref*demographic_type_three_levels + z_performance_1 + z_performance_measure_2 + z_performance_measure_3 + overall_experience + months_since_becoming_public_speaker),
data = imp,
family = zero_inflated_negbinomial(),
prior = c(set_prior("normal(0, 2)", class = "b"),
set_prior("normal(0, 2)", class = "Intercept") +
set_prior("normal(0, 2)", class = "b", dpar = "zi")),
control = list(adapt_delta = 0.93), iter = 4000)

I’m not sure the pp_check is going to be very helpful because of the scales here, but let me know if there are any plotting options that would help you here:

Thank you! I suspected that may be a better way to model the underlying process, though I am not sure if brms can implement such a model, either. Perhaps Paul can advise here. Am I correct in thinking that the more complicated part here is that, when sampling, I would want some sort of covariance structure between y_number_appearances and y_pay_per_appearance. (Perhaps counterintuitively, the more someone is making per appearance, the more appearances they would be expected to make since they are more in-demand.)

Paul, to be honest, I could not explain this well. I have been trying to understand the negative binomial distribution better and have seen a few different explanations of it, one of which included the concept of counting the number of trials to reach a count given a binomial probability. I think I was just “reaching” for any possible answer/intuition here.

From the pp_check plot it looks like very few predictions are that large. In which cases do you think will this be problematic for your post-processing of the results?

You may try to set a prior on the shape that keeps it away from zero (large variance). How does the current posterior-distribution of the shape parameter look like?

brms supports multivariate models, but this kind of dependency @Guido_Biele proposes (and which looks reasonably at first glance) cannot be handled by brms. The situation is further complicated by the use of a zero-inflated distribution. If you are willing to go this path, you may to amend the Stan code generated by brms (see make_stancode).

You’re right that there are very few predictions that are large, though for some individual observations, it is a non-negligible amount. For the best people, I am finding something like 1% to 4% of their predictions can be unreasonably large depending on the model I have run. I have wondered if I should just cap this post-hoc, though I was worried that would cover other issues in the model (not to mention raise issues of which mean value I should be using).

I have been getting values very close to zero, including when I set a very large prior a couple days ago (normal(30, 1)):

I will try again, as I didn’t really understand which direction the shape parameter worked before—would you suggest using a lower bound to keep it away from zero?

Thank you!

Oh I didn’t see you were predicting the shape as well. Indeed, based on what I see about the Intercept, your shape is estimated to be very small, which seems to be what the data suggests given that a very informative prior (normal(30, 1)) does not seem to help much

Are those 1 to 4% affecting your conclusions you draw from the model?

It looks like the code I pasted above omitted the shape parameter—must have accidentally deleted it. I have tried two things in estimating the shape parameter: estimated only an intercept, and also tried using a spline on the performance measure in the hopes that increased performance would curtail the dispersion some. The latter didn’t seem to make a huge practical difference.

The 1-4% on the top observations probably will not affect the conclusions too much, since my main interests are in the mean and then the probabilities of achieving different cut points. That said, I am generally more interested in distinguishing between and ranking the top end of the population than the bottom end. I may test if I would be better off zero inflating the data more—such that those with only one or two speaking engagements are essentially said to have zero. Since most of the population is on that low end, perhaps it would be better for the negative binomial part of the model to have to fit that end a little less.

In general, I suppose my main concern would just be to make sure these unrealistic draws don’t seriously impact the moments or the more meaningful part of the distribution. I want to make sure I’m doing the proper diligence in model checking.

@sean If I’m not misunderstanding the setup here, this seems to be begging for a joint model or a joint model factored into a model for number of engagements (where you handle any censoring/selection/other bias issues) and a conditional model for dollars given number of engagements. These models are fit simultaneously in the same Stan program and can share parameters where appropriate.

Have you tried something like that?

And glad you’re finding the Stan community helpful!

@jonah That makes a lot of sense, and sounds similar to what @Guido_Biele was suggesting above. I haven’t tried it yet, as I have been leaning on brms so far. It does seem that I may need to step outside brms, though the Stan code it generates will likely give me a start. Time to do some reading on joint models!

1 Like

Since I haven’t seen anyone address this so far:

A negative binomial models the amount of trials necessary to accrue a number of failures (or successes, depending on your point of view) with a given probability. A simple way to understand this is imagining the distribution of die throws before you roll a 1. The probability of such an event is not that small (1/6) and on average, it should take 5 throws before the experiment is over; however it’s perfectly possible end up on a sequence well into hundreds of throws. All the NB distribution will tell you is how rare that event might be.

In conclusion: that’s a feature of the distribution, not a bug. And very likely, also not what you want to represent any of your model components (appointments and/or earnings) as.