A colleague has collected questionnaire data asking respondents to indicate on how many of the past 30 days they used each of a variety of substances. So, for each respondent and substance, we observe an integer count from 0 to 30. Eyeballing the data, some substances have high use rates (e.g., smoking), so their responses cluster at 30, while others are spread more widely through the range. For a given substance, any suggestions for an appropriate distribution? Maybe neg_binomial_2_log(mu,phi) T[,30]?
Maybe this isn’t helpful, but I would take a step back and consider the process. Since you mention smoking: you can smoke every day pretty easily, but if you’re using alcohol daily, your schedule looks very different from that of someone smoking a pack of cigarettes a day. To me that suggests that modeling the time commitment might put these different substances on a more common scale. But short of deeper thinking, yeah, a truncated negative binomial sounds fine as a base.
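For reference, here is a minimal sketch of what an upper-truncated negative binomial puts on each value from 0 to 30. It assumes the mean/dispersion parameterisation used by neg_binomial_2 (with mu on the natural scale rather than the log scale), and the function names and the values of mu and phi are made up purely for illustration.

```python
import math

def nb2_lpmf(y, mu, phi):
    """Log pmf of a negative binomial with mean mu and dispersion phi."""
    return (math.lgamma(y + phi) - math.lgamma(y + 1) - math.lgamma(phi)
            + y * math.log(mu / (mu + phi))
            + phi * math.log(phi / (mu + phi)))

def truncated_nb_pmf(mu, phi, upper=30):
    """Pmf on 0..upper, renormalised so the probabilities sum to one."""
    raw = [math.exp(nb2_lpmf(y, mu, phi)) for y in range(upper + 1)]
    total = sum(raw)  # equals the untruncated CDF evaluated at `upper`
    return [p / total for p in raw]

# Illustrative (made-up) parameter values.
pmf = truncated_nb_pmf(mu=40.0, phi=2.0)
for y in (0, 10, 20, 30):
    print(y, round(pmf[y], 4))
```

Plotting this pmf for a few values of mu and phi next to the observed histogram is a quick way to see how well the truncated shape matches the clustering at 30.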
I would suggest a beta-binomial distribution with n = 30. With the truncated negative binomial you won’t be able to put as much mass on the value 30. The beta-binomial distribution has the nice property that it can take on a bimodal shape with mass at each end, which sounds like what you have.
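To illustrate the shape claim, here is a small sketch of the beta-binomial pmf with n = 30. The shape parameters (0.3 and 0.2) are arbitrary values chosen to show the U-shape, not estimates from any data, and the function names are hypothetical.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(Y = k) for a beta-binomial with n trials and shapes alpha, beta."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return math.exp(log_choose
                    + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))

N_DAYS = 30
# Both shapes below 1 give a U-shape: mass piles up at 0 and at 30.
pmf_u = [beta_binomial_pmf(k, N_DAYS, 0.3, 0.2) for k in range(N_DAYS + 1)]
for k in (0, 15, 30):
    print(k, round(pmf_u[k], 4))
```

With both shape parameters below 1, the pmf is largest at the two ends and smallest in the middle; raising alpha relative to beta shifts more of the mass toward 30.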
Just because you observe counts only up to 30 doesn’t imply that the distribution is truncated.
It sounds to me like your data are not homogeneous, e.g., for smoking you get totally different
counts than for the other substances.
But if you want to throw one distribution at your data that fits it all, I’d recommend the Generalized
Poisson Lindley Distribution. I once coded it up in Stan:
@Andre_Pfeuffer I had never heard of the Poisson Lindley Distribution. That’s pretty cool, thanks for that. I will remember it the next time I have count data.
I was wondering how people come up with new distributions usually. Is it possible to come up with an exponential family distribution that’s heavy-tailed?
Take a look at:
Suppose the lambda of a Poisson follows an exponential distribution: Y ~ poisson(lambda),
lambda ~ exponential(mu).
Integrating lambda out gives the marginal distribution of Y: https://en.wikipedia.org/wiki/Marginal_distribution
Thus, you may use the integrated-out formula, which is what you see in the papers, or
in Stan you could also keep lambda as a parameter and write lambda ~ exponential(mu);
The closed form is more beautiful and efficient.
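As a worked check of the "integrated out" idea, here is a sketch for the Poisson-exponential case. It assumes mu is a rate (inverse scale), in which case the integral has the closed form P(Y = k) = mu / (1 + mu)^(k + 1), a geometric distribution. The function names, the integration grid, and the value mu = 0.5 are illustrative choices.

```python
import math

def poisson_pmf(k, lam):
    """Poisson pmf, computed on the log scale for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def marginal_numeric(k, mu, upper=100.0, steps=100_000):
    """Midpoint-rule integral of poisson(k | lam) * exponential(lam | mu)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += poisson_pmf(k, lam) * mu * math.exp(-mu * lam)
    return total * h

def marginal_closed_form(k, mu):
    """Closed form of the same integral (a geometric pmf)."""
    return mu / (1.0 + mu) ** (k + 1)

mu = 0.5  # illustrative rate
for k in (0, 3, 10):
    print(k, marginal_numeric(k, mu), marginal_closed_form(k, mu))
```

The two columns should agree to several decimal places, which is the sense in which the closed form replaces the latent lambda without changing the model for Y.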
Here are derivations of the negative binomial etc.; note that they are different, since
they are based on exp(mu)*lambda, see (1) in:
Thus the parameters of distributions are considered to have a distribution as well.
People combine them and get fancy about “their” new distribution.
You could also model it as exp(mu)*lambda, with lambda ~ exponential(mu2).
It’s worth thinking about the differences. It’s also worth making a Q-Q plot of your data
against the fitted, newly developed distribution.
Truncation can often have awkward effects on distributions unless you can pair it with informative priors that keep the bulk of the distribution away from the truncation point. Another option is just to model the 31 possible values (0 through 30) with a simplex, perhaps with an informative Dirichlet prior. If you can avoid multimodality then this would give you the most flexibility.
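A rough sketch of the simplex idea, using the conjugate Dirichlet-categorical update rather than sampling: the posterior over the 31 probabilities is again a Dirichlet whose parameters are the prior parameters plus the observed counts. The responses and the prior value 0.5 below are made up for illustration, and the function name is hypothetical.

```python
from collections import Counter

N_CATEGORIES = 31  # possible answers: 0, 1, ..., 30 days

def dirichlet_posterior_mean(responses, prior_alpha):
    """Posterior mean of the simplex under a Dirichlet(prior_alpha) prior."""
    counts = Counter(responses)
    posterior_alpha = [prior_alpha[k] + counts.get(k, 0)
                       for k in range(N_CATEGORIES)]
    total = sum(posterior_alpha)
    return [a / total for a in posterior_alpha]

# Made-up responses for a single substance.
responses = [30, 30, 30, 29, 0, 0, 1, 15, 30, 28]
prior_alpha = [0.5] * N_CATEGORIES  # illustrative, weakly informative
theta_hat = dirichlet_posterior_mean(responses, prior_alpha)
print(round(theta_hat[30], 4), round(theta_hat[0], 4))
```

An informative prior would replace the constant 0.5 with larger values on the days expected to be common, for example on 0 and 30.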
The clustering at the maximum is a consideration, I know. As a general comment, though, the denominator is known in this case so it wouldn’t preclude a binomial model.