Recommendation for truncated count data?

mike-lawrence · June 30, 2017, 2:06pm

A colleague has collected questionnaire data asking respondents to indicate on how many of the past 30 days they used each of a variety of substances. So, for each respondent and substance we observe an integer count from 0 to 30. Eyeballing the data, some substances have high use rates (e.g., smoking) such that responses are clustered up at 30, while others are more spread out through the range. For a given substance, any suggestions for an appropriate distribution? Maybe neg_binomial_2_log(mu,phi) T[,30]?

sakrejda · June 30, 2017, 2:29pm

Maybe this isn’t helpful, but I would take a step back and consider the process. Since you mention smoking: you can smoke every day pretty easily, but if you’re using alcohol daily your schedule looks very different from that of the guy doing a pack of cigarettes a day. To me that suggests that modeling the time commitment might put these different things on a more common scale. But short of deeper thinking, yeah, a truncated negative binomial sounds fine as a base.
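
As a rough illustration of the truncated negative binomial mentioned above, here is a minimal Python sketch of the PMF restricted to 0..30 and renormalised. It assumes the mean/dispersion parameterisation (mean mu, variance mu + mu^2/phi) with mu on the natural scale, whereas neg_binomial_2_log in Stan takes the log of the mean. The function names and the parameter values are made up for the example.

```python
import math

def neg_binomial_2_logpmf(y, mu, phi):
    """Log PMF of the negative binomial with mean mu and dispersion phi."""
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + y * math.log(mu / (mu + phi))
            + phi * math.log(phi / (mu + phi)))

def truncated_nb_pmf(mu, phi, upper=30):
    """PMF on 0..upper, renormalised by P(Y <= upper) (upper truncation)."""
    weights = [math.exp(neg_binomial_2_logpmf(y, mu, phi))
               for y in range(upper + 1)]
    total = sum(weights)  # this is P(Y <= upper) under the untruncated model
    return [w / total for w in weights]

# Hypothetical parameter values, chosen only to show the shape near the bound.
pmf = truncated_nb_pmf(mu=25.0, phi=2.0)
print(f"P(Y=0) = {pmf[0]:.4f}, P(Y=30) = {pmf[30]:.4f}")
```

Printing the whole list for a few (mu, phi) pairs is a quick way to see how much mass the truncated model can actually place at 30.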

aaronjg · June 30, 2017, 8:31pm

I would suggest a beta-binomial distribution with n = 30. With the truncated negative binomial you won’t be able to put as much mass on 30. The beta-binomial distribution has the nice property that it can take on a bimodal shape with mass at each end, which sounds like what you have.
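
To make the bimodal shape concrete, here is a minimal sketch of the beta-binomial PMF with n = 30. The shape parameters alpha = 0.3 and beta = 0.2 are made-up values; choosing both below 1 is what gives the U shape with mass piled up at 0 and 30.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(n, alpha, beta):
    """PMF of the beta-binomial on 0..n."""
    pmf = []
    for y in range(n + 1):
        log_choose = (math.lgamma(n + 1) - math.lgamma(y + 1)
                      - math.lgamma(n - y + 1))
        log_p = (log_choose + log_beta(y + alpha, n - y + beta)
                 - log_beta(alpha, beta))
        pmf.append(math.exp(log_p))
    return pmf

# Hypothetical shape parameters; both < 1 gives mass at each end.
pmf_bb = beta_binomial_pmf(n=30, alpha=0.3, beta=0.2)
print(f"P(0) = {pmf_bb[0]:.3f}, P(15) = {pmf_bb[15]:.3f}, P(30) = {pmf_bb[30]:.3f}")
```

With alpha and beta both above 1 the same function gives a single interior mode, so one family covers both the clustered and the more spread-out substances.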

Andre_Pfeuffer · July 1, 2017, 2:10am

Just because you observe counts only up to 30 doesn’t imply the distribution is truncated.
It sounds to me like your data are not homogeneous, e.g. for smoking you get totally different
counts than for the other substances.

But if you want to throw a single distribution at your data that fits it all, I’d recommend the Generalized
Poisson Lindley Distribution. I once coded it up in Stan:

http://www.ajs.or.at/index.php/ajs/article/view/vol44-4-3/92

arya · July 1, 2017, 3:18am

@Andre_Pfeuffer never heard of the Poisson Lindley Distribution. That’s pretty cool, thanks for that. I will remember that next time I have count data.

I was wondering how people come up with new distributions usually. Is it possible to come up with an exponential family distribution that’s heavy-tailed?

Andre_Pfeuffer · July 1, 2017, 3:44am

Take a look at:
https://reference.wolfram.com/language/ref/CompoundPoissonDistribution.html
Here lambda in the Poisson follows an exponential distribution: Y ~ poisson(lambda),
lambda ~ exponential(mu).
This can be written as an integral, with lambda marginalized out: https://en.wikipedia.org/wiki/Marginal_distribution
Thus, you may use the integrated-out formula that you see in the papers, or
in Stan you could also write lambda ~ exponential(mu);
The closed form is more beautiful and more efficient.
Here are derivations of the negative binomial etc. Note that they are different, since
they are based on exp(mu)*lambda, see (1) in:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4180062/#FD11
Thus the parameters of distributions are considered to have a distribution as well.
People combine them and get fancy about “their” new distribution.
You could also model it as exp(mu)*lambda, with lambda ~ exponential(mu2).
It’s worth thinking about the differences. It’s also worth doing a qqplot of your fitted data
against the newly developed distribution.
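
For the simplest case above, Y ~ poisson(lambda) with lambda ~ exponential(mu), the integral has a closed form: assuming mu is a rate (as in Stan's exponential), P(Y = k) = mu / (1 + mu)^(k + 1), which is a geometric distribution. The sketch below checks that closed form against a plain midpoint-rule integration. The function names, the value mu = 0.5, and the integration settings are arbitrary choices for the example.

```python
import math

def geometric_marginal_pmf(k, mu):
    """Closed-form P(Y = k) for Y ~ Poisson(lam), lam ~ Exponential(rate=mu)."""
    return mu / (1.0 + mu) ** (k + 1)

def numeric_marginal_pmf(k, mu, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the integral over lam of
    Poisson(k | lam) * Exponential(lam | rate=mu)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h  # midpoint of the i-th subinterval, always > 0
        log_pois = k * math.log(lam) - lam - math.lgamma(k + 1)
        total += math.exp(log_pois) * mu * math.exp(-mu * lam) * h
    return total

mu = 0.5  # hypothetical rate
for k in range(4):
    print(k, geometric_marginal_pmf(k, mu), numeric_marginal_pmf(k, mu))
```

The two columns should agree closely, which is the sense in which the integrated-out formula and the explicit lambda ~ exponential(mu) statement describe the same model for Y.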

betanalpha · July 1, 2017, 7:20pm

Truncation can often have awkward effects on distributions unless you can pair it with informative priors that keep the bulk of the distribution away from the truncation point. Another option is just to model the 30 options with a simplex, perhaps with an informative Dirichlet prior. If you can avoid multimodality then this would give you the most flexibility.
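
As a rough sketch of the simplex idea: if each possible response is its own category (responses 0 through 30 give 31 categories) and the category probabilities get a Dirichlet prior, the posterior is again Dirichlet with the observed counts added to the prior concentrations. The responses and the flat prior below are made up for illustration; an informative prior would use unequal concentrations.

```python
def dirichlet_posterior_mean(responses, prior_alpha):
    """Posterior mean of the category probabilities under a
    Dirichlet prior with a categorical likelihood."""
    counts = [0] * len(prior_alpha)
    for r in responses:
        counts[r] += 1  # tally how often each day-count was reported
    posterior_alpha = [a + c for a, c in zip(prior_alpha, counts)]
    total = sum(posterior_alpha)
    return [a / total for a in posterior_alpha]

# Made-up responses (days of use out of 30) and a flat prior over 31 categories.
responses = [30, 30, 0, 12, 30, 7, 0, 30]
prior_alpha = [1.0] * 31
theta_mean = dirichlet_posterior_mean(responses, prior_alpha)
print(f"posterior mean P(30) = {theta_mean[30]:.3f}")
```

This places no shape constraint on the distribution at all, which is where the flexibility comes from; the cost is 30 free parameters per substance, so the prior carries a lot of weight with small samples.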

jeremy.koster · July 1, 2017, 11:45pm

I know the clustering at the maximum is a consideration. As a general comment, though, the denominator is known in this case, so a binomial model isn’t ruled out.
