I’m presently engaged in researching my thesis, focusing on the connection between poor mental well-being and economic progress, specifically examining its impact through the lens of absenteeism.
Absenteeism, the number of days employees were absent due to illness in the past year, is my dependent variable. It is a count ranging from 0 to 360 days, and notably, a large proportion of the values are zero.
Given the likelihood of overdispersion, I’m considering the negative binomial model. However, I’m uncertain about the most suitable count data model for my research. Should I use a hurdle model instead?
Additionally, I’m unsure whether the theoretical model should be articulated like a linear regression or whether count data requires a different approach.
My data is cross-sectional and case-based. I kindly seek guidance in choosing the appropriate model and clarifying the structure for my theoretical model. Your insights are greatly appreciated.
I don’t think anyone on the forum can adequately answer all of your questions. You are the expert on your topic, and you have the data and the study design, so you will have to think about and decide on the best analysis approach given the assumptions in your study design and data and the prior knowledge on your research topic. I will try to answer a few of them though :-)
The negative binomial distribution is often a good choice of response distribution for overdispersed count data. As for the “significant proportion of zeroes” that you describe in your data, you may or may not need to model those beyond what the negative binomial can already capture. It really depends on what you think the generative model of your data is. A hurdle model assumes that all of the zeroes come from a separate process, with the positive counts coming from a zero-truncated count distribution. A zero-inflated model assumes that some of the zeroes come from another process and some from the process described by the negative binomial distribution.
For example, suppose that absenteeism is relatively low, with many zeroes and a small mean, but is very overdispersed. If you fit a negative binomial model, you might find that it models the zeroes well enough, because the mean is low and the dispersion is large (in brms this would be a small mean and a small shape parameter). Negative binomial distributions with these characteristics produce a lot of zeroes anyway.
However, suppose that you know that most of the zeroes come from men rather than women, because women are more likely to take care of sick children or take maternity leave. Now you might want to use a zero-inflated model and model the zero-inflation part using gender, or perhaps a combination of gender and age. All of this really depends upon your theoretical model of the data generation process.
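To make the point about zeroes concrete, here is a minimal sketch in plain Python that compares the probability of a zero under a few candidate models. It assumes the mean/shape parameterization of the negative binomial (variance = mu + mu^2 / shape), which is the one brms uses. The function names and the example values (mean 5, shape 0.2, extra-zero probability 0.3) are made up for illustration and are not taken from your data.

```python
import math


def poisson_zero_prob(mu):
    """P(Y = 0) for a Poisson distribution with mean mu."""
    return math.exp(-mu)


def nb_zero_prob(mu, shape):
    """P(Y = 0) for a negative binomial with mean mu and shape parameter
    `shape` (variance = mu + mu**2 / shape)."""
    return (shape / (shape + mu)) ** shape


def zi_nb_zero_prob(mu, shape, zi):
    """P(Y = 0) for a zero-inflated negative binomial: a zero is either a
    structural zero (probability zi) or an ordinary negative binomial zero."""
    return zi + (1.0 - zi) * nb_zero_prob(mu, shape)


def hurdle_zero_prob(hu):
    """P(Y = 0) for a hurdle model: every zero comes from the hurdle part,
    so the count distribution's parameters do not enter at all."""
    return hu


if __name__ == "__main__":
    mu, shape = 5.0, 0.2  # hypothetical values: modest mean, strong overdispersion
    print("Poisson           :", round(poisson_zero_prob(mu), 4))
    print("Negative binomial :", round(nb_zero_prob(mu, shape), 4))
    print("Zero-inflated NB  :", round(zi_nb_zero_prob(mu, shape, zi=0.3), 4))
    print("Hurdle            :", round(hurdle_zero_prob(hu=0.3), 4))
```

With these hypothetical values the plain negative binomial already puts about 0.52 of its mass on zero, while a Poisson with the same mean puts less than 0.01 there. So a large share of zeroes is not, by itself, evidence that you need a hurdle or zero-inflated model; comparing the observed share of zeroes with what the fitted model implies is one way to check.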
Generalized linear models let you write your model as linear in the predictors on the scale of the link function (for example, the log of the expected count when a negative binomial model uses a log link), so the structure looks like a linear regression on that transformed scale. You could also have a non-linear formula for your model if needed. It all really just depends on your theoretical model of the data generation process. For some examples, see the brms vignette “Estimating Distributional Models with brms”, which includes an example of zero-inflated models, or the vignette “Estimating Non-Linear Models with brms”, which includes examples of non-linear models.
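As a rough sketch of how such a model could be written down, here is a negative binomial regression with a log link, plus an optional zero-inflation part. The symbols are my own assumptions for illustration: $y_i$ is the number of days absent for person $i$, $x_i$ is a mental well-being measure, $g_i$ is gender, $\phi$ is the shape parameter, and $\pi_i$ is the probability of a structural zero. Your actual predictors and structure should come from your theory.

```latex
\begin{align*}
  y_i &\sim \mathrm{NegBinomial}(\mu_i, \phi),
      & \mathrm{Var}(y_i) &= \mu_i + \frac{\mu_i^2}{\phi}, \\
  \log(\mu_i) &= \beta_0 + \beta_1 x_i
      && \text{(linear on the log scale)} \\[6pt]
  \intertext{Optional zero-inflation part:}
  \Pr(y_i = 0) &= \pi_i + (1 - \pi_i)
      \left(\frac{\phi}{\phi + \mu_i}\right)^{\phi}, \\
  \operatorname{logit}(\pi_i) &= \gamma_0 + \gamma_1 g_i .
\end{align*}
```

The right-hand sides look like ordinary linear regressions. The difference is that they describe $\log(\mu_i)$ and $\operatorname{logit}(\pi_i)$ rather than $y_i$ itself, so $\exp(\beta_1)$ is read as a multiplicative change in the expected number of days absent per unit of $x_i$.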
Which model is most appropriate is something you will have to decide for yourself :-) If you describe your theoretical generative process with a DAG or some other type of description, and share some example data, then maybe someone on the forum will have some insight into the modeling.
I hope that helps some.