I have a count data including stand and end day of rehabilitation. All patients received rehabilitation. As the data includes some zeros, I replace those with 0.01-s as it allows running lognormal models.
library(tidyverse)
data = read_csv("https://www.dropbox.com/scl/fi/ok24el8mox7zcal555ynt/data.csv?rlkey=5al2ap9rn3xwoejec1z7fd964&dl=1") %>%
sample_frac(0.1) %>% #reduce data size
replace(. == 0, 0.01) #replace zero values
data
Variables are distributed as follows
Density plots
ggplot()+
geom_density(data = data, aes(start_day), color = "red")+
geom_vline(xintercept = median(data$start_day), color = "red", linetype = 3)+
geom_density(data = data, aes(end_day), color = "blue")+
geom_vline(xintercept = median(data$end_day), color = "blue", linetype = 3)+
ggtitle("Start day red, end day blue and lines for medians")+
scale_x_continuous(breaks =seq(0, 365, 20))
Hi @di4oi4, do you mean count data in the literal sense that start_day and end_day represent countable quantities? It looks like they do, and your comment about zeros is consistent with that.
Did you try the Poisson or negative binomial families?
Well those are just intercept-only models, right? Presuming that you will want to implement some predictors, these PPC are a decent starting point but might not be that informative yet. Based on the form of the data I suspect the negative binomial might be a reasonable starting point.
Agreed. Negative binomial sounds like a great foundation, presuming you do not change any of those zero values to 0.01-s. Speaking of which, if you have a lot of zeros, you might consider a zero-inflated negative binomial model (link).
Relatedly, would you mind showing us the output of
pp_check(my_nb_fit, type = "hist", binwidth = 1, ndraw = 8)
where my_nb_fit is your negative-binomial model? Here’s my reasoning: If the histograms of simulated data have notably fewer 0’s than the histogram of the observed data, that’s your hint you want a zero-inflated model.
I agree with @Solomon here. I would also suggest a negative (ZI) binomial model, if you want some “evidence” for this choice, you could run a LOO-CV to check whether it fits the data better than the poisson model (kinda the same as visually checking the pp_check like Solomon suggested), the same can be done for the zero-inflated one. See page 13 of this guide for some intuiton/references: https://compass.onlinelibrary.wiley.com/doi/pdf/10.1111/lnc3.12439
Poisson regression for linguists: A tutorial
introduction to modelling count data with brms