Help choosing a family for modelling

I have a count data including stand and end day of rehabilitation. All patients received rehabilitation. As the data includes some zeros, I replace those with 0.01-s as it allows running lognormal models.

library(tidyverse)

data = read_csv("https://www.dropbox.com/scl/fi/ok24el8mox7zcal555ynt/data.csv?rlkey=5al2ap9rn3xwoejec1z7fd964&dl=1") %>% 
  sample_frac(0.1) %>% #reduce data size
  replace(. == 0, 0.01) #replace zero values

data

image

Variables are distributed as follows

Density plots

  ggplot()+
  geom_density(data = data, aes(start_day), color = "red")+
  geom_vline(xintercept =  median(data$start_day), color = "red", linetype = 3)+
  geom_density(data = data, aes(end_day), color = "blue")+
  geom_vline(xintercept = median(data$end_day), color = "blue", linetype = 3)+
  ggtitle("Start day red, end day blue and lines for medians")+
  scale_x_continuous(breaks =seq(0, 365, 20))

Tried two lognormal models, but their fits do not seem that good

library(brms)

start_lognormal = brm(data = data, start_day ~ 1, lognormal)

start_lognormal %>% pp_check() + xlim(0,200)

end_lognormal = brm(data = data, end_day ~ 1, lognormal)

end_lognormal %>% pp_check() + xlim(0,200)

Any recommendations to get a better fit?

Any other ideas how to model such distributions?

Hi @di4oi4, do you mean count data in the literal sense that start_day and end_day represent countable quantities? It looks like they do, and your comment about zeros is consistent with that.

Did you try the Poisson or negative binomial families?

2 Likes

Thanks for the suggestion! Tried these but the fits are still not that good. Any other options?

Well those are just intercept-only models, right? Presuming that you will want to implement some predictors, these PPC are a decent starting point but might not be that informative yet. Based on the form of the data I suspect the negative binomial might be a reasonable starting point.

3 Likes

Agreed. Negative binomial sounds like a great foundation, presuming you do not change any of those zero values to 0.01-s. Speaking of which, if you have a lot of zeros, you might consider a zero-inflated negative binomial model (link).

3 Likes

Relatedly, would you mind showing us the output of

pp_check(my_nb_fit, type = "hist", binwidth = 1, ndraw = 8)

where my_nb_fit is your negative-binomial model? Here’s my reasoning: If the histograms of simulated data have notably fewer 0’s than the histogram of the observed data, that’s your hint you want a zero-inflated model.

1 Like

I agree with @Solomon here. I would also suggest a negative (ZI) binomial model, if you want some “evidence” for this choice, you could run a LOO-CV to check whether it fits the data better than the poisson model (kinda the same as visually checking the pp_check like Solomon suggested), the same can be done for the zero-inflated one. See page 13 of this guide for some intuiton/references: https://compass.onlinelibrary.wiley.com/doi/pdf/10.1111/lnc3.12439
Poisson regression for linguists: A tutorial
introduction to modelling count data with brms

1 Like