General Question: How can you improve a model after seeing, via the posterior predictive distribution, that it poorly recovers your data? Most of the recommendations I hear are to change the likelihood distribution. In this case (as I explain below) my y variable is on the proportion scale, so I think my choice of a beta distribution for y is valid, especially considering how flexible the beta is.
Data/Model: I am fitting a multilevel beta regression model with brms (version 2.17.0). The outcome is continuous proportion data that have been transformed (per Smithson & Verkuilen, 2006: (y*(n - 1) + 0.5)/n) to fit the open (0, 1) interval. 6% of the untransformed data were zeros, and 0.6% were ones. Each observation is the proportion of daily energy derived from one species of plant by a foraging animal that generally eats many species of plants throughout the day. Predictor variable x is a continuous variable ranging from about 0 to 14 (a metric on the percentage scale estimating forest productivity). The data were collected over about 15 years, and all observations in a given month share the same x value. There are 114 id groups, which are individual animals. sex is a binary variable with 1 coded for females, and time_feed is the total amount of time spent feeding in a given day (scaled). The number of observations is over 5,700.
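For clarity, here is a minimal sketch of the transformation quoted above. It assumes n is the number of observations, which is how I read Smithson & Verkuilen; the function name squeeze_unit and the example values are made up for illustration and are not my real data.

```r
# Squeeze proportions from the closed interval [0, 1] into the open
# interval (0, 1) using (y * (n - 1) + 0.5) / n.
# Assumption: n is the number of observations.
squeeze_unit <- function(y, n = length(y)) {
  stopifnot(all(y >= 0 & y <= 1))
  (y * (n - 1) + 0.5) / n
}

y_raw <- c(0, 0.25, 0.5, 1)   # made-up values, including an exact 0 and 1
y_sq  <- squeeze_unit(y_raw)  # no value is exactly 0 or 1 any more
range(y_sq)
```

With thousands of observations the shift is tiny, so the zeros end up extremely close to the lower boundary rather than well inside the interval.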
Here is my model structure:
beta_fit <- brm(
  formula = y ~ x + sex + I(time_feed * x) + (x | id),
  data = data,
  family = "beta",
  prior = c(
    prior(student_t(3, 0, 2.5), class = "Intercept"),
    prior(normal(0, 1), class = "b"),
    prior(gamma(0.01, 0.01), class = "phi")
  ),
  chains = 4, iter = 4000, warmup = 1000, cores = 2, seed = 1234,
  control = list(adapt_delta = 0.95)
)
The model converges and there are no alarming Pareto k values, but when looking at the posterior predictive distribution, it seems like the model predictions don’t recover the actual data very well, as seen here:
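For anyone who wants to reproduce this kind of check, a density-overlay posterior predictive plot can be generated roughly as follows. This is only a sketch: ppc_overlay is a hypothetical wrapper name, and the choice of 100 draws is an assumption rather than a record of the exact call behind the plot.

```r
# Hypothetical helper: overlay densities of posterior predictive draws
# on the density of the observed outcome for a fitted brms model.
ppc_overlay <- function(fit, ndraws = 100) {
  brms::pp_check(fit, type = "dens_overlay", ndraws = ndraws)
}

# Usage (requires the fitted model object from the brm() call):
# ppc_overlay(beta_fit)
```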
Seeking Advice: Any suggestions for getting a better fit? Or is it okay to move forward, given that the general shape of the posterior predictive distribution follows my data and that the direction and magnitude of the relationship are of most interest to me? Could my data just be too noisy? Here is a plot with 100 draws from the expectation of the posterior predictive distribution overlain on a scatterplot of the data (ignoring the interaction for now):
Thanks for your help!