Poisson or not?

torkar · January 15, 2019, 3:05pm

Dear all,

I have a data set where I want to predict the outcome Effort depending on a number of predictors (8 in total).

My assumption is that I should use family=gaussian so,

fit <- brm(bf(Effort  ~ Input + Output + Enquiry + File + Interface + Added + 
             Changed + Deleted), 
           family=gaussian,
           prior=c(prior(normal(0,1), class=b)),
           cores=4, chains=4, data=foo)

ran fine, but when I did a posterior predictive check it looked ridiculous (Effort and all predictors are >=0 in the data set). Here is the data (N=555).

So, I thought that I was mistaken and picked Poisson instead, but many divergences and BFMIs with a simple model such as Effort ~ 1 + Input, didn’t make me any wiser (and I even checked the results from using zi).

Can someone please explain to me what is going on. I’m missing something obvious but I’m sure some of you know what that might be.

Not that it matters but:

> packageVersion("brms")
[1] ‘2.7.0’

Max_Mantei · January 15, 2019, 3:36pm

Maybe try Gamma with log-link. The Gamma has heteroskedastic variance with \phi \mu^2, which could make sense from a quick glance at the plot you gave.

Also, the log-normal could make sense, as it is similar to the Gamma. Idk how the log-normal is implemented in brms though and the Gamma is nice, because it fits the GLM framework.

paul.buerkner · January 15, 2019, 3:39pm

Can you explain your response variable Effort a bit more? Without understanding the nature of the response, it’s hard to give any advice on a good distribution.

torkar · January 15, 2019, 4:08pm

Thanks a bunch for the interest.

Effort is actual effort measured in person-hours for a complex (or not) system being developed.
The IVs (input, output, etc.) are units of measurement to express the amount of business functionality an information system (as a product) provides to a user.

All data are assembled and then quality assessed by an “expert” - the data we see here are of the “highest” quality… There are lots of zeros in the set and I intend to check what happens when I assume zeros are missing values and impute them.

paul.buerkner · January 15, 2019, 4:11pm

This is indeed some sort of “count” data (number of hours) so count models could be worth a try. If you have many zeros, you might consider zero-inflated or hurdle models, as well. See vignette("brms_distreg") for some examples.

torkar · January 15, 2019, 4:52pm

Good @paul.buerkner, then I wasn’t way off. I tried family=zero_inflated_poisson,

fit <- brm(bf(Effort  ~ 1 + Input + Deleted, zi ~ Deleted),
                       family = zero_inflated_poisson,
                       prior=c(prior(normal(0,1), class=b)),
                       cores=4, chains=4, data = foo)

where the zi is on Deleted (which contains many zeros). It pukes:

1: There were 6 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
2: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See

Next up, I’ll also check Gamma and Log_normal, before going for hurdle models (which I’ve never used).

torkar · January 15, 2019, 5:06pm

For Gamma I can’t even start sampling (even if setting inits=0). Gamma(log) with inits=0 actually looks ok, but I’ll compare with a hurdle model (since I want to learn how to use them also).

Max_Mantei · January 16, 2019, 9:04am

Sorry, I overlooked that Effort can be equal to zero. Then you would need to do a Gamma hurdle oder Log-normal hurdle model, since they have 0 probability density at zero.

Again, my guess would be that overdispersion is probably one issue the poisson has in your case.

You could also try with a negative binomial - a sort of mixture of poisson and gamma - with Zero-Inflation. This should take care of overdispersion. Another possibility is to fit a poisson with varying intercepts (I think this is described in the German/Hill book somewhere) - in my work I often found that a poisson regression a with varying intercepts is incredibly similar to the negative binomial. Which is not that surprising if you think about it…

torkar · January 16, 2019, 9:26am

Well, Max, I’ve looked into several likelihoods now and they all seem to lead to unstable sampling (BFMIs, divergences etc.) when I add predictors. I’ll try out the varying intercept Poisson too. Believe me, I’ll come back for more advice if this doesn’t clarify :)

Max_Mantei · January 16, 2019, 11:11am

Had a quick look at the data that you have posted. I noticed that there are no zeros in Effort so hurdle and zero-inflation won’t “work” - they’ only fit your prior. Maybe I got something wrong here.

More importantly though, your predictor matrix is not full rank. Removing Deleted (hehe) should make things work more smoothly. See below for a simple example on how to spot stuff like that.

Another problem are the weird scales that your predictors are on. The estimated coefficients are really small if you fit something with a log-link. Using a QR transformation should help - not sure if that’s in brms.

This

library(rstanarm)
fit <- stan_glm.nb(Effort ~ Input + Output + Enquiry + File + Interface + Added + Changed, data = data, QR = TRUE)

worked fine.

Hope that helps! :)

paul.buerkner · January 16, 2019, 11:17am

I would strongly suggest to scale/standardize your predictors to avoid problem with the log-scale as indicated by @Max_Mantei.

torkar · January 16, 2019, 12:29pm

Big thank you to both of you. I’ve noticed that standardizing/scaling the predictor has a great effect, and just as @Max_Mantei mentioned, I also noticed that Effort has no zeros :) I’ll sit later this afternoon to try to understand this better. The Deleted predictor has tripped me up several times so I suspected that something was fishy.

torkar · January 17, 2019, 8:27am

@Max_Mantei my y's are much lower than my \tilde{y}'s. In this plot I’ve plotted three density curves, y, \tilde{y}_{MLM}, and \tilde{y}. I’ve used negbin for both \tilde{y}, but for the MLM I’ve added an IV which is a rating (A-D) for which I estimate \sigma and \mu, i.e., using hyperparameters. New data file here.

Questions:

Why are the curves just straight lines and not nice densities, as when I plot a model w/ Poisson likelihood or when plotting my y like this, is it due to the nature of negbin?
My MLM model estimates are still much higher than my original data. Do you know of any other ways to make them fit better?

Max_Mantei · January 22, 2019, 11:18am

Hey,

Sorry for the late reply. Honestly, I don’t know what’s up here. But it seems that the first plot is broken. Maybe you could try histograms?

Do you use simulations from the posterior y_{rep} or from the linear predictor?

torkar · January 23, 2019, 12:59pm

Hi, I use posterior_predict so,

y <- orig_data_set$Effort
y_rep <- posterior_predict(fit)
ppc_dens_overlay(y, y_rep[1:50,]) + coord_cartesian(xlim=(0,100000))

and the result is here (without coord_cartesian it looks like crap :)

It looks very funky with straight lines breaking off like that so something is not alright… At least y and y_{rep} are fairly inline, but it looks too good now… I use:

fit_nb <- brm(bf(Effort  ~ 1 + Input + Output + Enquiry + File + 
                   Interface + Added + Changed + (1 | DQR)),
              family = negbinomial,
              prior = c(prior(normal(0,1), class=b),
                        prior(normal(0,1), class=Intercept)),
              cores=4, chains=4, data = dataset, 
              control = list(adapt_delta=0.99, max_treedepth=13)
)

witht the following dataset.

Ax3man · January 23, 2019, 10:54pm

Could it be that you are getting straight lines because you are zooming in a lot with coord_cartesian? Even if the distributions would be truly straight, the density plot should still not show that.

Max_Mantei · January 24, 2019, 12:08am

Your y has a huge scale, so I recommend plotting on the log scale. The glitch you see is probably due to the problem @Ax3man mentioned.

Here are some alternatives (all with library(tidyverse) loaded):
[Warning: code is super ugly/hacky and the plots are not polished, but I guess you get the idea.]

cbind(y, y_rep = t(y_rep)) %>% 
   as_tibble() %>% 
  gather(rep,y_rep,-y) %>% 
  mutate(rep = as.numeric(str_remove(rep, "V"))-1) %>% 
  filter(rep <= 20) %>% 
  mutate(rep = paste0("replication: ", rep)) %>% 
  ggplot(aes(x=y,y=y_rep)) + 
    geom_point(alpha = 0.5) + 
    scale_x_log10() + 
    scale_y_log10() + 
    geom_abline(slope = 1,intercept = 0) + 
    facet_wrap(~rep)

cbind(y, y_rep = t(y_rep)) %>% 
  as_tibble() %>% 
  gather(var,val) %>% 
  mutate(rep = as.numeric(str_remove(var, "V"))-1, 
         rep = if_else(is.na(rep), 0, rep), 
         var = if_else(rep == 0, var, paste0("replication: ", rep))) %>% 
  filter(rep <= 20) %>% 
  ggplot(aes(x=val)) + 
    geom_histogram() + 
    scale_x_log10() + 
    facet_wrap(~var)

cbind(y, y_rep = t(y_rep)) %>% 
  as_tibble() %>% 
  gather(var,val) %>% 
  mutate(rep = as.numeric(str_remove(var, "V"))-1, 
         rep = if_else(is.na(rep), 0, rep), 
         var = if_else(rep == 0, var, paste0("replication: ", rep))) %>% 
  filter(rep <= 20) %>% 
  ggplot(aes(x=val)) + 
    geom_density() + 
    scale_x_log10() + 
    facet_wrap(~var)

cbind(y, y_rep = t(y_rep)) %>% 
  as_tibble() %>% 
  gather(rep,y_rep,-y) %>% 
  mutate(rep = as.numeric(str_remove(rep, "V"))-1) %>% 
  filter(rep <= 20) %>% 
  mutate(rep = paste0("replication: ", rep)) %>% 
  ggplot() + 
    scale_x_log10() + 
    geom_density(aes(x=y_rep, color = rep), alpha = 0.6) + 
    geom_density(aes(x=y), size = 1.5)

torkar · January 24, 2019, 1:34pm

Much appreciated guys. Things look much better now :)

Topic		Replies	Views
Poisson model overestimating dispersion and inflating zeroes brms fitting-issues , poisson , phylogenetic	24	3868	August 26, 2020
Uncertain of model fit based on pp_check (brms, generalized additive model) Modeling brms	5	301	May 11, 2024
Zero_inflated_poisson with brms Modeling rstan , fitting-issues , brms	4	1622	June 14, 2021
Posterior predictive check looks weird - what can I do? Modeling posterior-predictive , brms	16	4135	April 24, 2024
Prior posterior distribution for Zero inflated Poisson model with Random Effects Modeling	12	2756	November 6, 2017

Poisson or not?

Related topics