Poisson or not?

Dear all,

I have a data set where I want to predict the outcome Effort depending on a number of predictors (8 in total).

My assumption is that I should use family=gaussian so,

fit <- brm(bf(Effort  ~ Input + Output + Enquiry + File + Interface + Added + 
             Changed + Deleted), 
           family=gaussian,
           prior=c(prior(normal(0,1), class=b)),
           cores=4, chains=4, data=foo)

ran fine, but when I did a posterior predictive check it looked ridiculous (Effort and all predictors are >=0 in the data set). Here is the data (N=555).

So, I thought that I was mistaken and picked Poisson instead, but many divergences and BFMIs with a simple model such as Effort ~ 1 + Input, didnā€™t make me any wiser (and I even checked the results from using zi).

Can someone please explain to me what is going on. Iā€™m missing something obvious but Iā€™m sure some of you know what that might be.

Not that it matters but:

> packageVersion("brms")
[1] ā€˜2.7.0ā€™

Maybe try Gamma with log-link. The Gamma has heteroskedastic variance with \phi \mu^2, which could make sense from a quick glance at the plot you gave.

Also, the log-normal could make sense, as it is similar to the Gamma. Idk how the log-normal is implemented in brms though and the Gamma is nice, because it fits the GLM framework.

1 Like

Can you explain your response variable Effort a bit more? Without understanding the nature of the response, itā€™s hard to give any advice on a good distribution.

Thanks a bunch for the interest.

Effort is actual effort measured in person-hours for a complex (or not) system being developed.
The IVs (input, output, etc.) are units of measurement to express the amount of business functionality an information system (as a product) provides to a user.

All data are assembled and then quality assessed by an ā€œexpertā€ - the data we see here are of the ā€œhighestā€ qualityā€¦ There are lots of zeros in the set and I intend to check what happens when I assume zeros are missing values and impute them.

This is indeed some sort of ā€œcountā€ data (number of hours) so count models could be worth a try. If you have many zeros, you might consider zero-inflated or hurdle models, as well. See vignette("brms_distreg") for some examples.

Good @paul.buerkner, then I wasnā€™t way off. I tried family=zero_inflated_poisson,

fit <- brm(bf(Effort  ~ 1 + Input + Deleted, zi ~ Deleted),
                       family = zero_inflated_poisson,
                       prior=c(prior(normal(0,1), class=b)),
                       cores=4, chains=4, data = foo)

where the zi is on Deleted (which contains many zeros). It pukes:

1: There were 6 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
2: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See

Next up, Iā€™ll also check Gamma and Log_normal, before going for hurdle models (which Iā€™ve never used).

For Gamma I canā€™t even start sampling (even if setting inits=0). Gamma(log) with inits=0 actually looks ok, but Iā€™ll compare with a hurdle model (since I want to learn how to use them also).

Sorry, I overlooked that Effort can be equal to zero. Then you would need to do a Gamma hurdle oder Log-normal hurdle model, since they have 0 probability density at zero.

Again, my guess would be that overdispersion is probably one issue the poisson has in your case.

You could also try with a negative binomial - a sort of mixture of poisson and gamma - with Zero-Inflation. This should take care of overdispersion. Another possibility is to fit a poisson with varying intercepts (I think this is described in the German/Hill book somewhere) - in my work I often found that a poisson regression a with varying intercepts is incredibly similar to the negative binomial. Which is not that surprising if you think about itā€¦

Well, Max, Iā€™ve looked into several likelihoods now and they all seem to lead to unstable sampling (BFMIs, divergences etc.) when I add predictors. Iā€™ll try out the varying intercept Poisson too. Believe me, Iā€™ll come back for more advice if this doesnā€™t clarify :)

Had a quick look at the data that you have posted. I noticed that there are no zeros in Effort so hurdle and zero-inflation wonā€™t ā€œworkā€ - theyā€™ only fit your prior. Maybe I got something wrong here.

More importantly though, your predictor matrix is not full rank. Removing Deleted (hehe) should make things work more smoothly. See below for a simple example on how to spot stuff like that.

Another problem are the weird scales that your predictors are on. The estimated coefficients are really small if you fit something with a log-link. Using a QR transformation should help - not sure if thatā€™s in brms.

This

library(rstanarm)
fit <- stan_glm.nb(Effort ~ Input + Output + Enquiry + File + Interface + Added + Changed, data = data, QR = TRUE)

worked fine.

Hope that helps! :)

I would strongly suggest to scale/standardize your predictors to avoid problem with the log-scale as indicated by @Max_Mantei.

Big thank you to both of you. Iā€™ve noticed that standardizing/scaling the predictor has a great effect, and just as @Max_Mantei mentioned, I also noticed that Effort has no zeros :) Iā€™ll sit later this afternoon to try to understand this better. The Deleted predictor has tripped me up several times so I suspected that something was fishy.

@Max_Mantei my y's are much lower than my \tilde{y}'s. In this plot Iā€™ve plotted three density curves, y, \tilde{y}_{MLM}, and \tilde{y}. Iā€™ve used negbin for both \tilde{y}, but for the MLM Iā€™ve added an IV which is a rating (A-D) for which I estimate \sigma and \mu, i.e., using hyperparameters. New data file here.

Questions:

  1. Why are the curves just straight lines and not nice densities, as when I plot a model w/ Poisson likelihood or when plotting my y like this, is it due to the nature of negbin?
  2. My MLM model estimates are still much higher than my original data. Do you know of any other ways to make them fit better?

Hey,

Sorry for the late reply. Honestly, I donā€™t know whatā€™s up here. But it seems that the first plot is broken. Maybe you could try histograms?

Do you use simulations from the posterior y_{rep} or from the linear predictor?

Hi, I use posterior_predict so,

y <- orig_data_set$Effort
y_rep <- posterior_predict(fit)
ppc_dens_overlay(y, y_rep[1:50,]) + coord_cartesian(xlim=(0,100000))

and the result is here (without coord_cartesian it looks like crap :)

It looks very funky with straight lines breaking off like that so something is not alrightā€¦ At least y and y_{rep} are fairly inline, but it looks too good nowā€¦ I use:

fit_nb <- brm(bf(Effort  ~ 1 + Input + Output + Enquiry + File + 
                   Interface + Added + Changed + (1 | DQR)),
              family = negbinomial,
              prior = c(prior(normal(0,1), class=b),
                        prior(normal(0,1), class=Intercept)),
              cores=4, chains=4, data = dataset, 
              control = list(adapt_delta=0.99, max_treedepth=13)
)

witht the following dataset.

Could it be that you are getting straight lines because you are zooming in a lot with coord_cartesian? Even if the distributions would be truly straight, the density plot should still not show that.

1 Like

Your y has a huge scale, so I recommend plotting on the log scale. The glitch you see is probably due to the problem @Ax3man mentioned.

Here are some alternatives (all with library(tidyverse) loaded):
[Warning: code is super ugly/hacky and the plots are not polished, but I guess you get the idea.]

cbind(y, y_rep = t(y_rep)) %>% 
   as_tibble() %>% 
  gather(rep,y_rep,-y) %>% 
  mutate(rep = as.numeric(str_remove(rep, "V"))-1) %>% 
  filter(rep <= 20) %>% 
  mutate(rep = paste0("replication: ", rep)) %>% 
  ggplot(aes(x=y,y=y_rep)) + 
    geom_point(alpha = 0.5) + 
    scale_x_log10() + 
    scale_y_log10() + 
    geom_abline(slope = 1,intercept = 0) + 
    facet_wrap(~rep)


cbind(y, y_rep = t(y_rep)) %>% 
  as_tibble() %>% 
  gather(var,val) %>% 
  mutate(rep = as.numeric(str_remove(var, "V"))-1, 
         rep = if_else(is.na(rep), 0, rep), 
         var = if_else(rep == 0, var, paste0("replication: ", rep))) %>% 
  filter(rep <= 20) %>% 
  ggplot(aes(x=val)) + 
    geom_histogram() + 
    scale_x_log10() + 
    facet_wrap(~var)


cbind(y, y_rep = t(y_rep)) %>% 
  as_tibble() %>% 
  gather(var,val) %>% 
  mutate(rep = as.numeric(str_remove(var, "V"))-1, 
         rep = if_else(is.na(rep), 0, rep), 
         var = if_else(rep == 0, var, paste0("replication: ", rep))) %>% 
  filter(rep <= 20) %>% 
  ggplot(aes(x=val)) + 
    geom_density() + 
    scale_x_log10() + 
    facet_wrap(~var)


cbind(y, y_rep = t(y_rep)) %>% 
  as_tibble() %>% 
  gather(rep,y_rep,-y) %>% 
  mutate(rep = as.numeric(str_remove(rep, "V"))-1) %>% 
  filter(rep <= 20) %>% 
  mutate(rep = paste0("replication: ", rep)) %>% 
  ggplot() + 
    scale_x_log10() + 
    geom_density(aes(x=y_rep, color = rep), alpha = 0.6) + 
    geom_density(aes(x=y), size = 1.5)

2 Likes

Much appreciated guys. Things look much better now :)