Fitting count data with negative binomial - long tail

Hi, I am having difficulties setting up the model and understanding what i can do. As background I beach clean-up events where we recorded the total number of counts of debris - which is my response variable and am testing different variables that can affect the accumulation of debris among the different sites. For each event we collected the effort (number of volunteers and Distance sampled in meters) which in the model below I have included as an offset (‘Dist.Vol’ = number of volunteers * Distance sampled in meters).

See how my data looks:

str(data1)
'data.frame':	1935 obs
 $ Event ID            : chr  "2128" "5111" "6236" "7082" ...
 $ Site                : Factor w/ 125 levels "Ammunition Jetty Coogee, WA",
 $ Date                : chr  "2011/10/16" "2014/10/12" "2015/08/16" 
 $ DayIntSite          : num  0 1092 308 189 160 ...
 $ Year                : num  2011 2014 2015 2016 2016 ...
 $ Total Debris            : num  1615 801 3303 2130 17534 ...
 $ number of volunteers          : int  13 18 19 75 120 48 30 82 95 3 ...
 $ Distance sampled in m          : num  170 180 300 2000 3500 503 1000 
 $ Dist.Vol            : num  2210 3240 5700 150000 420000 ...
 $ BACKPROX_Va         : Factor w/ 8 levels "Aeolians and Sheets",..: 6 6 6 
 $ BACKDIST_Va         : Factor w/ 11 levels "Aeolian Sand-Sheets",..: 11 11 
 $ slope_mean          : num  54 54 54 54 54 ...
 $ avg_Thgt            : num  3.44 1.98 1.76 1.79 2.07 ...
 $ majority_dir16      : chr  "SW" "WSW" "WSW" "WSW" ...
 $ tide_mean           : num  0.965 0.88 0.809 0.708 0.829 ...
 $ Year1               : Factor w/ 12 levels "2011","2012",..: 1 4 5 6 6 6 7 7 8 8 ...

I have tried the negative binomial with defaut priors and this is the result of the pp_check:

brms_gam_time_nb_ALL <- brm(Total ~ tide_mean + slope_mean + avg_Thgt + majority_dir16 + BACKPROX_Va + BACKDIST_Va + DayIntSite + (1 | Year1) + (1 | Site) + offset(log(Dist.Vol)), data = data1, family = negbinomial())

pp_check(brms_gam_time_nb_ALL ) 

it has a loooooong tail. When zoomed in looks like this:

I have tried a few models but i do not have knowledge to change the priors on my own. I tried the hurdle negative binomial and this also seemed weird. See the result of the pp_check below.

brms_gam_time_hurdlenb <- brm(Total ~ tide_mean + slope_mean + avg_Thgt + majority_dir16 + BACKPROX_Va + BACKDIST_Va + DayIntSite + (1 | Year1) + (1 | Site) + offset(log(Dist.Vol)), data = data1, family = hurdle_negbinomial())

pp_check(brms_gam_time_hurdlenb)

I decided to log transform the counts and use a gaussian but I am not sure if this is an ok thing to do. The model seems to fit better. see below. I know the best approach should have been using the neg binomial and not transform the counts.

model_brms_log_ALLxx <- brm(logTotal ~ tide_mean + slope_mean + avg_Thgt + majority_dir16 + BACKPROX_Va + BACKDIST_Va + DayIntSite + (1 | Year1) + (1 | Site) + offset(log(Dist.Vol)), data = data1, family = gaussian())

pp_check(model_brms_log_ALLxx)

I can show you the results summary if needed too. But my question is if there which avenue I should follow… The negative binomial in theory is the best option but is a way of improving the model? Is the log-transformed a wrong approach?

  • Worth saying that I also tried to include the offset this way: Total | rate(Dist.Vol) instead of log(effort).
  • Rhats for negative binomial and hurdle are not great (Rhat = 1.01 for most of the variables)while using the log-transformed they are perfect (Rhat=1) for all variables.

Thank you.
Ana

2 Likes

There are at least two things you IMHO should check:

  1. Is the problem in the residual variability after accounting for predictors (i.e. the outcome distribution is wrong) or is it specific to some subgroups of the data?

To check this, you’d want to do grouped PP checks. It might also be useful to focus on the stat_grouped PPC with variance or mean/variance (a.k.a. Fano factor) as statistics. If you see that for some soubgroups of the data (e.g. for sam values of BACKPROX_Va or low avg_Thgt) the model has too much variance/too heavy tails and for others it has too little variance/too light tails, it might be wortwhile to put those as predictors on the dispersion parameter (i.e. let the dispersion vary between observations).

If instead the problem is more or less the same in all subgroups that make sense for your data, it might indicate the problem is in the outcome distribution and you may need something that’s even more flexible than negative binomial. Poisson - LogNormal models are an example of an alternative that has some support in existing software (though would require a bit of hacking to get it running in brms). There are also even more exotic Poisson-XXX variants, but those tend to have poor software support.

Adding a random effect per observation (i.e. something like (1 | Event_ID)) could also help in this case.

  1. Make sure your offsets are OK. There are two meaningful ways to use offsets in a negative binomial model - either the offset acts on mean alone or it also scales the dispersion. See Scaling of the overdispersion in negative binomial models for discussion. If I recall correctly, using Total | rate(Dist.Vol) instead of an offset term should automatically scale the dispersion, which I’d guess might be more appropriate for your use case, but please double check.

Best of luck with your model!

2 Likes

Howdy! My suggestion would be to go back to the drawing board and think about the process by which this data might be generated. I don’t think that a ‘regression’ is the best approach here. I think a latent variables model in Stan would be better. One easy way to think about the data generation process is to attempt to simulate your data. For example, you formulation of “effort” seems odd to me. The formulation of total debris sampled as a linear (on the log scale) combination of all these variables seems odder still.

As an example, here is a simplified simulation of your process. Note that I do not work in environmental sciences, so I don’t really know what the variables in your dataset mean. I used “wind_speed” in the example below, but you might use whatever contributes to debris washing up on the beach (wind direction, tide, location, time of year, or whatever).

N <- 1000
number_volunteers <- rpois(N, lambda = 20)
work_rate <- rlnorm(N, meanlog = -1, sd = 0.1)  #the length of beach a single volunteer covers in km/hr
time_worked <- rlnorm(N, meanlog = log(3), sdlog = 0.25)  #hrs
wind_speed <- rlnorm(N, meanlog = log(20), sdlog = 0.1)  #km/hr

sigma_lgn <- 0.15
distance_covered <- rlnorm(N, meanlog = log(number_volunteers * work_rate * time_worked), sdlog = sigma_lgn)  #the amount of beach that was cleaned in km
effeciency_cleanup <- rbeta(N, shape1 = number_volunteers, shape2 = distance_covered/time_worked)  #proportion of debris collected per debris available

b1 <- 3
debris_available_per_1km <- rpois(N, lambda = b1 * wind_speed)   #the actual amount of debris on the beach per km
total_debris_collected <- rpois(N, lambda = debris_available_per_1km * effeciency_cleanup * distance_covered)  #the amount of debris that was actually collected

par(mfrow=c(3,3))
hist(work_rate, breaks=100)
hist(time_worked, breaks=100)
hist(wind_speed, breaks=100)
hist(distance_covered, breaks=100)
hist(effeciency_cleanup, breaks=100)
hist(debris_available_per_1km, breaks=100)
hist(total_debris_collected, breaks=100)
hist(debris_available_per_1km*distance_covered - total_debris_collected, breaks=100)

Distance of beach covered (sampled) really sounds more like a function of the number of volunteers, the work rate per volunteer, and the time worked. The efficiency of the cleanup would seem to depend on the number of volunteers and how much distance they covered per unit time. These two variables (distance covered and efficiency) would make up some notion of effort. The amount of debris collected is then some function of the effort and amount of debris that is available to collect. The amount available to collect depends on how much is washed up (or whatever, I’m not an ecologist).

Of course, you do not have all of these variables in your dataset. In fact, the quantity that I would suspect that your are actually interested in is the latent variable debris_available_per_1km. You can fit latent variable models in Stan, where each variable is a vector of parameters. Translating the simulation of your data generation process to a model might involve fitting a latent variables model and/or making some general assumptions about some of those products if your don’t have very strong prior information on them (products like that pose model identification problems). For example, you might have to combine products into a single variable. In any case, it should be food for thought. When a model gives poor diagnostics and/or poor posterior predictive checks, it is a good indication that the model isn’t a very good model of the data generation process.

Hope that helps.

3 Likes

Thank you for your reply. This seems very reasonable. I have checked the PPC, results below. And all of the predictors seem slightly right tail but not extreme.




I followed your advise on how to include he effort and included the eventID as random effect and have tried the following model:

brms_gam_time_nbrate5 <- Total|rate(Dist.Vol) ~ tide_mean + slope_mean + avg_Thgt + dir16_num + BACKPROX_Va +DayIntSite + (1 | Year1) + (1 | Site) + (1|Event.ID), data = data1, family = negbinomial(), save_pars = save_pars(all = TRUE),control = list(max_treedepth = 15)

This improves a lot the pp_check although it still seems very right tail? Would this be a problem or seems reasonable? At least I can see the trend without having to zoom it. But when zoomed in it follows exactly the trend of the data.

Results:

##  Family: negbinomial 
##   Links: mu = log; shape = identity 
## Formula: Total | rate(Dist.Vol) ~ tide_mean + slope_mean + avg_Thgt + dir16_num + BACKPROX_Va + DayIntSite + (1 | Year1) + (1 | Site) + (1 | Event.ID) 
##    Data: data1 (Number of observations: 1935) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Multilevel Hyperparameters:
## ~Event.ID (Number of levels: 1935) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     1.34      0.02     1.29     1.39 1.01      289     2121
## 
## ~Site (Number of levels: 125) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.82      0.07     0.69     0.97 1.00     1565     2771
## 
## ~Year1 (Number of levels: 12) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.21      0.07     0.10     0.37 1.00     1791     2341
## 
## Regression Coefficients:
##                                   Estimate Est.Error l-95% CI u-95% CI Rhat
## Intercept                            -2.23      1.59    -5.31     0.90 1.00
## tide_mean                            -0.11      0.12    -0.33     0.13 1.00
## slope_mean                           -0.00      0.01    -0.01     0.01 1.00
## avg_Thgt                              0.41      0.06     0.30     0.52 1.00
## dir16_num                             0.00      0.00    -0.00     0.01 1.00
## BACKPROX_VaArtificialStructures      -0.96      1.28    -3.40     1.60 1.00
## BACKPROX_VaBeachRidges               -1.86      1.32    -4.45     0.76 1.00
## BACKPROX_VaColluvium                 -2.18      1.55    -5.20     0.81 1.00
## BACKPROX_VaDunefieldorbeachridges    -2.21      1.30    -4.76     0.33 1.00
## BACKPROX_VaDunes                     -2.23      1.17    -4.49     0.11 1.00
## BACKPROX_VaHardbedrock               -1.28      1.20    -3.65     1.07 1.00
## BACKPROX_VaSandDeposits              -1.58      1.20    -3.90     0.78 1.00
## DayIntSite                           -0.00      0.00    -0.00     0.00 1.00
##                                   Bulk_ESS Tail_ESS
## Intercept                             3451     2877
## tide_mean                             3777     3312
## slope_mean                            6288     2624
## avg_Thgt                              2067     2357
## dir16_num                             2953     2749
## BACKPROX_VaArtificialStructures       3806     2958
## BACKPROX_VaBeachRidges                4068     2791
## BACKPROX_VaColluvium                  4359     2870
## BACKPROX_VaDunefieldorbeachridges     3891     3083
## BACKPROX_VaDunes                      3937     2601
## BACKPROX_VaHardbedrock                3759     2704
## BACKPROX_VaSandDeposits               3787     2860
## DayIntSite                            4321     2876
## 
## Further Distributional Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape     0.06      0.02     0.03     0.12 1.05       59      178
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Although, still overdispersed and the conf intervals for Backprox seem really high. Would you still recommend trying the poisson-lognormal?

Thank you a lot for your help :)

1 Like

Thank you. I will have a search on what this is. This seems way out of my league (too complex for me). It may require some time studying this.

Those PPCs look very normal (maybe even too good :-) This may in fact suggest that variance/fano factor is not the best statistic to look at. We know that the overall density PPC shows problems, so we expect to find a “culprit” somewhere and those checks don’t show us any. Maybe using the dens_overlay_grouped PPC would be better in this case. Or maybe a different statistic (skewness?) could help.

I would definitely first try just the rate formulation without the random effect. The per-observation random effect is a bit of an extreme measure - as witnessed by your need to set high max_treedepth - I presume you had sampling problems otherwise.

I don’t really follow the reasoning here. The plot looks like the posterior predictive distribution aligns very closely with the observed one. However, that’s somewhat unsurprising as you are now using a much more flexible model (the increased flexibility then might also account for the wide posterior credible intervals you see).

Also when using the per-observation random effect, the model should be better behaved when using a poisson family (it then behaves like a Poisson-lognormal model).

1 Like

Thank you again for the help and patience :)

Yes. The model was asking for a max_treedepth above 10 and the model took roughly 7hours to run.

brms_gam_nbrate_nb <- brm(Total|rate(Dist.Vol) ~ tide_mean + slope_mean + avg_Thgt + dir16_num + BACKPROX_Va + DayIntSite + (1 | Year1) + (1 | Site), data = data1, family = negbinomial(), chains = 4, cores = 4, save_pars = save_pars(all = TRUE))

The previous ppc’s were for the model with the event.ID as a random effect. I did the above models as suggested and did the ppc checks for skewness and the fano factor again.

Checks for the negative binomial for the fano factor and the skewness:

  • tide_mean, dire16-num, avg_Thgt and slope have the same plot exactly so i have included only the one for the tide
  • when using dens_overlay_grouped it is difficult to see anything on the plots.





From these plots I can see some of the categories from BACKPROX are skewed so I am guessing I can use the shape parameter for backprox to see if this improves the model?

Family: negbinomial 
  Links: mu = log; shape = identity 
Formula: Total | rate(Dist.Vol) ~ tide_mean + slope_mean + avg_Thgt + dir16_num + BACKPROX_Va + DayIntSite + (1 | Year1) + (1 | Site) 
   Data: data1 (Number of observations: 1935) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~Site (Number of levels: 125) 
              Estimate Est.Error l-95% CI u-95% CI Rhat
sd(Intercept)     1.16      0.09     1.00     1.36 1.01
              Bulk_ESS Tail_ESS
sd(Intercept)      526      884

~Year1 (Number of levels: 12) 
              Estimate Est.Error l-95% CI u-95% CI Rhat
sd(Intercept)     0.45      0.11     0.29     0.73 1.00
              Bulk_ESS Tail_ESS
sd(Intercept)      862     1353

Regression Coefficients:
                                  Estimate Est.Error l-95% CI
Intercept                            -4.30      1.59    -7.49
tide_mean                            -0.03      0.14    -0.30
slope_mean                           -0.01      0.01    -0.02
avg_Thgt                             -0.06      0.05    -0.15
dir16_num                             0.01      0.00     0.01
BACKPROX_VaArtificialStructures      -0.01      1.49    -2.87
BACKPROX_VaBeachRidges               -0.28      1.51    -3.30
BACKPROX_VaColluvium                 -0.93      1.85    -4.52
BACKPROX_VaDunefieldorbeachridges    -1.43      1.51    -4.44
BACKPROX_VaDunes                     -1.22      1.35    -3.88
BACKPROX_VaHardbedrock                0.06      1.40    -2.78
BACKPROX_VaSandDeposits              -0.91      1.36    -3.61
DayIntSite                           -0.00      0.00    -0.00
                                  u-95% CI Rhat Bulk_ESS
Intercept                            -1.25 1.01      502
tide_mean                             0.26 1.01      483
slope_mean                            0.01 1.01      413
avg_Thgt                              0.03 1.00     1985
dir16_num                             0.02 1.00     3234
BACKPROX_VaArtificialStructures       2.97 1.00      520
BACKPROX_VaBeachRidges                2.66 1.00      538
BACKPROX_VaColluvium                  2.72 1.00      635
BACKPROX_VaDunefieldorbeachridges     1.64 1.01      489
BACKPROX_VaDunes                      1.48 1.00      459
BACKPROX_VaHardbedrock                2.87 1.00      458
BACKPROX_VaSandDeposits               1.78 1.01      454
DayIntSite                            0.00 1.00     4720
                                  Tail_ESS
Intercept                              999
tide_mean                             1094
slope_mean                             934
avg_Thgt                              2721
dir16_num                             2756
BACKPROX_VaArtificialStructures       1178
BACKPROX_VaBeachRidges                1219
BACKPROX_VaColluvium                  1256
BACKPROX_VaDunefieldorbeachridges      996
BACKPROX_VaDunes                      1087
BACKPROX_VaHardbedrock                 959
BACKPROX_VaSandDeposits               1046
DayIntSite                            3411

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
shape     0.00      0.00     0.00     0.00 1.00     4660
      Tail_ESS
shape     2721

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Thank you. that makes sense. :)

Just to be completely clear - the problem is not that some of the categories are skewed, but that some are substantially less skewed, than the model predicts while other are substantially more skewed than the model predicts.

Yes, putting shape ~ BACKPROX_Va could alleviate the problems - but since the problem lies in skewness (but not variance), it is possible that the problem is something else (e.g. the negative binomial assumption being problematic).

Hope you can make some more progress!