Fitting count data with negative binomial - long tail

Hi, I am having difficulties setting up the model and understanding what i can do. As background I beach clean-up events where we recorded the total number of counts of debris - which is my response variable and am testing different variables that can affect the accumulation of debris among the different sites. For each event we collected the effort (number of volunteers and Distance sampled in meters) which in the model below I have included as an offset (‘Dist.Vol’ = number of volunteers * Distance sampled in meters).

See how my data looks:

str(data1)
'data.frame':	1935 obs
 $ Event ID            : chr  "2128" "5111" "6236" "7082" ...
 $ Site                : Factor w/ 125 levels "Ammunition Jetty Coogee, WA",
 $ Date                : chr  "2011/10/16" "2014/10/12" "2015/08/16" 
 $ DayIntSite          : num  0 1092 308 189 160 ...
 $ Year                : num  2011 2014 2015 2016 2016 ...
 $ Total Debris            : num  1615 801 3303 2130 17534 ...
 $ number of volunteers          : int  13 18 19 75 120 48 30 82 95 3 ...
 $ Distance sampled in m          : num  170 180 300 2000 3500 503 1000 
 $ Dist.Vol            : num  2210 3240 5700 150000 420000 ...
 $ BACKPROX_Va         : Factor w/ 8 levels "Aeolians and Sheets",..: 6 6 6 
 $ BACKDIST_Va         : Factor w/ 11 levels "Aeolian Sand-Sheets",..: 11 11 
 $ slope_mean          : num  54 54 54 54 54 ...
 $ avg_Thgt            : num  3.44 1.98 1.76 1.79 2.07 ...
 $ majority_dir16      : chr  "SW" "WSW" "WSW" "WSW" ...
 $ tide_mean           : num  0.965 0.88 0.809 0.708 0.829 ...
 $ Year1               : Factor w/ 12 levels "2011","2012",..: 1 4 5 6 6 6 7 7 8 8 ...

I have tried the negative binomial with defaut priors and this is the result of the pp_check:

brms_gam_time_nb_ALL <- brm(Total ~ tide_mean + slope_mean + avg_Thgt + majority_dir16 + BACKPROX_Va + BACKDIST_Va + DayIntSite + (1 | Year1) + (1 | Site) + offset(log(Dist.Vol)), data = data1, family = negbinomial())

pp_check(brms_gam_time_nb_ALL ) 

it has a loooooong tail. When zoomed in looks like this:

I have tried a few models but i do not have knowledge to change the priors on my own. I tried the hurdle negative binomial and this also seemed weird. See the result of the pp_check below.

brms_gam_time_hurdlenb <- brm(Total ~ tide_mean + slope_mean + avg_Thgt + majority_dir16 + BACKPROX_Va + BACKDIST_Va + DayIntSite + (1 | Year1) + (1 | Site) + offset(log(Dist.Vol)), data = data1, family = hurdle_negbinomial())

pp_check(brms_gam_time_hurdlenb)

I decided to log transform the counts and use a gaussian but I am not sure if this is an ok thing to do. The model seems to fit better. see below. I know the best approach should have been using the neg binomial and not transform the counts.

model_brms_log_ALLxx <- brm(logTotal ~ tide_mean + slope_mean + avg_Thgt + majority_dir16 + BACKPROX_Va + BACKDIST_Va + DayIntSite + (1 | Year1) + (1 | Site) + offset(log(Dist.Vol)), data = data1, family = gaussian())

pp_check(model_brms_log_ALLxx)

I can show you the results summary if needed too. But my question is if there which avenue I should follow… The negative binomial in theory is the best option but is a way of improving the model? Is the log-transformed a wrong approach?

  • Worth saying that I also tried to include the offset this way: Total | rate(Dist.Vol) instead of log(effort).
  • Rhats for negative binomial and hurdle are not great (Rhat = 1.01 for most of the variables)while using the log-transformed they are perfect (Rhat=1) for all variables.

Thank you.
Ana

2 Likes

(post deleted by author)

There are at least two things you IMHO should check:

  1. Is the problem in the residual variability after accounting for predictors (i.e. the outcome distribution is wrong) or is it specific to some subgroups of the data?

To check this, you’d want to do grouped PP checks. It might also be useful to focus on the stat_grouped PPC with variance or mean/variance (a.k.a. Fano factor) as statistics. If you see that for some soubgroups of the data (e.g. for sam values of BACKPROX_Va or low avg_Thgt) the model has too much variance/too heavy tails and for others it has too little variance/too light tails, it might be wortwhile to put those as predictors on the dispersion parameter (i.e. let the dispersion vary between observations).

If instead the problem is more or less the same in all subgroups that make sense for your data, it might indicate the problem is in the outcome distribution and you may need something that’s even more flexible than negative binomial. Poisson - LogNormal models are an example of an alternative that has some support in existing software (though would require a bit of hacking to get it running in brms). There are also even more exotic Poisson-XXX variants, but those tend to have poor software support.

Adding a random effect per observation (i.e. something like (1 | Event_ID)) could also help in this case.

  1. Make sure your offsets are OK. There are two meaningful ways to use offsets in a negative binomial model - either the offset acts on mean alone or it also scales the dispersion. See Scaling of the overdispersion in negative binomial models for discussion. If I recall correctly, using Total | rate(Dist.Vol) instead of an offset term should automatically scale the dispersion, which I’d guess might be more appropriate for your use case, but please double check.

Best of luck with your model!

1 Like

Howdy! My suggestion would be to go back to the drawing board and think about the process by which this data might be generated. I don’t think that a ‘regression’ is the best approach here. I think a latent variables model in Stan would be better. One easy way to think about the data generation process is to attempt to simulate your data. For example, you formulation of “effort” seems odd to me. The formulation of total debris sampled as a linear (on the log scale) combination of all these variables seems odder still.

As an example, here is a simplified simulation of your process. Note that I do not work in environmental sciences, so I don’t really know what the variables in your dataset mean. I used “wind_speed” in the example below, but you might use whatever contributes to debris washing up on the beach (wind direction, tide, location, time of year, or whatever).

N <- 1000
number_volunteers <- rpois(N, lambda = 20)
work_rate <- rlnorm(N, meanlog = -1, sd = 0.1)  #the length of beach a single volunteer covers in km/hr
time_worked <- rlnorm(N, meanlog = log(3), sdlog = 0.25)  #hrs
wind_speed <- rlnorm(N, meanlog = log(20), sdlog = 0.1)  #km/hr

sigma_lgn <- 0.15
distance_covered <- rlnorm(N, meanlog = log(number_volunteers * work_rate * time_worked), sdlog = sigma_lgn)  #the amount of beach that was cleaned in km
effeciency_cleanup <- rbeta(N, shape1 = number_volunteers, shape2 = distance_covered/time_worked)  #proportion of debris collected per debris available

b1 <- 3
debris_available_per_1km <- rpois(N, lambda = b1 * wind_speed)   #the actual amount of debris on the beach per km
total_debris_collected <- rpois(N, lambda = debris_available_per_1km * effeciency_cleanup * distance_covered)  #the amount of debris that was actually collected

par(mfrow=c(3,3))
hist(work_rate, breaks=100)
hist(time_worked, breaks=100)
hist(wind_speed, breaks=100)
hist(distance_covered, breaks=100)
hist(effeciency_cleanup, breaks=100)
hist(debris_available_per_1km, breaks=100)
hist(total_debris_collected, breaks=100)
hist(debris_available_per_1km*distance_covered - total_debris_collected, breaks=100)

Distance of beach covered (sampled) really sounds more like a function of the number of volunteers, the work rate per volunteer, and the time worked. The efficiency of the cleanup would seem to depend on the number of volunteers and how much distance they covered per unit time. These two variables (distance covered and efficiency) would make up some notion of effort. The amount of debris collected is then some function of the effort and amount of debris that is available to collect. The amount available to collect depends on how much is washed up (or whatever, I’m not an ecologist).

Of course, you do not have all of these variables in your dataset. In fact, the quantity that I would suspect that your are actually interested in is the latent variable debris_available_per_1km. You can fit latent variable models in Stan, where each variable is a vector of parameters. Translating the simulation of your data generation process to a model might involve fitting a latent variables model and/or making some general assumptions about some of those products if your don’t have very strong prior information on them (products like that pose model identification problems). For example, you might have to combine products into a single variable. In any case, it should be food for thought. When a model gives poor diagnostics and/or poor posterior predictive checks, it is a good indication that the model isn’t a very good model of the data generation process.

Hope that helps.

1 Like