Hi, I am having difficulties setting up the model and understanding what i can do. As background I beach clean-up events where we recorded the total number of counts of debris - which is my response variable and am testing different variables that can affect the accumulation of debris among the different sites. For each event we collected the effort (number of volunteers and Distance sampled in meters) which in the model below I have included as an offset (‘Dist.Vol’ = number of volunteers * Distance sampled in meters).
See how my data looks:
str(data1)
'data.frame': 1935 obs
$ Event ID : chr "2128" "5111" "6236" "7082" ...
$ Site : Factor w/ 125 levels "Ammunition Jetty Coogee, WA",
$ Date : chr "2011/10/16" "2014/10/12" "2015/08/16"
$ DayIntSite : num 0 1092 308 189 160 ...
$ Year : num 2011 2014 2015 2016 2016 ...
$ Total Debris : num 1615 801 3303 2130 17534 ...
$ number of volunteers : int 13 18 19 75 120 48 30 82 95 3 ...
$ Distance sampled in m : num 170 180 300 2000 3500 503 1000
$ Dist.Vol : num 2210 3240 5700 150000 420000 ...
$ BACKPROX_Va : Factor w/ 8 levels "Aeolians and Sheets",..: 6 6 6
$ BACKDIST_Va : Factor w/ 11 levels "Aeolian Sand-Sheets",..: 11 11
$ slope_mean : num 54 54 54 54 54 ...
$ avg_Thgt : num 3.44 1.98 1.76 1.79 2.07 ...
$ majority_dir16 : chr "SW" "WSW" "WSW" "WSW" ...
$ tide_mean : num 0.965 0.88 0.809 0.708 0.829 ...
$ Year1 : Factor w/ 12 levels "2011","2012",..: 1 4 5 6 6 6 7 7 8 8 ...
I have tried the negative binomial with defaut priors and this is the result of the pp_check:
brms_gam_time_nb_ALL <- brm(Total ~ tide_mean + slope_mean + avg_Thgt + majority_dir16 + BACKPROX_Va + BACKDIST_Va + DayIntSite + (1 | Year1) + (1 | Site) + offset(log(Dist.Vol)), data = data1, family = negbinomial())
pp_check(brms_gam_time_nb_ALL )
it has a loooooong tail. When zoomed in looks like this:
I have tried a few models but i do not have knowledge to change the priors on my own. I tried the hurdle negative binomial and this also seemed weird. See the result of the pp_check below.
brms_gam_time_hurdlenb <- brm(Total ~ tide_mean + slope_mean + avg_Thgt + majority_dir16 + BACKPROX_Va + BACKDIST_Va + DayIntSite + (1 | Year1) + (1 | Site) + offset(log(Dist.Vol)), data = data1, family = hurdle_negbinomial())
pp_check(brms_gam_time_hurdlenb)
I decided to log transform the counts and use a gaussian but I am not sure if this is an ok thing to do. The model seems to fit better. see below. I know the best approach should have been using the neg binomial and not transform the counts.
model_brms_log_ALLxx <- brm(logTotal ~ tide_mean + slope_mean + avg_Thgt + majority_dir16 + BACKPROX_Va + BACKDIST_Va + DayIntSite + (1 | Year1) + (1 | Site) + offset(log(Dist.Vol)), data = data1, family = gaussian())
pp_check(model_brms_log_ALLxx)
I can show you the results summary if needed too. But my question is if there which avenue I should follow… The negative binomial in theory is the best option but is a way of improving the model? Is the log-transformed a wrong approach?
- Worth saying that I also tried to include the offset this way: Total | rate(Dist.Vol) instead of log(effort).
- Rhats for negative binomial and hurdle are not great (Rhat = 1.01 for most of the variables)while using the log-transformed they are perfect (Rhat=1) for all variables.
Thank you.
Ana