I have a dataset of species densities per square meter inferred from camera traps in different areas (which is the treatment) and replicated across different geographical units. The data contains a lot of zeros and positive continuous data. I want to quantify the effect so the treatment
areas on species densities. I think an appropriate way to model the data is using either a hurdle_lognormal
or hurdle_gamma
model. Here’s my data summary with respect to the averages, proportion of zeros, sample sizes (zero, non-zero, total).
# A tibble: 16 × 7
# Groups: treatment [4]
treatment species average prop_zero zeros non_zeros n
<chr> <chr> <dbl> <dbl> <int> <int> <int>
1 R1 spec1 0.483 0.473 268 299 567
2 R1 spec2 0.0110 0.893 508 61 569
3 R1 spec3 0.482 0.645 367 202 569
4 R1 spec4 1.06 0.585 333 236 569
5 T1 spec1 0.570 0.412 70 100 170
6 T1 spec2 0.0183 0.829 141 29 170
7 T1 spec3 0.495 0.565 96 74 170
8 T1 spec4 0.758 0.453 77 93 170
9 T2 spec1 0.288 0.535 54 47 101
10 T2 spec2 0.000202 0.980 99 2 101
11 T2 spec3 0.278 0.634 64 37 101
12 T2 spec4 2.35 0.327 33 68 101
13 T3 spec1 0.371 0.380 70 114 184
14 T3 spec2 0.0125 0.918 169 15 184
15 T3 spec3 0.821 0.505 93 91 184
16 T3 spec4 1.05 0.402 74 110 184
Here’s a violin plot with the positive (non-zero) densities by species and treatment:
My hurdle gamma model:
m1_gamma <- brm(
bf(
# gamma model on the non-zero
density_km2_adj ~ treatment + species +
treatment : species
(1 | unit) ,
# binomial model
hu ~ treatment + species +
treatment : species+
(1 | unit)
),
family = hurdle_gamma(link = "log"),
data = df
)
At this stage, I would like to stick to default priors. The pp_check
has values on the x axis reaching for out beyond what is observed and I don’t know how I can deal with this in the modelling step. Here’s an overview and “zoomed in” version:
And this is the hurdle lognormal model:
m1_lognormal <- brm(
bf(
# gamma model on the non-zero
density_km2_adj ~ treatment + species +
treatment : species +
(1 | unit) ,
# binomial model
hu ~ treatment + species +
treatment : species+
(1 | unit)
),
family = hurdle_lognormal(),
data = df
)
Posterior predictive checks as follows (default and “zoomed in”):
To me it looks like that the lognormal model does a slightly better job at predicting the observed data.
I should also say that there were no warnings running either model.
1.) I was hoping to get some advice as to whether my approach makes sense or whether there are better ways of approaching this (especially given the result of the posterior predictive check).
2.) Would it make sense to run 4 separate models for each species? Or does it make more sense to keep species as variable in the model? Keeping as much data in the model as possible seems to be a good approach, however some species (such as spec2
) is greatly different than the other ones.
Thank you!