Modelling "hurdle lognormal" vs "hurdle gamma" on data with many zeros

I have a dataset of species densities per square meter, inferred from camera traps in different areas (the treatment) and replicated across different geographical units. The data contain a lot of zeros along with positive continuous values. I want to quantify the effect of the treatment areas on species densities. I think an appropriate way to model the data is with either a hurdle_lognormal or a hurdle_gamma model. Here's a summary of my data with the averages, proportion of zeros, and sample sizes (zero, non-zero, total).

# A tibble: 16 × 7
# Groups:   treatment [4]
   treatment species  average prop_zero zeros non_zeros     n
   <chr>     <chr>      <dbl>     <dbl> <int>     <int> <int>
 1 R1        spec1   0.483        0.473   268       299   567
 2 R1        spec2   0.0110       0.893   508        61   569
 3 R1        spec3   0.482        0.645   367       202   569
 4 R1        spec4   1.06         0.585   333       236   569
 5 T1        spec1   0.570        0.412    70       100   170
 6 T1        spec2   0.0183       0.829   141        29   170
 7 T1        spec3   0.495        0.565    96        74   170
 8 T1        spec4   0.758        0.453    77        93   170
 9 T2        spec1   0.288        0.535    54        47   101
10 T2        spec2   0.000202     0.980    99         2   101
11 T2        spec3   0.278        0.634    64        37   101
12 T2        spec4   2.35         0.327    33        68   101
13 T3        spec1   0.371        0.380    70       114   184
14 T3        spec2   0.0125       0.918   169        15   184
15 T3        spec3   0.821        0.505    93        91   184
16 T3        spec4   1.05         0.402    74       110   184

Here’s a violin plot with the positive (non-zero) densities by species and treatment:

My hurdle gamma model:

m1_gamma <- brm(
  bf(
    # gamma model on the non-zero densities
    density_km2_adj ~ treatment + species +
      treatment:species +
      (1 | unit),
    # binomial model on the probability of a zero
    hu ~ treatment + species +
      treatment:species +
      (1 | unit)
  ),
  family = hurdle_gamma(link = "log"),
  data = df
)

At this stage, I would like to stick to default priors. The pp_check has values on the x axis reaching far beyond what is observed, and I don't know how to deal with this in the modelling step. Here's an overview and a "zoomed in" version:

And this is the hurdle lognormal model:

m1_lognormal <- brm(
  bf(
    # lognormal model on the non-zero densities
    density_km2_adj ~ treatment + species +
      treatment:species +
      (1 | unit),
    # binomial model on the probability of a zero
    hu ~ treatment + species +
      treatment:species +
      (1 | unit)
  ),
  family = hurdle_lognormal(),
  data = df
)

Posterior predictive checks as follows (default and “zoomed in”):

To me it looks like the lognormal model does a slightly better job at predicting the observed data.

I should also say that neither model produced any warnings.

1.) I was hoping to get some advice as to whether my approach makes sense or whether there are better ways of approaching this (especially given the result of the posterior predictive check).

2.) Would it make sense to run 4 separate models, one per species? Or does it make more sense to keep species as a variable in the model? Keeping as much data in the model as possible seems like a good approach; however, some species (such as spec2) are markedly different from the others.

Thank you!

Hi @mcsnowface .
What are you dividing your species occurrences by* to get densities? You may want to use that as an offset in a model of the raw counts using a negative binomial or Poisson distribution.
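
As a sketch of that idea, with hypothetical columns `count` for the raw detections and `area_km2` for whatever quantity the densities are divided by:

```r
library(brms)

# Negative binomial model of raw counts, with the log of the
# exposure as an offset; `count` and `area_km2` are placeholder
# column names, not from the original data.
m_nb <- brm(
  count ~ treatment * species + (1 | unit) + offset(log(area_km2)),
  family = negbinomial(),
  data = df
)
```

The offset keeps the response on the count scale while the effects are still interpretable per unit of exposure.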

If I understand the structure of your data correctly, you might want all your population-level predictors to vary by unit.

Are you interested in differences between the four species?

*well area I guess but I am not familiar with camera trap data so it is not clear to me how camera captures translate to individuals per area.

@amynang Thanks for your response! The density is calculated using other inputs, such as the time the animal spent in the frame, etc. It's not just an area. I'm following this approach: https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecs2.4005

Do you have any other ideas? The extremely large values are outliers, and they fall outside the 95% credible intervals when calculating the expectations of the posterior predictive, so they won't necessarily affect those predictions (I think?!). Predictions from the model also match the observed data quite well.

Wow that is more complex than I thought…

Given the differences between species, it's worth trying to extend your set of predictors to sigma, or at least sigma ~ species if not the full set.
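
In brms, sigma (the residual SD of the lognormal part, on the log scale) can get its own linear predictor; a minimal sketch:

```r
library(brms)

# Hurdle lognormal with a species-specific residual SD;
# `sigma ~ species` is the distributional-regression part.
m_sigma <- brm(
  bf(
    density_km2_adj ~ treatment * species + (1 | unit),
    hu ~ treatment * species + (1 | unit),
    sigma ~ species
  ),
  family = hurdle_lognormal(),
  data = df
)
```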

My other point was that your group-level terms should probably be

(species*treatment | unit)

As it is now, you are partially pooling information across units only for your intercept (your reference species:treatment level), while the differences from that level are estimated with complete pooling.
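
In the full model that would look something like the sketch below. Note that this is just an illustration of the formula structure: a full set of correlated group-level effects per unit can be slow to fit and may need more informative priors.

```r
library(brms)

# Let all treatment:species effects vary by unit, in both parts
m_varying <- brm(
  bf(
    density_km2_adj ~ treatment * species + (treatment * species | unit),
    hu ~ treatment * species + (treatment * species | unit)
  ),
  family = hurdle_lognormal(),
  data = df
)
```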


The hurdle part is likely to make the predicted proportion of zeros match the observed one well, as discussed in the context of bar graphs in Recommendations for visual predictive checks in Bayesian workflow. To improve the PPC, it would be good to examine the hurdle part and the continuous part separately. The hurdle part corresponds to a binary model, for which a PAV-adjusted calibration plot would be good. For the continuous part (filter out the data rows with zeros, and the corresponding predictions, before calling pp_check), a KDE is fine, but you could add + scale_x_log10() to better see the differences in the distributions. You can also use the group argument to make a separate plot for each species.
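
A minimal sketch of the continuous-part check, assuming the fitted model `m1_lognormal` and data frame `df` from above:

```r
library(brms)
library(bayesplot)
library(ggplot2)

# Posterior predictive draws for every observation
yrep <- posterior_predict(m1_lognormal, ndraws = 100)

# Keep only the observations with non-zero densities,
# together with the matching columns of the predictions
pos <- df$density_km2_adj > 0
ppc_dens_overlay(df$density_km2_adj[pos], yrep[, pos]) +
  scale_x_log10()

# One panel per species
ppc_dens_overlay_grouped(df$density_km2_adj[pos], yrep[, pos],
                         group = df$species[pos]) +
  scale_x_log10()
```

Any zeros the hurdle still produces in the retained columns are simply dropped by the log axis.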
