Modelling "hurdle lognormal" vs "hurdle gamma" on data with many zeros

I have a dataset of species densities per square meter, inferred from camera traps in different areas (the treatment) and replicated across different geographical units. The data contain a lot of zeros along with positive continuous values. I want to quantify the effect of the treatment areas on species densities. I think an appropriate way to model the data is with either a hurdle_lognormal or a hurdle_gamma model. Here's a summary of my data with the averages, proportion of zeros, and sample sizes (zero, non-zero, total).

# A tibble: 16 × 7
# Groups:   treatment [4]
   treatment species  average prop_zero zeros non_zeros     n
   <chr>     <chr>      <dbl>     <dbl> <int>     <int> <int>
 1 R1        spec1   0.483        0.473   268       299   567
 2 R1        spec2   0.0110       0.893   508        61   569
 3 R1        spec3   0.482        0.645   367       202   569
 4 R1        spec4   1.06         0.585   333       236   569
 5 T1        spec1   0.570        0.412    70       100   170
 6 T1        spec2   0.0183       0.829   141        29   170
 7 T1        spec3   0.495        0.565    96        74   170
 8 T1        spec4   0.758        0.453    77        93   170
 9 T2        spec1   0.288        0.535    54        47   101
10 T2        spec2   0.000202     0.980    99         2   101
11 T2        spec3   0.278        0.634    64        37   101
12 T2        spec4   2.35         0.327    33        68   101
13 T3        spec1   0.371        0.380    70       114   184
14 T3        spec2   0.0125       0.918   169        15   184
15 T3        spec3   0.821        0.505    93        91   184
16 T3        spec4   1.05         0.402    74       110   184

Here’s a violin plot with the positive (non-zero) densities by species and treatment:

My hurdle gamma model:

m1_gamma <- brm(
  bf(
    # gamma model on the non-zero densities
    density_km2_adj ~ treatment + species +
      treatment:species +
      (1 | unit),
    # binomial model on the probability of a zero
    hu ~ treatment + species +
      treatment:species +
      (1 | unit)
  ),
  family = hurdle_gamma(link = "log"),
  data = df
)

At this stage, I would like to stick to default priors. The pp_check has values on the x axis reaching far beyond what is observed, and I don't know how to deal with this in the modelling step. Here's an overview and a "zoomed in" version:

And this is the hurdle lognormal model:

m1_lognormal <- brm(
  bf(
    # lognormal model on the non-zero densities
    density_km2_adj ~ treatment + species +
      treatment:species +
      (1 | unit),
    # binomial model on the probability of a zero
    hu ~ treatment + species +
      treatment:species +
      (1 | unit)
  ),
  family = hurdle_lognormal(),
  data = df
)

Posterior predictive checks as follows (default and “zoomed in”):

To me it looks like the lognormal model does a slightly better job at predicting the observed data.

I should also say that neither model produced any warnings.

1.) I was hoping to get some advice as to whether my approach makes sense or whether there are better ways of approaching this (especially given the result of the posterior predictive check).

2.) Would it make sense to run 4 separate models, one per species? Or does it make more sense to keep species as a variable in the model? Keeping as much data in the model as possible seems like a good approach; however, some species (such as spec2) are markedly different from the others.

Thank you!

Hi @mcsnowface .
What are you dividing your species occurrences by* to get densities? You may want to use that as an offset in a model of the raw counts using a negative binomial or Poisson distribution.
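
As a sketch of that idea, with hypothetical columns `count` for the raw detections and `area_km2` for whatever quantity the densities are divided by:

```r
library(brms)

# Negative binomial model of raw counts, with the log of the
# exposure as an offset; `count` and `area_km2` are placeholder
# column names, not from the original data.
m_nb <- brm(
  count ~ treatment * species + (1 | unit) + offset(log(area_km2)),
  family = negbinomial(),
  data = df
)
```

The offset keeps the response on the count scale while the effects are still interpretable per unit of exposure.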

If I understand the structure of your data correctly, you might want all your population-level predictors to vary by unit.

Are you interested in differences between the four species?

*well area I guess but I am not familiar with camera trap data so it is not clear to me how camera captures translate to individuals per area.

@amynang Thanks for your response! The density is calculated using other inputs, such as the time the animal spent in the frame, etc. It's not just an area. I'm following this approach: https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecs2.4005

Do you have any other ideas? The extremely large values are outliers, and they fall outside the 95% credible intervals when calculating the expectations of the posterior predictive, so they won't necessarily affect those predictions (I think?!). Predictions from the model also match the observed data quite well.

Wow that is more complex than I thought…

Given the differences between species, it's worth trying to extend your set of predictors to sigma, or at least sigma ~ species if not the full set.
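
In brms, sigma (the residual SD of the lognormal part, on the log scale) can get its own linear predictor; a minimal sketch:

```r
library(brms)

# Hurdle lognormal with a species-specific residual SD;
# `sigma ~ species` is the distributional-regression part.
m_sigma <- brm(
  bf(
    density_km2_adj ~ treatment * species + (1 | unit),
    hu ~ treatment * species + (1 | unit),
    sigma ~ species
  ),
  family = hurdle_lognormal(),
  data = df
)
```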

My other point was that your group-level terms should probably be

(species*treatment | unit)

As it is now, you are partially pooling information across units only for your intercept (your reference species:treatment level), while the differences from that level are estimated with complete pooling.
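
In the full model that would look something like the sketch below. Note that this is just an illustration of the formula structure: a full set of correlated group-level effects per unit can be slow to fit and may need more informative priors.

```r
library(brms)

# Let all treatment:species effects vary by unit, in both parts
m_varying <- brm(
  bf(
    density_km2_adj ~ treatment * species + (treatment * species | unit),
    hu ~ treatment * species + (treatment * species | unit)
  ),
  family = hurdle_lognormal(),
  data = df
)
```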


The hurdle part is likely to make the predicted proportion of zeros match the observed one well, as discussed in the context of bar graphs in Recommendations for visual predictive checks in Bayesian workflow. To improve the PPC, it would be good to examine the hurdle part and the continuous part separately. The hurdle part corresponds to a binary model, for which a PAV-adjusted calibration plot would be good. For the continuous part (filter out the data rows with zeros, and the corresponding predictions, before calling pp_check), a KDE is fine, but you could add + scale_x_log10() to better see the differences in the distributions. You can also use the group argument to make a separate plot for each species.
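
A minimal sketch of the continuous-part check, assuming the fitted model `m1_lognormal` and data frame `df` from above:

```r
library(brms)
library(bayesplot)
library(ggplot2)

# Posterior predictive draws for every observation
yrep <- posterior_predict(m1_lognormal, ndraws = 100)

# Keep only the observations with non-zero densities,
# together with the matching columns of the predictions
pos <- df$density_km2_adj > 0
ppc_dens_overlay(df$density_km2_adj[pos], yrep[, pos]) +
  scale_x_log10()

# One panel per species
ppc_dens_overlay_grouped(df$density_km2_adj[pos], yrep[, pos],
                         group = df$species[pos]) +
  scale_x_log10()
```

Any zeros the hurdle still produces in the retained columns are simply dropped by the log axis.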
