Comparing Hurdle Model with Non-Hurdle Model

jacollings · November 19, 2023, 5:55pm

The STAN manual has a section on ZI and hurdle models with the following code:

   if (y[n] == 0)
      1 ~ bernoulli(theta);
    else {
      0 ~ bernoulli(theta);
      y[n] ~ poisson(lambda) T[1, ];
    }

My understanding is that to get the pointwise log-likelihoods, I’d want a generated quantities chunk with the following statements:

   if (y[n] == 0)
      log_lik += bernoulli_lpdf(1 | theta);
    else {
      log_lik += bernoulli(0 | theta);
      log_lik += poisson(y[n] | lambda) T[1, ];
    }

If y contains n data points, n’ of which are 0’s, then this should produce n + (n-n’) log_lik values, correct? If so, running a non-hurdle model version of this (just a traditional poisson model) would produce n log_lik values. My understanding is that I can’t compare the ELPD’s of two models with ‘different data’ or a different number of log_lik values. I’m hoping to compare the hurdle model to a model of the same data without the hurdle component. Could someone correct my understanding or offer a solution?

simonbrauer · November 19, 2023, 10:24pm

They should be directly comparable. The likelihood for non-zero observations is the product of one minus the probability of zero and the truncated poison mass function. In other words, it’s one term per person, not two. See, for example, the users guide:

if (y[n] == 0) {
  target += log(theta);
} else {
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
                  - poisson_lccdf(0 | lambda));
}

You’ll also want to store the point wise log-likelihoods in separate elements of a vector rather than sum them together in the generated quantities block.

vector[N] log_lik;

for(n in 1:N){
if (y[n] == 0) {
  log_lik[n] = log(theta);
} else {
  log_lik[n] = log1m(theta) + poisson_lpmf(y[n] | lambda)
                         - poisson_lccdf(0 | lambda));
}

avehtari · November 20, 2023, 4:09pm

jacollings:

The STAN manual has a section on ZI and hurdle models with the following code:
   if (y[n] == 0)
      1 ~ bernoulli(theta);
    else {
      0 ~ bernoulli(theta);
      y[n] ~ poisson(lambda) T[1, ];
    }

Are you looking at the old manual version? Maybe the top of the page says “This is an old version, view current version.”? The latest version does not have that code any more (I know as I removed it). I hope the current version is more clear

if (y[n] == 0) {
  target += log(theta);
} else {
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - poisson_lccdf(0 | lambda));
}

Can you see from that what you need in generated quantities?

You should get N log_lik values in both cases. In Hurdle model there is a different if-else branch for zeros and non-zeros. In Poisson only one branch for both zeros and non-zeros.

Let me know if the above was not clear enough to write the generated quantities.

jacollings · November 20, 2023, 10:09pm

Ah yes! Thank you for catching and explaining my misunderstanding of the likelihood as well as catching my error in the generated quantities block.

jacollings · November 20, 2023, 10:10pm

Yes! I missed the older version warning. This, as well as Simon’s response, have helped greatly! Thank you.

jsocolar · November 21, 2023, 4:36am

An important note here is that for model comparison purposes, the above all works perfectly for families wherein the nonzero component of the response is discrete, but model comparison is more complicated when the nonzero component of the response is continuous, because the log-likelihood calculation mixes probability mass (for the data zeros) with probability density (for the data nonzeros).

Formally, whenever there is a literal zero in the data, a zero-hurdle model should be infinitely preferred over a non-hurdle continuous model, because probability masses are infinitely larger than probability densities.

Thus, to compare a hurdle-continuous family to its non-hurdle counterpart, you need to think carefully about how to weight the probability masses in the likelihood versus the probability densities. It’s often appropriate to take the log-likelihood of zeros, even in the model without the hurdle component, to be the probability mass associated with integrating the probability density over some region of practical equivalence to zero (in which case it would be best to fit the non-hurdle version of the model using this likelihood).

Topic		Replies	Views
Log likelihood for hurdle model Modeling loo	6	1617	May 24, 2017
Modelling "hurdle lognormal" vs "hurdle gamma" on data with many zeros Modeling specification , brms	4	84	March 21, 2025
Best way to calculate the mean in a regression hurdle model Modeling mixture	7	124	February 25, 2025
Coding hurdle model with censored data Modeling rstan , specification	2	642	November 24, 2022
Preventing numerical 0s in hurdle model Modeling	3	514	August 21, 2020

Comparing Hurdle Model with Non-Hurdle Model

Related topics