Comparing Hurdle Model with Non-Hurdle Model

The STAN manual has a section on ZI and hurdle models with the following code:

   if (y[n] == 0)
      1 ~ bernoulli(theta);
    else {
      0 ~ bernoulli(theta);
      y[n] ~ poisson(lambda) T[1, ];
    }

My understanding is that to get the pointwise log-likelihoods, I’d want a generated quantities chunk with the following statements:

   if (y[n] == 0)
      log_lik += bernoulli_lpdf(1 | theta);
    else {
      log_lik += bernoulli(0 | theta);
      log_lik += poisson(y[n] | lambda) T[1, ];
    }

If y contains n data points, n’ of which are 0’s, then this should produce n + (n-n’) log_lik values, correct? If so, running a non-hurdle model version of this (just a traditional poisson model) would produce n log_lik values. My understanding is that I can’t compare the ELPD’s of two models with ‘different data’ or a different number of log_lik values. I’m hoping to compare the hurdle model to a model of the same data without the hurdle component. Could someone correct my understanding or offer a solution?

1 Like

They should be directly comparable. The likelihood for non-zero observations is the product of one minus the probability of zero and the truncated poison mass function. In other words, it’s one term per person, not two. See, for example, the users guide:

if (y[n] == 0) {
  target += log(theta);
} else {
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
                  - poisson_lccdf(0 | lambda));
}

You’ll also want to store the point wise log-likelihoods in separate elements of a vector rather than sum them together in the generated quantities block.

vector[N] log_lik;

for(n in 1:N){
if (y[n] == 0) {
  log_lik[n] = log(theta);
} else {
  log_lik[n] = log1m(theta) + poisson_lpmf(y[n] | lambda)
                         - poisson_lccdf(0 | lambda));
} 
3 Likes

Are you looking at the old manual version? Maybe the top of the page says “This is an old version, view current version.”? The latest version does not have that code any more (I know as I removed it). I hope the current version is more clear

if (y[n] == 0) {
  target += log(theta);
} else {
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - poisson_lccdf(0 | lambda));
}

Can you see from that what you need in generated quantities?

You should get N log_lik values in both cases. In Hurdle model there is a different if-else branch for zeros and non-zeros. In Poisson only one branch for both zeros and non-zeros.

Let me know if the above was not clear enough to write the generated quantities.

2 Likes

Ah yes! Thank you for catching and explaining my misunderstanding of the likelihood as well as catching my error in the generated quantities block.

Yes! I missed the older version warning. This, as well as Simon’s response, have helped greatly! Thank you.

An important note here is that for model comparison purposes, the above all works perfectly for families wherein the nonzero component of the response is discrete, but model comparison is more complicated when the nonzero component of the response is continuous, because the log-likelihood calculation mixes probability mass (for the data zeros) with probability density (for the data nonzeros).

Formally, whenever there is a literal zero in the data, a zero-hurdle model should be infinitely preferred over a non-hurdle continuous model, because probability masses are infinitely larger than probability densities.

Thus, to compare a hurdle-continuous family to its non-hurdle counterpart, you need to think carefully about how to weight the probability masses in the likelihood versus the probability densities. It’s often appropriate to take the log-likelihood of zeros, even in the model without the hurdle component, to be the probability mass associated with integrating the probability density over some region of practical equivalence to zero (in which case it would be best to fit the non-hurdle version of the model using this likelihood).

2 Likes