If y contains n data points, n′ of which are zeros, then this should produce n + (n − n′) log_lik values, correct? If so, running a non-hurdle version of this model (just a traditional Poisson model) would produce n log_lik values. My understanding is that I can’t compare the ELPDs of two models fit to ‘different data’, i.e. with a different number of log_lik values. I’m hoping to compare the hurdle model to a model of the same data without the hurdle component. Could someone correct my understanding or offer a solution?
They should be directly comparable. The likelihood for a nonzero observation is the product of one minus the probability of zero and the truncated Poisson mass function. In other words, it’s one term per person, not two. See, for example, the Stan user’s guide:
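To make the "one term per observation" point concrete, here is a minimal numerical sketch of the per-observation hurdle Poisson log-likelihood (not brms's or Stan's internal code; `pi` and `lam` are assumed scalar parameters for illustration). Note that it returns exactly n values for n data points, so the ELPDs are comparable to those of a plain Poisson model:

```python
import math

def hurdle_poisson_loglik(y, pi, lam):
    """Per-observation log-likelihood for a hurdle Poisson model.

    pi  : hurdle probability, i.e. P(observation is zero)
    lam : rate of the zero-truncated Poisson for the nonzero part
    Returns one log_lik value per data point -- n values for n observations.
    """
    log_lik = []
    for yi in y:
        if yi == 0:
            # zero observations contribute only the hurdle probability
            log_lik.append(math.log(pi))
        else:
            # nonzero observations contribute ONE combined term:
            # log(1 - pi) plus the zero-truncated Poisson log-pmf
            ll = (math.log1p(-pi)
                  + yi * math.log(lam) - lam - math.lgamma(yi + 1)  # Poisson log-pmf
                  - math.log1p(-math.exp(-lam)))                    # truncation at zero
            log_lik.append(ll)
    return log_lik
```

As a sanity check, exponentiating these terms and summing over all possible y values gives (essentially) 1, confirming this is a proper per-observation likelihood.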
Are you looking at an old version of the manual? Perhaps the top of the page says “This is an old version, view current version.” The latest version no longer has that code (I know because I removed it). I hope the current version is clearer.
An important note here: for model comparison purposes, the above works perfectly for families in which the nonzero component of the response is discrete. Model comparison is more complicated when the nonzero component is continuous, because the log-likelihood calculation then mixes probability mass (for the data zeros) with probability density (for the data nonzeros).
Formally, whenever there is a literal zero in the data, a zero-hurdle model should be infinitely preferred over a non-hurdle continuous model, because probability masses are infinitely larger than probability densities.
Thus, to compare a hurdle-continuous family to its non-hurdle counterpart, you need to think carefully about how to weight the probability masses against the probability densities in the likelihood. One often-appropriate approach is to treat the log-likelihood of each zero, even in the model without the hurdle component, as the probability mass obtained by integrating the probability density over some region of practical equivalence to zero (in which case it would be best to fit the non-hurdle version of the model using this likelihood as well).
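The "region of practical equivalence" idea above can be sketched numerically. This is a hypothetical illustration, assuming a lognormal continuous component and an arbitrary threshold `eps` (both are illustrative choices, not something the model determines for you): the zero's contribution becomes a probability mass, log P(0 ≤ Y < eps), which is then on the same scale as the hurdle model's log(pi):

```python
import math

def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal(mu, sigma) distribution, via the normal CDF."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def zero_log_lik_rope(eps, mu, sigma):
    """Log probability MASS that a lognormal(mu, sigma) outcome falls in
    [0, eps), the assumed region of practical equivalence to zero.
    This mass is comparable to a hurdle model's log(pi) for a zero;
    a raw density value at/near zero would not be."""
    return math.log(lognormal_cdf(eps, mu, sigma))
```

The smaller you make `eps`, the smaller this mass gets, which is exactly the formal point above: as `eps` shrinks toward a literal point mass comparison, the non-hurdle continuous model is penalized without bound, so the choice of `eps` must be justified substantively.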