Zero-Inflated models in Stan

Let me preface my reply by noting that hurdle models wrap a lot of really subtle probability theory, which is why they can be so confusing once you look a little deeper.

Also keep in mind that in Stan each program specifies a joint density function \pi(y, \theta), where y are the data and \theta are the parameters; a term like “hurdle model” isn’t particularly meaningful until you can write it out as Stan code. This can be complicated because popular terms sometimes specify ill-defined models, and hence they can’t necessarily be written as Stan programs as easily as one might think! Given all of this we should expect models like this to be subtle.

The ultimate problem is that you can’t compare probability densities, which define continuous models, to probabilities, which define discrete models. When trying to mix continuous outcomes with discrete outcomes (i.e. inflation components) you have to be careful to compare apples to apples.

This means comparing probabilities to probabilities. For a discrete model the probability of an outcome is easy – it’s just the probability mass assigned to that point. For zero inflation that would be all of the probability at zero, P[y = 0] = 1. But for a continuous model the probability of any given point is zero! On continuous spaces we can assign non-zero probability essentially only to intervals, and for a continuous baseline model that means P[y = 0] = 0. Consequently if you ask what probability the two components assign to y = 0 then the discrete inflation model will always assign infinitely more than the continuous baseline model, and if you observe y = 0 then you know it had to have come from the inflation model.
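To make this explicit, the full observational model is a mixture; writing \delta_0 for a point mass at zero (notation I’m introducing here, with \lambda the inflation probability as in the code below),

\pi(y \mid \lambda, \theta) = \lambda \, \delta_0(y) + (1 - \lambda) \, \pi_{\text{baseline}}(y \mid \theta),

so that P[y = 0] = \lambda exactly, with the entire contribution coming from the inflated component.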

In Stan we can implement a zero-hurdle model by looping over the observations and branching on a conditional statement,

for (n in 1:N) {
  if (y[n] == 0) {
    // Contribution only from the inflated component
    target += log(lambda);
  } else {
    // Contribution only from the baseline component; baseline_lpdf
    // stands in for whatever continuous density you choose
    real lpdf = baseline_lpdf(y[n] | theta);
    target += log(1 - lambda) + lpdf;
  }
}
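For concreteness, here is one way this looks as a complete program, assuming a lognormal baseline; the lognormal choice and the implicit flat priors are purely for illustration, not part of the general construction.

data {
  int<lower=1> N;
  vector<lower=0>[N] y;  // zeros come from the inflated component
}

parameters {
  real<lower=0, upper=1> lambda;  // inflation probability
  real mu;                        // baseline location
  real<lower=0> sigma;            // baseline scale
}

model {
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log(lambda);
    } else {
      target += log(1 - lambda) + lognormal_lpdf(y[n] | mu, sigma);
    }
  }
}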

Because each branch considers distinct observations (y[n] == 0 for the first branch and y[n] != 0 for the second) we can accumulate their contributions independently. In particular, because the contribution from each zero observation is the same, we can write the first branch as

target += N_zero * log(lambda);

where N_zero is the number of zero observations.
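If y is declared in the data block then N_zero can be computed once up front, for example in transformed data; a minimal sketch:

transformed data {
  int N_zero = 0;
  for (n in 1:N) {
    if (y[n] == 0) {
      N_zero += 1;
    }
  }
}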

Likewise we can simplify the second branch to

target += (N - N_zero) * log(1 - lambda);

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}
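As an aside, the remaining loop can also be vectorized by collecting the non-zero observations once, say in transformed data (the name y_nonzero is mine, and this sketch assumes N_zero has already been computed as above):

transformed data {
  vector[N - N_zero] y_nonzero;
  {
    int m = 1;
    for (n in 1:N) {
      if (y[n] != 0) {
        y_nonzero[m] = y[n];
        m += 1;
      }
    }
  }
}

Most built-in _lpdf functions accept vector arguments, so with a concrete baseline the model block contribution then collapses to a single statement such as target += lognormal_lpdf(y_nonzero | mu, sigma);.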

The entire model is then

target += N_zero * log(lambda);
target += (N - N_zero) * log(1 - lambda);

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}

Now those first two expressions simplify:

target += N_zero * log(lambda);
target += (N - N_zero) * log(1 - lambda);

is equivalent to

target += N_zero * log(lambda) + (N - N_zero) * log(1 - lambda);

which is equivalent to

target += log( lambda^N_zero * (1 - lambda)^(N - N_zero) );

or, up to the binomial coefficient (a constant with respect to the parameters, and hence irrelevant for posterior inference),

target += binomial_lpmf(N_zero | N, lambda);
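Writing out the binomial log probability mass function shows both the correspondence and the leftover constant:

\log \text{Binomial}(N_{\text{zero}} \mid N, \lambda) = \log \binom{N}{N_{\text{zero}}} + N_{\text{zero}} \log \lambda + (N - N_{\text{zero}}) \log(1 - \lambda).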

In other words, because y = 0 and y != 0 are modeled by separate data generating processes, the entire model decomposes into a binomial model for the total number of inflated observations and a baseline model for the non-zero observations.
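In terms of the joint likelihood this decomposition reads

\pi(y_1, \ldots, y_N \mid \lambda, \theta) = \lambda^{N_{\text{zero}}} \, (1 - \lambda)^{N - N_{\text{zero}}} \prod_{n : y_n \neq 0} \pi_{\text{baseline}}(y_n \mid \theta),

where the first two factors match the binomial probability mass function up to the constant binomial coefficient.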

Again, if you’re careful with the math then there is no difference between how one would implement a hurdle model or a continuous-inflated-by-zero model in Stan. The practical consequence is that one can model everything,

target += binomial_lpmf(N_zero | N, lambda);

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}

or just the zero observations

target += binomial_lpmf(N_zero | N, lambda);

or just the non-zero observations

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}
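Putting the pieces together, a complete program for the fully decomposed model might look like the following; as before the lognormal baseline and flat priors are illustrative assumptions, and the transformed data code assumes a Stan version recent enough to allow declarations after statements.

data {
  int<lower=1> N;
  vector<lower=0>[N] y;
}

transformed data {
  // Count the zero observations
  int N_zero = 0;
  for (n in 1:N) {
    if (y[n] == 0) {
      N_zero += 1;
    }
  }

  // Collect the non-zero observations
  vector[N - N_zero] y_nonzero;
  {
    int m = 1;
    for (n in 1:N) {
      if (y[n] != 0) {
        y_nonzero[m] = y[n];
        m += 1;
      }
    }
  }
}

parameters {
  real<lower=0, upper=1> lambda;
  real mu;
  real<lower=0> sigma;
}

model {
  // Binomial model for how many observations were inflated
  target += binomial_lpmf(N_zero | N, lambda);
  // Baseline model for the non-zero observations
  target += lognormal_lpdf(y_nonzero | mu, sigma);
}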