Zero-Inflated models in Stan

Thanks so much for your kind reply!

I am still a little confused about why the baseline model is negligible when the inflated value is observed, and why the mixture model always reduces to two independent contributions. Is there any reference paper on this that I could read?

Also, when you said ‘If you care about both then you can write a Stan program with both contributions and not worry about trying to build a mixture model,’ do you mean that I should use a hurdle model (a two-step model) instead of a zero-inflated mixture model in Stan?

Thanks!

Because a continuous distribution will never produce a zero.
Try this with R

> any(rnorm(10000000,0,1) == 0.0)
[1] FALSE

Ten million draws, not a single zero.
Some of them get close

> any(abs(rnorm(10000000,0,1)) < 1e-6)
[1] TRUE

but not that close

> any(abs(rnorm(10000000,0,1)) < 1e-7)
[1] FALSE

If you see 0.0 and not, say, 0.000006172252, then you know it didn’t come from the continuous component. You have a zero-augmented but not a zero-inflated model.

However, the distinction between 0.0 and 0.000006172252 assumes sufficient measurement precision. If the values are rounded then the so-called continuous part is more accurately thought of as a discrete distribution that is being approximated by a normal distribution. If the granularity caused by rounding is large enough it’s possible to observe zeros from both processes in the mixture.
In that case the model is a zero-inflated model except that you need to add log(granularity) to the lpdf, something like

if (y[i] == 0) {
  // a rounded zero can come from either the inflation component
  // or the discretized continuous component
  target += log_sum_exp(
    bernoulli_lpmf(1 | theta),
    bernoulli_lpmf(0 | theta)
      + normal_lpdf(y[i] | mu, sigma)
      + log(0.001)); // y has three decimals of precision
} else {
  // non-zero values can only come from the continuous component
  target += normal_lpdf(y[i] | mu, sigma) + log(0.001);
}
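
One way to see where the log(granularity) term comes from: rounding to a grid of width \delta means an observed value y really stands for the interval (y - \delta/2, y + \delta/2), so the probability the continuous component assigns to it is approximately

\int_{y - \delta/2}^{y + \delta/2} \text{normal}(x \mid \mu, \sigma) \, dx \approx \delta \cdot \text{normal}(y \mid \mu, \sigma),

which on the log scale is normal_lpdf(y | mu, sigma) + log(delta), with \delta = 0.001 in the example above.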

Thanks so much for your kind reply!

So in Stan I could fit a zero-inflated model for a mixture of a discrete and a continuous distribution by adding log(granularity) to the log-likelihood to ensure differentiability, am I right? Also, I am confused about why you add both bernoulli_lpmf(1 | theta) and bernoulli_lpmf(0 | theta) to the target instead of only bernoulli_lpmf(0 | theta).

An additional question is about the hurdle model: could I rewrite your Stan code as a hurdle model? I am thinking that the hurdle model is a generalization of the zero-inflated model.

Thanks!

It has nothing to do with differentiability; it just puts the continuous and discrete components on a comparable scale. I’m not sure why the user guide talks about differentiability; maybe it’s thinking about the situation where y is a parameter? Note that y as a parameter cannot work; the above requires y to be data.

It’s a typical mixture model. Keep in mind that they aren’t added together but are in different arguments to log_sum_exp.
I adapted the code from a recent post of yours since that seemed likely to be related to whatever it is you’re trying to do now.


Thanks so much for your kind reply!

Section 5.6 of the Stan User’s Guide says ‘Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal as a regression coefficient prior.’ I am not sure whether that relates to treating y as data or as a parameter.

I see your point about the arguments to log_sum_exp. I am also curious about the hurdle model: is the hurdle model a generalization of the zero-inflated model, and could I rewrite your Stan code as a hurdle model?

Thanks!

Let me preface my reply by noting that hurdle models wrap a lot of really subtle probability theory, which is why they can be so confusing once you look a little deeper.

Also keep in mind that in Stan each program specifies a joint density function \pi(y, \theta), where y are the data and \theta are the parameters; a term like “hurdle model” isn’t particularly meaningful until you can write it out as Stan code. This can be complicated because popular terms sometimes specify ill-defined models and hence can’t necessarily be written as Stan programs as easily as one might think! All of this is to say that we should expect models like this to be subtle.

The ultimate problem is that you can’t compare probability densities, which define continuous models, to probabilities, which define discrete models. When trying to mix continuous outcomes with discrete outcomes (i.e. inflation components) you have to be careful to compare apples to apples.

This means comparing probabilities to probabilities. For a discrete model the probability of an outcome is easy: it’s just the probability mass assigned to that point. For zero inflation that is all of the probability at zero, P[y = 0] = 1. But for a continuous model the probability of any single point is zero! On continuous spaces we can assign finite probability essentially only to intervals, so for a continuous baseline model P[y = 0] = 0. Consequently, if you ask what probability the two components assign to y = 0, the discrete inflation model will always assign infinitely more than the continuous baseline model, and if you observe y = 0 then you know it had to have come from the inflation model.
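
To make that concrete, suppose the inflation component is chosen with probability \lambda. The total probability the mixture assigns to the single point y = 0 is

\lambda \cdot P_{\text{inflation}}[y = 0] + (1 - \lambda) \cdot P_{\text{baseline}}[y = 0] = \lambda \cdot 1 + (1 - \lambda) \cdot 0 = \lambda,

so conditional on observing an exact zero the inflation component is responsible with probability one.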

In Stan we can implement a zero-hurdle with a conditional statement,

if (y[n] == 0) {
  // Contributions only from inflated component
  target += log(lambda);
} else {
  // Contribution only from the baseline component
  real lpdf = baseline_lpdf(y[n] | theta);
  target += log(1 - lambda) + lpdf;
}

Because each branch considers distinct observations (y[n] == 0 for the first branch and y[n] != 0 for the second branch) we can loop over them independently. In particular because the contribution from each zero observation is the same we can write the first branch as

target += N_zero * log(lambda);

where N_zero is the number of zero observations.

Likewise we can simplify the second branch to

target += (N - N_zero) * log(1 - lambda);

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}

The entire model is then

target += N_zero * log(lambda);
target += (N - N_zero) * log(1 - lambda);

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}

Now those first two expressions simplify:

target += N_zero * log(lambda);
target += (N - N_zero) * log(1 - lambda);

is equivalent to

target += N_zero * log(lambda) + (N - N_zero) * log(1 - lambda);

which is equivalent to

target += log(lambda^N_zero * (1 - lambda)^(N - N_zero));

or, up to the constant binomial coefficient that does not depend on lambda,

target += binomial_lpmf(N_zero | N, lambda);

In other words because y = 0 and y != 0 are modeled by separate data generating processes the entire model decomposes into a binomial model for the total number of inflated observations and a baseline model for the non-zero observations.

Again if you’re careful with the math then there is no difference between how one would implement a hurdle model or a continuous-inflated-by-zero model in Stan. What I’m referring to is that one can model everything

target += binomial_lpmf(N_zero | N, lambda);

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}

or just the zero observations

target += binomial_lpmf(N_zero | N, lambda);

or just the non-zero observations

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}
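
Putting the pieces together, a minimal self-contained Stan program for this hurdle model, assuming a normal baseline with parameters mu and sigma and some weakly informative priors of my own choosing (the baseline model above is left abstract), might look like:

data {
  int<lower=1> N;
  vector[N] y; // observations, some of which are exactly zero
}

transformed data {
  int N_zero = 0;
  for (n in 1:N) {
    if (y[n] == 0) N_zero += 1;
  }
}

parameters {
  real<lower=0, upper=1> lambda; // probability of an inflated zero
  real mu;
  real<lower=0> sigma;
}

model {
  lambda ~ beta(1, 1);
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);

  // binomial model for the number of zeros
  target += binomial_lpmf(N_zero | N, lambda);

  // baseline model for the non-zero observations only
  for (n in 1:N) {
    if (y[n] != 0) {
      target += normal_lpdf(y[n] | mu, sigma);
    }
  }
}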

What great answers from @betanalpha and @nhuurre! I just wanted to chime in with a few details that might save you some headache depending on how you are thinking about your own model:

  1. The simplifications of the likelihood from @betanalpha’s post above will improve computational efficiency and should be used if possible. However, they are not applicable in all cases. Namely, if you want to model lambda as varying across observations depending on covariates, then you cannot convert the Bernoulli likelihood to a binomial likelihood.
  2. When we say that the models can be treated as independent, we really mean that the data model for y == 0 conditioned on lambda is independent of the model for y != 0 conditioned on theta. If you want, it’s still fine to model non-independence between lambda and theta hierarchically, for example via correlated (i.e. multivariate) random effects.
  3. As @nhuurre indicates above, there are some scenarios where you might worry that the “continuous” distribution is yielding zeros due to the limited precision of the measurement apparatus, and there is also a solution for this. In practice it might not matter much to your inference, except in the important case where the hurdle zeros are supposed to represent a fundamentally different data-generating process and you desire inference specifically on one or the other process in isolation. In this case you might obtain biased inference if a nontrivial fraction of the zeros are actually from the discretized continuous distribution. If you just care about predictive performance, rather than parsing the two processes, then this issue is less likely to pose problems, though if the discretization is severe enough it will eventually manifest as poor fit in the continuous distribution.

To be clear, in this case there is a similar decomposition even after covariates are introduced; only instead of having independent Bernoulli and baseline models one gets a logistic regression alongside the baseline model.


Hi, would you please explain how we should do prediction over test data using these two-part models? Of course from the posterior samples we can compute y^* = lambda * 0 + (1 - lambda) * baseline(theta), but is that really all? I think I am missing something. Sorry if the question is pretty basic.

Using the covariates* associated with each test data point t we can compute posterior distributions for \lambda_t and \theta_t. Samples from the posterior predictive distribution are then obtained as follows:
For each posterior iteration i, draw a zero with probability \lambda_{t_{i}}. Otherwise draw a sample from baseline(\theta_{t_i}).

Note that this is quite different from the expression in your post, which will never yield a zero (it simplifies to (1-\lambda)baseline(\theta)).

*If there are no covariates, \lambda and \theta will be parameters in the model, and then the posterior for \lambda_t and \theta_t will just be the posteriors for those parameters.
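
As a minimal sketch, assuming no covariates, a normal baseline with parameters mu and sigma, and a data declaration int N_pred for the number of test points (these names are mine), the recipe can be written directly in generated quantities:

generated quantities {
  vector[N_pred] y_pred;
  for (t in 1:N_pred) {
    // with probability lambda draw the inflated zero,
    // otherwise draw from the baseline model
    if (bernoulli_rng(lambda)) {
      y_pred[t] = 0;
    } else {
      y_pred[t] = normal_rng(mu, sigma);
    }
  }
}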


As @jsocolar notes, prediction follows by sampling from the mixture components. With probability \lambda you’d pick the zero component, which always returns zero, and with probability 1 - \lambda you’d pick the non-zero component, which would return a sample from the baseline model \pi_{\text{baseline}}(\theta).

For a concrete example see “stan_programs/simu_bayesian_ensemble2.stan” in Section 4.2 of Towards A Principled Bayesian Workflow. Note that the components are switched there, so that the zero component is sampled with probability 1 - \lambda and not \lambda.


Thanks for sharing your very detailed example. I have a question, though. It seems to me that this model completely ignores the feature set of the items with y = 0, and so it works poorly in prediction. In my setting, I have N items, each with D features and a representative_weight. Around 98% of the items are mapped to y = 0 and the remaining 2% are mapped to a non-negative real value. Based on this thread and the example you kindly shared, I came up with the following model:

    data {
      int<lower=1> N; // number of observations, N = N_zero + N_non_zero
      int<lower=1> N_pred; // number of test points to be predicted
      int<lower=0> N_zero; 
      int<lower=0> N_non_zero; 
      int<lower=1> I; // number of predictors
      matrix[N_non_zero,I] X; // matrix of predictors
      vector[N_non_zero] y; // scaled normalized y values (for those with y!=0)
      
      matrix[N_pred,I] X_test; // matrix of predictors
    }


    transformed data {
      matrix[N_non_zero,I+1] X_intercept = append_col(rep_vector(1, N_non_zero), X);
      matrix[N_pred,I+1] X_test_intercept = append_col(rep_vector(1, N_pred), X_test);
    }

    parameters {
      vector[I+1] beta_coeff; // intercept and predictor coefficients
      real<lower=0> sigma; // scale parameter
      real<lower=0, upper=1> theta; // mixture probability of a zero
    }
    
    model {
      beta_coeff ~ normal(0, 1);
      sigma ~ normal(0, 1);
      theta ~ beta(1, 1);
     
      target += (N_zero * log(theta));
     
     for(n in 1:(N_non_zero)) {
        target += normal_lpdf(y[n] | X_intercept[n] * beta_coeff, sigma);
        target += log(1- theta);
     }
    }
    
    generated quantities {
      vector[N_pred] y_pred= rep_vector(0,N_pred);
      for (n in 1:N_pred)
        if (!bernoulli_rng(theta))
          y_pred[n] = normal_rng(X_test_intercept[n] * beta_coeff, sigma);
   }

This model works very poorly on the held-out data, especially for the minority class (non-zero output), which is super important in my use case. I suspect it is because the feature set of the items with y = 0 is ignored in training.

I would really appreciate it if you could help me understand the part that I am missing. Thanks!


There is nothing preventing you from modeling observation-specific theta based on covariates.
One fairly standard approach would be to use a logit-scale linear predictor.

Assuming you want to use the same feature set to predict extra zeros as you use to predict the mean conditional on a nonzero, your model could look something like (this is written to be intelligible, and not optimized for efficiency!):

data {
  int<lower=1> N; // number of observations, N = N_zero + N_non_zero
  int<lower=1> I; // number of predictors
  matrix[N, I] X; // matrix of predictors
  vector[N] y;    // scaled normalized y values
}

transformed data {
  matrix[N, I+1] X_intercept = append_col(rep_vector(1, N), X);
}

parameters {
  vector[I+1] beta_coeff;    // intercept and predictor coefficients
  vector[I+1] beta_coeff_zi; // coefficients for the logit-scale zero-inflation probability
  real<lower=0> sigma;       // scale parameter
}

model {
  beta_coeff ~ normal(0, 1);
  // add an appropriate prior on beta_coeff_zi
  sigma ~ normal(0, 1);

  for (n in 1:N) {
    if (y[n] == 0) {
      target += bernoulli_logit_lpmf(0 | X_intercept[n] * beta_coeff_zi);
    } else {
      target += bernoulli_logit_lpmf(1 | X_intercept[n] * beta_coeff_zi)
                  + normal_lpdf(y[n] | X_intercept[n] * beta_coeff, sigma);
    }
  }
}

For a nice treatment of such models using Stan, see the zero-inflated models section of the brms vignette on “distributional regression” here: Estimating Distributional Models with brms

And check out the Stan code that brms generates for these models.

On the other hand, if, like in @betanalpha’s example, there is a robust conceptual model of the system that leads you to believe that the source of zeros is orthogonal to the conditions that are being tested (e.g. if they arise from a detector that fails at random), then attempting to predict the excess zeros based on covariates will not be useful in practice.


Wanted to thank you @jsocolar for helping me understand the problem better, for the code snippet and the helpful link. Really appreciate it.


I was just providing an example of how to simulate from a mixture model. As @jsocolar notes, if you want the mixture probability itself to depend on the covariates then you’ll need a slightly more sophisticated model.


@betanalpha @nhuurre @jsocolar Thank you for your great discussions in this topic.

I am trying to model some data generated by some singers when they sang a couple of songs. The response variable is their singing deviation from the theoretical note that they were supposed to sing. My data looks like this:

As you can see there is a spike at exactly zero (no deviation at all) in the original dataset (on the left), and if those zeros are removed we get the distribution on the right. Like @stan_beginer, at first I thought I might need to model the original data using a zero-inflated mixture model, but after reading your great discussion I now think this could be dealt with in an easier way. Please correct me if I am wrong; as far as I understand:

1- I am dealing with two independent data generating processes, one continuous and one discrete.
2- I can simply run two separate models: one on the non-zero values, possibly with a logistic likelihood (I am not sure which likelihood to use; the response variable and the residuals after fitting a Gaussian model are not normal), and another on the whole dataset with a Bernoulli likelihood, treating the response variable as zero or non-zero.

I didn’t understand the necessity of fitting both models simultaneously if the data-generating processes are independent. A clarification would be highly appreciated.


I think this is confusing the hurdle model and the zero-inflated model. Here they are without the log(0.001) modification.

Zero inflated normal

This allows 0 to be generated either by the discrete model or the continuous model.

target += log_sum_exp(bernoulli_lpmf(1 | theta),
                      bernoulli_lpmf(0 | theta)
                        + normal_lpdf(y[n] | mu, sigma));

Hurdle normal

This is a pure mixture of a discrete and a continuous component.

if (y[n] == 0)
  target += bernoulli_lpmf(1 | theta);
else
  target += bernoulli_lpmf(0 | theta)
              + normal_lpdf(y[n] | mu, sigma);

You are right.

If the generative processes are fully independent and do not share any parameters (e.g. covarying random effects) then it’s fine to model them separately, and this can even be preferable: if there are problems in the posterior geometry, separating the models will easily and definitively localize those problems to one side or the other. A potential advantage of fitting jointly is that, if you are using brms or similar, you get all the machinery to predict the response in one step.


Yes!

I agree that the continuous data looks a bit heavy-tailed here. I might start with a Student-t observational model with a relatively conservative prior on \nu^{-1}, such as \text{normal}(\nu^{-1} \mid 0, 0.11) that keeps \nu above 4-ish.

The separation here is a consequence of mixing discrete and continuous processes. Consider what would happen for two discrete processes, such as zero-inflating a Poisson model. Here the observation y = 0 has a non-zero probability of arising from both models so we can’t precisely assign that observation to one model or the other. Instead we have to fit the joint model that allows for the possibility that the zero came from either component at the same time.
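
For reference, a minimal sketch of that per-observation contribution for a zero-inflated Poisson, writing theta for the inflation probability and rate for the Poisson intensity (both names are mine), would look like:

if (y[n] == 0) {
  // an observed zero can come from either the inflation component
  // or the Poisson component
  target += log_sum_exp(bernoulli_lpmf(1 | theta),
                        bernoulli_lpmf(0 | theta)
                          + poisson_lpmf(0 | rate));
} else {
  target += bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n] | rate);
}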


That’s the difference between zero-inflation and the hurdle model. With zero-inflated Poisson, there are two potential sources of a zero—the Poisson or the inflation. With the hurdle model, the zero always comes from the zero component of the mixture.
