Zero-Inflated models in Stan

Hi,

I am working on a zero-inflated log-normal model, and I have read in the Stan User’s Guide that ‘Zero inflation does not work for continuous distributions in Stan because of issues with derivatives’. May I ask how we could implement such a model in that case?

A more general question: can we think of a zero-inflated model as a special case of a two-component mixture model, where one mixing distribution is a Bernoulli (for the inflated zeros) and the other mixing distribution is a log-normal?

Also, I know the zero-inflated Poisson but have never encountered a zero-inflated continuous distribution before. What does it mean to have a continuous zero-inflated model, given that any point mass in a continuous pdf is 0? Why might we need such a model?

Thanks so much!

3 Likes

Zero-augmented models come in different guises depending on how we treat the zeros: the zeros could all come from a completely different process to the rest of the data (e.g. some people didn’t bother to press the button during a reaction time experiment) or the zeros could be a mixture of two data-generating processes (e.g. in wildlife abundance surveys, zeros could mean there are no species in an area or they just weren’t observed).

For the latter, the zeros are often termed ‘inflated’ because some proportion of them comes from a different process, and so we need a mixture of two distributions that can each produce zeros. Hence the zero-inflated Poisson or negative binomial distributions, which have point mass at zero.

For the former case, the zeros are not inflated, they are just from a different process, and therefore we do not need a mixture of two distributions (in the classic sense), only a statement in the likelihood to choose a (for instance) Bernoulli distribution to model the zeros separately from the rest of the data. These are often referred to as hurdle models (e.g. someone needs to surpass the ‘hurdle’ of motivation to press the button in the reaction time experiment).

If your data is right-skewed and continuous, it cannot have point mass directly at zero (from the definition of probability density), so you can only choose a hurdle-style model. It makes no sense to say that the zeros might also come from the log-normal component, because the log-normal distribution only has support (0, \infty).
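
For concreteness, a minimal sketch of such a log-normal hurdle in Stan might look like the following (the variable names and priors here are placeholders rather than recommendations):

data {
  int<lower=1> N;
  vector<lower=0>[N] y;            // zeros and strictly positive values
}
parameters {
  real<lower=0, upper=1> theta;    // probability of a (structural) zero
  real mu;                         // log-scale location of the positive part
  real<lower=0> sigma;             // log-scale scale of the positive part
}
model {
  theta ~ beta(1, 1);
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  for (n in 1:N) {
    if (y[n] == 0) {
      target += bernoulli_lpmf(1 | theta);            // zero process
    } else {
      target += bernoulli_lpmf(0 | theta)             // not a zero...
                + lognormal_lpdf(y[n] | mu, sigma);   // ...so it comes from the log-normal
    }
  }
}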

This was a long way of describing what I think you already were intuitively realising, but you were being bitten a bit by terminology.

7 Likes

Thanks so much for your kind reply!

I think my situation is more like the latter, where the zeros are a mixture of two data-generating processes (e.g., a measurement of zero could indicate there is truly nothing, or it could indicate that the quantity was simply not observed or was too low to measure, so we record zero). But the tricky point is that one of my mixture components is continuous (maybe log-normal is not a good example, so let’s think of it as a normal distribution). I am wondering whether in this case Stan can model a mixture of a discrete distribution and a continuous distribution for the latter ‘inflated’ case you mentioned?

Thx!

1 Like

Inflation as a mixture model becomes…complicated when the baseline model is continuous and not discrete. In this case the relative probability of observing the inflated value under the (discrete) inflation model versus the (continuous) baseline model is infinite; in other words, if you see the inflated value then you basically know that the probability of the inflated component is 1 and the probability of the baseline component is zero. The technical details require carefully differentiating between probabilities and probability densities.

Fortunately the fact that the baseline model is negligible when the inflated value is observed means that the mixture model always reduces to two independent contributions: a binomial model for observing inflated values versus non-inflated values, and the baseline model for the non-inflated values. If you only care about how often you see an inflated value then you just need the binomial model; if you only care about the non-inflated values then you can ignore the observations at the inflated values entirely. If you care about both then you can write a Stan program with both contributions and not worry about trying to build a mixture model.
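
For instance, assuming a normal baseline and that the non-zero observations are passed in separately from the zero count (all names here are placeholders), such a “both contributions” program might look like this sketch:

data {
  int<lower=0> N;                    // total number of observations
  int<lower=0, upper=N> N_zero;      // number of observations equal to the inflated value
  vector[N - N_zero] y_non_zero;     // the remaining, non-inflated observations
}
parameters {
  real<lower=0, upper=1> theta;      // probability of the inflated value
  real mu;
  real<lower=0> sigma;
}
model {
  theta ~ beta(1, 1);
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // Binomial model for how often the inflated value is observed
  N_zero ~ binomial(N, theta);
  // Baseline model for the non-inflated observations only
  y_non_zero ~ normal(mu, sigma);
}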

5 Likes

Thanks so much for your kind reply!

I am still a little bit confused about why the baseline model is negligible when the inflated value is observed and why the mixture model always reduces to two independent contributions. May I ask whether there is any reference paper on this so that I could read more?

Also when you said that ‘If you care about both then you can write a Stan program with both contributions and not worry about trying to build a mixture model.’, do you mean that I should use a Hurdle model (a two-step model) instead of a zero-inflated mixture model in Stan?

Thx!

Because a continuous distribution will never produce a zero.
Try this with R

> any(rnorm(10000000,0,1) == 0.0)
[1] FALSE

Ten million draws, not a single zero.
Some of them get close

> any(abs(rnorm(10000000,0,1)) < 1e-6)
[1] TRUE

but not that close

> any(abs(rnorm(10000000,0,1)) < 1e-7)
[1] FALSE

If you see 0.0 and not, say, 0.000006172252, then you know it didn’t come from the continuous component. You have a zero-augmented but not a zero-inflated model.

However, the distinction between 0.0 and 0.000006172252 assumes sufficient measurement precision. If the values are rounded then the so-called continuous part is more accurately thought of as a discrete distribution that is being approximated by a normal distribution. If the granularity caused by rounding is large enough it’s possible to observe zeros from both processes in the mixture.
In that case the model is a zero-inflated model except that you need to add log(granularity) to the lpdf, something like

if (y[i] == 0) {
  target += log_sum_exp(
    bernoulli_lpmf(1 | theta),
    bernoulli_lpmf(0 | theta)
       + normal_lpdf(y[i] | mu, sigma)
       + log(0.001)); // y has three decimals of precision
} else {
  target += bernoulli_lpmf(0 | theta)  // non-zero values also carry the (1 - theta) factor
       + normal_lpdf(y[i] | mu, sigma) + log(0.001);
}
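
For reference, here is one way the same idea might look as a self-contained program, with the measurement granularity passed in as data rather than hard-coded (the priors here are placeholders):

data {
  int<lower=1> N;
  vector[N] y;                       // rounded measurements
  real<lower=0> granularity;         // e.g. 0.001 for three decimals of precision
}
parameters {
  real<lower=0, upper=1> theta;      // probability of a structural zero
  real mu;
  real<lower=0> sigma;
}
model {
  theta ~ beta(1, 1);
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  for (i in 1:N) {
    if (y[i] == 0) {
      target += log_sum_exp(
        bernoulli_lpmf(1 | theta),                 // structural zero
        bernoulli_lpmf(0 | theta)                  // rounded "continuous" zero
          + normal_lpdf(y[i] | mu, sigma)
          + log(granularity));
    } else {
      target += bernoulli_lpmf(0 | theta)
                + normal_lpdf(y[i] | mu, sigma)
                + log(granularity);
    }
  }
}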
6 Likes

Thanks so much for your kind reply!

So in Stan I could fit a zero-inflated model for a mixture of a discrete and a continuous distribution by adding log(granularity) to the log-likelihood to ensure differentiability, am I right? Also, I am confused about why you add both bernoulli_lpmf(1 | theta) and bernoulli_lpmf(0 | theta) to the target instead of only adding bernoulli_lpmf(0 | theta)?

An additional question is about hurdle models: could I rewrite your Stan code as a hurdle model? I am thinking that the hurdle model is a generalization of the zero-inflated model.

Thanks!

It has nothing to do with differentiability; it just puts the continuous and discrete components on a comparable scale. I’m not sure why the user guide talks about differentiability; maybe it’s thinking about a situation where y is a parameter? Note that y as a parameter cannot work here; the above requires y to be data.

It’s a typical mixture model. Keep in mind that they aren’t added together but are in different arguments to log_sum_exp.
I adapted the code from a recent post of yours since that seemed likely to be related to whatever it is you’re trying to do now.

1 Like

Thanks so much for your kind reply!

In the Stan User’s Guide, Section 5.6, it says ‘Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal as a regression coefficient prior.’ I am not sure whether this relates to treating y as data or as a parameter.

I see your point about the expression in log_sum_exp. I am also curious about hurdle models: for example, is the hurdle model a generalization of the zero-inflated model, and could I rewrite your Stan code in terms of a hurdle model?

Thanks!

Let me preface my reply by noting that hurdle models wrap a lot of really subtle probability theory, which is why they can be so confusing once you look a little deeper.

Also keep in mind that in Stan each program specifies a joint density function \pi(y, \theta) where y are the data and \theta are the parameters; a term like “hurdle model” isn’t particularly meaningful until you can write it out as Stan code. This can be complicated because popular terms sometimes specify ill-defined models, and hence they can’t necessarily be written as Stan programs as easily as one might think! Given all of this, we should expect models like this to be subtle.

The ultimate problem is that you can’t compare probability densities, which define continuous models, to probabilities, which define discrete models. When trying to mix continuous outcomes with discrete outcomes (i.e. inflation components) you have to be careful to compare apples to apples.

This means comparing probabilities to probabilities. For a discrete model the probability of an outcome is easy: it’s just the probability mass assigned to that point. For zero inflation that would be all of the probability at zero, P[y = 0] = 1. But for a continuous model the probability of any given point is zero! On continuous spaces we can assign nonzero probability essentially only to intervals. For a continuous baseline model that means P[y = 0] = 0. Consequently if you ask what probability the two components assign to y = 0 then the discrete inflation model will always assign infinitely more than the continuous baseline model, and if you observe y = 0 then you know it had to have come from the inflation model.

In Stan we can implement a zero-hurdle with a conditional statement,

if (y[n] == 0) {
  // Contributions only from inflated component
  target += log(lambda);
} else {
  // Contribution only from the baseline component
  real lpdf = baseline_lpdf(y[n] | theta);
  target += log(1 - lambda) + lpdf;
}

Because each branch considers distinct observations (y[n] == 0 for the first branch and y[n] != 0 for the second branch) we can loop over them independently. In particular because the contribution from each zero observation is the same we can write the first branch as

target += N_zero * log(lambda);

where N_zero is the number of zero observations.

Likewise we can simplify the second branch to

target += (N - N_zero) * log(1 - lambda);

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}

The entire model is then

target += N_zero * log(lambda);
target += (N - N_zero) * log(1 - lambda);

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}

Now those first two expressions simplify:

target += N_zero * log(lambda);
target += (N - N_zero) * log(1 - lambda);

is equivalent to

target += N_zero * log(lambda) + (N - N_zero) * log(1 - lambda);

which is equivalent to

target += log(lambda^N_zero * (1 - lambda)^(N - N_zero));

or, up to an additive constant that does not depend on lambda,

target += binomial_lpmf(N_zero | N, lambda);

In other words because y = 0 and y != 0 are modeled by separate data generating processes the entire model decomposes into a binomial model for the total number of inflated observations and a baseline model for the non-zero observations.

Again if you’re careful with the math then there is no difference between how one would implement a hurdle model or a continuous-inflated-by-zero model in Stan. What I’m referring to is that one can model everything

target += binomial_lpmf(N_zero | N, lambda);

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}

or just the zero observations

target += binomial_lpmf(N_zero | N, lambda);

or just the non-zero observations

for (n in 1:N) {
  if (y[n] != 0) {
    real lpdf = baseline_lpdf(y[n] | theta);
    target += lpdf;
  }
}
7 Likes

What great answers from @betanalpha and @nhuurre! I just wanted to chime in with a few details that might save you some headache depending on how you are thinking about your own model:

  1. The simplifications of the likelihood from @betanalpha’s post above will improve computational efficiency and should be used if possible. However, they are not applicable in all cases. Namely, if you want to model lambda as varying across observations depending on covariates, then you cannot convert the Bernoulli likelihood to a binomial likelihood.
  2. When we say that the models can be treated as independent, we really mean that the data model for y == 0 conditioned on lambda is independent of the model for y != 0 conditioned on theta. If you want, it’s still fine to model non-independence between lambda and theta hierarchically, for example via correlated (i.e. multivariate) random effects (see the sketch after this list).
  3. As @nhuurre indicates above, there are some scenarios where you might worry that the “continuous” distribution is yielding zeros due to the limited precision of the measurement apparatus, and there is also a solution for this. In practice it might not matter much to your inference except in the important case where the hurdle zeros are supposed to represent a fundamentally different data-generating process, and you desire inference specifically on one or the other process in isolation. In this case you might obtain biased inference if a nontrivial fraction of the zeros are actually from the discretized continuous distribution. If you just care about predictive performance, rather than parsing the two processes then this issue is less likely to pose problems, though if the discretization is severe enough it will eventually manifest as poor fit in the continuous distribution.
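
To illustrate point 2, here is a minimal sketch of how correlated group-level effects could tie the zero probability and the baseline mean together while the likelihood still factorizes into the two parts; the grouping structure, variable names, and priors are all hypothetical:

data {
  int<lower=1> N;
  int<lower=1> J;                          // number of groups
  array[N] int<lower=1, upper=J> group;    // group index for each observation
  vector[N] y;                             // zeros and non-zero continuous values
}
parameters {
  real alpha_zi;                           // population-level logit-probability of a zero
  real alpha_mu;                           // population-level mean of the non-zero part
  real<lower=0> sigma;                     // residual scale of the non-zero part
  vector<lower=0>[2] tau;                  // group-level standard deviations
  cholesky_factor_corr[2] L;               // correlation between the two group effects
  matrix[2, J] z;                          // standardized group effects
}
transformed parameters {
  // row 1: effect on the zero probability, row 2: effect on the mean
  matrix[2, J] u = diag_pre_multiply(tau, L) * z;
}
model {
  alpha_zi ~ normal(0, 1);
  alpha_mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  tau ~ normal(0, 1);
  L ~ lkj_corr_cholesky(2);
  to_vector(z) ~ std_normal();
  for (n in 1:N) {
    real logit_theta = alpha_zi + u[1, group[n]];
    if (y[n] == 0) {
      target += bernoulli_logit_lpmf(1 | logit_theta);
    } else {
      target += bernoulli_logit_lpmf(0 | logit_theta)
                + normal_lpdf(y[n] | alpha_mu + u[2, group[n]], sigma);
    }
  }
}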
2 Likes

To be clear, in this case there is a similar decomposition even after covariates are introduced; only instead of having independent Bernoulli and baseline models one gets a logistic regression along with the baseline model.

2 Likes

Hi, would you please explain how we should do prediction over test data using these two-part models? Of course from posterior samples we can find y^* = lambda * 0 + (1 - lambda) * baseline(theta), but is that all, really? I think I am missing something. Sorry if the question is pretty basic.

Using the covariates* associated with each test data point t we can compute posterior distributions for \lambda_t and \theta_t. Samples from the posterior predictive distribution are then obtained as follows:
For each posterior iteration i, draw a zero with probability \lambda_{t_{i}}. Otherwise draw a sample from baseline(\theta_{t_i}).

Note that this is quite different from the expression in your post, which will never yield a zero (it simplifies to (1-\lambda)baseline(\theta)).

*If there are no covariates, \lambda and \theta will be parameters in the model, and then the posterior for \lambda_t and \theta_t will just be the posteriors for those parameters.
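
In Stan, that procedure might be implemented in the generated quantities block along the following lines; this is only a sketch that assumes the zero probability comes from a logistic regression and the baseline is normal, and N_test, X_test, beta_zi, beta, and sigma are hypothetical names that would have to be defined elsewhere in the same program:

generated quantities {
  vector[N_test] y_pred;
  for (t in 1:N_test) {
    // posterior draw of the zero probability for test point t
    real lambda_t = inv_logit(X_test[t] * beta_zi);
    if (bernoulli_rng(lambda_t)) {
      y_pred[t] = 0;                                     // zero component
    } else {
      y_pred[t] = normal_rng(X_test[t] * beta, sigma);   // baseline component
    }
  }
}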

1 Like

As @jsocolar notes, prediction follows by sampling from the mixture components. With probability \lambda you’d pick the zero component, which always returns zero, and with probability 1 - \lambda you’d pick the non-zero component, which would return a sample from the baseline model \pi_{\text{baseline}}(\theta).

For a concrete example see “stan_programs/simu_bayesian_ensemble2.stan” in Section 4.2 of Towards A Principled Bayesian Workflow. Note that the components are switched there, so that the zero component is sampled with probability 1 - \lambda and not \lambda.

1 Like

Thanks for sharing your very detailed example. I have a question, though. It seems to me that this model completely ignores the feature set of the items with y = 0, and so it works poorly in prediction. In my setting, I have N items, each of which comes with D features and a representative_weight. Around 98% of the items are mapped to y = 0 and the remaining 2% are mapped to a real non-negative value. Based on this thread and the example you kindly shared, I came up with the following model:

    data {
      int<lower=1> N; // number of observations, N = N_zero + N_non_zero
      int<lower=1> N_pred; // number of test points to be predicted
      int<lower=0> N_zero; 
      int<lower=0> N_non_zero; 
      int<lower=1> I; // number of predictors
      matrix[N_non_zero,I] X; // matrix of predictors
      vector[N_non_zero] y; // scaled normalized y values (for those with y!=0)
      
      matrix[N_pred,I] X_test; // matrix of predictors
    }


    transformed data {
      matrix[N_non_zero,I+1] X_intercept = append_col(rep_vector(1, N_non_zero), X);
      matrix[N_pred,I+1] X_test_intercept = append_col(rep_vector(1, N_pred), X_test);
    }

    parameters {
      vector[I+1] beta_coeff; // intercept and predictor coefficients
      real<lower=0> sigma; // scale parameter
      real<lower=0, upper=1> theta; // probability of a zero; must lie in [0, 1] for the beta prior and bernoulli_rng
    }
    
    model {
      beta_coeff ~ normal(0, 1);
      sigma ~ normal(0, 1);
      theta ~ beta(1, 1);
     
      target += (N_zero * log(theta));
     
     for(n in 1:(N_non_zero)) {
        target += normal_lpdf(y[n] | X_intercept[n] * beta_coeff, sigma);
        target += log(1- theta);
     }
    }
    
    generated quantities {
      vector[N_pred] y_pred= rep_vector(0,N_pred);
      for (n in 1:N_pred)
        if (!bernoulli_rng(theta))
          y_pred[n] = normal_rng(X_test_intercept[n] * beta_coeff, sigma);
   }

This model works very poorly on the held-out data, especially for the minority class (non-zero outputs), which is super important in my use case. I suspect it is because the feature set of the items with y = 0 is ignored in training.

I would really appreciate it if you could help me understand the part that I am missing. Thanks

1 Like

There is nothing preventing you from modeling observation-specific theta based on covariates.
One fairly standard approach would be to use a logit-scale linear predictor.

Assuming you want to use the same feature set to predict extra zeros as you use to predict the mean conditional on a nonzero, your model could look something like (this is written to be intelligible, and not optimized for efficiency!):

data {
      int<lower=1> N; // number of observations, N = N_zero + N_non_zero
      int<lower=1> I; // number of predictors
      matrix[N, I] X; // matrix of predictors
      vector[N] y; // scaled normalized y values 
    }

    transformed data {
      matrix[N,I+1] X_intercept = append_col(rep_vector(1, N), X);
    }

    parameters {
      vector[I+1] beta_coeff; // intercept and predictor coefficients
      vector[I+1] beta_coeff_zi; // coefficients for the logit-scale zero-inflation probability
      real<lower=0> sigma; // scale parameter
    }
    
    model {
      beta_coeff ~ normal(0, 1);
     // add an appropriate prior on beta_coeff_zi
      sigma ~ normal(0, 1);
     
     for(n in 1:N) {
        if (y[n] == 0) {
           target += bernoulli_logit_lpmf(0 | X_intercept[n] * beta_coeff_zi);
        } else {
           target += bernoulli_logit_lpmf(1 | X_intercept[n] * beta_coeff_zi)
                     + normal_lpdf(y[n] | X_intercept[n] * beta_coeff, sigma);
        }
     }
    }

For a nice treatment of such models using Stan, see the zero-inflated models section of the brms vignette on “distributional regression” here: Estimating Distributional Models with brms

And check out the Stan code that brms generates for these models.

On the other hand, if, like in @betanalpha’s example, there is a robust conceptual model of the system that leads you to believe that the source of zeros is orthogonal to the conditions that are being tested (e.g. if they arise from a detector that fails at random), then attempting to predict the excess zeros based on covariates will not be useful in practice.

7 Likes

Wanted to thank you @jsocolar for helping me understand the problem better, for the code snippet and the helpful link. Really appreciate it.

2 Likes

I was just providing an example of how to simulate from a mixture model. As @jsocolar notes, if you want the mixture probability itself to depend on the covariates then you’ll need a slightly more sophisticated model.

2 Likes

@betanalpha @nhuurre @jsocolar Thank you for your great discussions in this topic.

I am trying to model some data generated by some singers when they sang a couple of songs. The response variable is their singing deviation from the theoretical note that they were supposed to sing. My data looks like this:

As you can see, there is a spike at exactly zero (no deviation at all) in the original dataset (on the left), and if those values are removed we get the distribution on the right. Like @stan_beginer, at first I thought I might need to model the original data using a zero-inflated mixture model, but after reading your great discussion I now think this can be dealt with in an easier way. So please correct me if I am wrong; as far as I understand:

1- I am dealing with two independent data generating processes, one continuous and one discrete.
2- I can simply run two separate models: one without the zero values, possibly using a logistic likelihood (I am not sure which likelihood to use; the response variable and the residuals after fitting a Gaussian model are not normal), and the other on the whole dataset using a Bernoulli likelihood that treats the response variable as zero or non-zero.

I didn’t understand the necessity of fitting both models simultaneously if the data-generating processes are independent. A clarification would be highly appreciated.

1 Like