Zero-inflated negative and positive data - Zero-Inflated Gaussian?

Hello all -

Apologies for the novel-length question. I’m relatively new to Bayesian modeling and very new to Stan. (Before this I used pymc3.)

I have an interesting data set that I am struggling to model, since it doesn’t seem to fit neatly within any of the standard models.

The data set measures (integer) fluctuations from an expectation, so it can be both positive and negative. However, the measuring mechanism also fails with some probability, so there is an excess of zeroes. This seems like a classic “zero-inflation” problem, in that there are two ways to generate a zero:

  1. There was no fluctuation from the expected value.
  2. There was a fluctuation, but we did not observe it because the measuring mechanism failed and registered a zero instead.

In this data, there is no way to distinguish these two types of zeroes. But this isn’t non-negative count data, so the standard zero-inflated Poisson model doesn’t apply.

I’ve thought of two ways to potentially model this data, and was hoping to get some advice from the experts about whether they are methodologically sound.


Option 1: The "Zero-Inflated Gaussian"

Here I’ve switched the Poisson distribution in the zero-inflated Poisson model for a normal distribution:

Pr(y = 0 | p, \mu, \sigma) = p + (1-p)\times Pr(0 | \mu, \sigma)

Pr(y | p, \mu, \sigma) = (1-p)\times Pr(y | \mu, \sigma) \quad \text{for } y \neq 0

Where Pr(y | \mu, \sigma) is the Gaussian PDF instead of the Poisson PMF and p is the probability that my measuring mechanism has failed.

The code below seems to work with my simulated data, but I’m wondering if it’s valid to blend a discrete and continuous distribution or if I’m doing something mathematically shady here.

data{
    int y[300];
}
parameters{
    real ap;
    real am;
    real<lower=0> as;
}
model{
    real p = inv_logit(ap);   // probability the measuring mechanism failed
    real mu = am;
    real sigma = exp(as);
    as ~ exponential( 10 );
    am ~ normal( 30 , 10 );
    ap ~ normal( -1.5 , 1 );
    for ( i in 1:300 ) {
        if ( y[i] == 0 )
            target += log_mix(p, 0, normal_lpdf(0 | mu, sigma));
        else
            target += log1m(p) + normal_lpdf(y[i] | mu, sigma);
    }
}

Option 2: The “Bi-Directional” Zero-Inflated Poisson

Here I am making the assumption that the process that causes negative fluctuations is different than the process that causes positive fluctuations. (Which is actually plausible for my data.)

Pr(0 | p, \lambda_+, \lambda_-) = p + (1-p)\times \left[\nu\times Pr(0 | \lambda_-) + (1-\nu)\times Pr(0|\lambda_+)\right]

Pr(y < 0 | p, \lambda_+, \lambda_-) = (1-p)*\nu*Pr(-y|\lambda_-)

Pr(y > 0 | p, \lambda_+, \lambda_-) = (1-p)*(1-\nu)*Pr(y|\lambda_+)

where p is the probability that my measuring mechanism has failed, \nu is the probability that the “negative fluctuation process” will occur, and Pr(y|\lambda) is the Poisson PMF.
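As a sanity check on the algebra, the three cases above should sum to one over all integers. A quick Python sketch (the values for p, \nu, and the two \lambda's here are arbitrary, just for the check) confirms this numerically:

```python
# Numeric sanity check: the three cases of the "bi-directional"
# zero-inflated Poisson should form a proper PMF.
# Parameter values below are arbitrary, for illustration only.
import math

def pois_pmf(k, lam):
    # Poisson PMF computed on the log scale for stability
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

p, nu, lam_m, lam_p = 0.15, 0.4, 3.0, 5.0

def pmf(y):
    if y == 0:
        return p + (1 - p) * (nu * pois_pmf(0, lam_m)
                              + (1 - nu) * pois_pmf(0, lam_p))
    if y < 0:
        return (1 - p) * nu * pois_pmf(-y, lam_m)
    return (1 - p) * (1 - nu) * pois_pmf(y, lam_p)

total = sum(pmf(y) for y in range(-100, 101))
print(total)  # should be ~1
```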

If this is the right direction to go, I’m not sure how to code this in Stan. Can I put a log_mix inside of a log_mix?


Does either of these approaches ring alarm bells? Any gotchas, or areas where my assumptions will break down? Potential for spectacular failure?

Any thoughts or advice people have for me are much appreciated.

Thank you!


Mixing discrete and continuous is indeed a problem and I doubt the model you coded is able to recover p consistently. Try fitting simulated data with different mu and sigma.
How is your simulation getting integers out of a Gaussian, anyway?
If the idea is simply to round to the nearest integer, then the effective PMF comes from the Gaussian CDF, for example:

for ( i in 1:300 ) {
  real lpmf = log_diff_exp(
                normal_lcdf(y[i]+0.5|mu,sigma),
                normal_lcdf(y[i]-0.5|mu,sigma));
  if ( y[i] == 0 ) {
    target += log_mix(p, 0, lpmf);
  } else {
    target += log1m(p) + lpmf;
  }
}

The CDF-derived PMF avoids the problem of mixing continuous and discrete distributions.
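To see that this really is a proper PMF, here's a quick stdlib-Python check (mu and sigma are arbitrary) that the interval probabilities Phi(y+0.5) - Phi(y-0.5) sum to one over the integers:

```python
# Check that the rounded-Gaussian PMF built from CDF differences
# sums to one over the integers. mu and sigma are arbitrary.
import math

def Phi(x, mu, sigma):
    # normal CDF via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 1.7, 2.3

def rounded_gauss_pmf(y):
    # P(round(N(mu, sigma)) == y) = Phi(y + 0.5) - Phi(y - 0.5)
    return Phi(y + 0.5, mu, sigma) - Phi(y - 0.5, mu, sigma)

total = sum(rounded_gauss_pmf(y) for y in range(-50, 51))
print(total)  # should be ~1
```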

Two Poissons looks more reasonable. Yes, you can put log_mix inside log_mix.

for ( i in 1:300 ) {
  if ( y[i] == 0 ) {
    real lm = log_mix(nu,
                poisson_lpmf(0 | lambda_minus),
                poisson_lpmf(0 | lambda_plus));
    target += log_mix(p, 0.0, lm);
  } else if ( y[i] < 0 ) {
    target += log1m(p) + log(nu)
             + poisson_lpmf(-y[i] | lambda_minus);
  } else {
    target += log1m(p) + log1m(nu)
             + poisson_lpmf(y[i] | lambda_plus);
  }
}

Although this assumes the two fluctuation processes are mutually exclusive.

So here’s a third option: what about a zero-inflated Skellam distribution? It’s not native in Stan, but it can be implemented as a user-defined function.

functions {
  real skellam_lpmf(int y, real mu1, real mu2) {
    return -mu1 - mu2 + 0.5 * y * log(mu1 / mu2)
             + log(modified_bessel_first_kind(abs(y), 2 * sqrt(mu1 * mu2)));
  }
}
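If you want to check the formula before trusting it, here is a quick stdlib-Python sketch comparing it against the definition of the Skellam distribution as the difference of two independent Poissons (parameter values are arbitrary):

```python
# Verify the Skellam PMF
#   exp(-(mu1+mu2)) * (mu1/mu2)^(y/2) * I_|y|(2*sqrt(mu1*mu2))
# against the convolution P(N1 - N2 = y). Stdlib Python only.
import math

def pois_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def bessel_i(n, z, terms=60):
    # Modified Bessel function of the first kind, integer order n >= 0,
    # via its power series, computed on the log scale term by term.
    return sum(math.exp((2 * k + n) * math.log(z / 2.0)
                        - math.lgamma(k + 1) - math.lgamma(k + n + 1))
               for k in range(terms))

def skellam_pmf(y, mu1, mu2):
    return math.exp(-mu1 - mu2) * (mu1 / mu2) ** (y / 2.0) \
           * bessel_i(abs(y), 2.0 * math.sqrt(mu1 * mu2))

mu1, mu2 = 4.0, 2.5
max_err = 0.0
for y in range(-8, 9):
    # Convolution: P(N1 - N2 = y) = sum_k P(N1 = k) P(N2 = k - y)
    conv = sum(pois_pmf(k, mu1) * pois_pmf(k - y, mu2)
               for k in range(max(y, 0), 120))
    max_err = max(max_err, abs(conv - skellam_pmf(y, mu1, mu2)))
print(max_err)  # tiny: formula matches the convolution
```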

Hello and thank you! All of your suggestions and ideas are great.

To follow up on a few questions and comments:

  1. You’re right that my simulation wasn’t producing integers directly; I was rounding the Gaussian draws to the nearest integer, which I think is okay given the context of my data. If I understand correctly, the log_diff_exp call builds a discrete probability mass function out of the continuous PDF. The resulting lpmf is then discrete, like a Poisson, and can be substituted into the zero-inflation equations. Do you use log_diff_exp instead of taking a simple difference in order to deal with rounding and very small/large numbers?

  2. Thank you for the Stan example! That’s enough to get me started, if I go this route. As you note, it assumes the two fluctuation processes are mutually exclusive, which they are not. (Maybe if you squint, but in reality, I think the answer is no.)

  3. Somehow, in all my Googling, I did not encounter the Skellam distribution! I’ll need to do a bit more research, but this seems potentially like exactly what I was looking for!

I’ll try to follow up when I’ve had some time to run a few more simulations (probably not tonight), but I just want to say thank you! This has been so helpful.

Yes. Initialization may fail if the calculation is not numerically stable.
It may be even better to use CCDF in the right tail.

if (y[i] < mu) {
  ... = log_diff_exp(normal_lcdf(y[i]+0.5|mu,sigma),
                     normal_lcdf(y[i]-0.5|mu,sigma));
} else {
  ... = log_diff_exp(normal_lccdf(y[i]-0.5|mu,sigma),
                     normal_lccdf(y[i]+0.5|mu,sigma));
}