Hello all -
Apologies for the novel-length question. I’m relatively new to Bayesian modeling and very new to Stan. (Before this I used PyMC3.)
I have an interesting data set that I am struggling to model, since it doesn’t seem to fit neatly within any of the standard models.
The data set measures (integer) fluctuations from an expectation, so it can be both positive and negative. However, the measuring mechanism also fails with some probability, so there is an excess of zeroes. This seems like a classic “zero-inflation” problem, in that there are two ways to generate a zero:
- There was no fluctuation from the expected value.
- There was a fluctuation, but we did not observe it because the measuring mechanism failed and registered a zero instead.
In this data, there is no way to distinguish these two types of zeroes. But this isn’t non-negative count data, so the standard zero-inflated Poisson model doesn’t apply.
I’ve thought of two ways to potentially model this data, and was hoping to get some advice from the experts about whether they are methodologically sound.
Option 1: The "Zero-Inflated Gaussian"
Here I’ve switched the Poisson distribution in the zero-inflated Poisson model for a normal distribution:
Pr(y = 0 \mid p, \mu, \sigma) = p + (1-p)\times Pr(0 \mid \mu, \sigma)
Pr(y \neq 0 \mid p, \mu, \sigma) = (1-p)\times Pr(y \mid \mu, \sigma)
where Pr(y \mid \mu, \sigma) is the Gaussian PDF instead of the Poisson PMF and p is the probability that my measuring mechanism has failed.
The code below seems to work with my simulated data, but I’m wondering if it’s valid to blend a discrete and continuous distribution or if I’m doing something mathematically shady here.
data {
  int y[300];
}
parameters {
  real ap;            // logit of the failure probability p
  real am;            // mean of the fluctuation distribution
  real<lower=0> as;   // sigma = exp(as)
}
model {
  real p = inv_logit(ap);
  real mu = am;
  real sigma = exp(as);
  as ~ exponential(10);
  am ~ normal(30, 10);
  ap ~ normal(-1.5, 1);
  for (i in 1:300) {
    if (y[i] == 0)
      target += log_mix(p, 0, normal_lpdf(0 | mu, sigma));
    else
      target += log1m(p) + normal_lpdf(y[i] | mu, sigma);
  }
}
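For context, this is roughly how I generate simulated data to test against (a Python sketch; the parameter values are just placeholders, not estimates from my real data):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 300
p_fail = 0.2           # placeholder: probability the measuring mechanism fails
mu, sigma = 3.0, 5.0   # placeholder fluctuation parameters

# Integer fluctuations from the expectation (rounded normal draws)
y = np.rint(rng.normal(mu, sigma, size=N)).astype(int)

# Failed measurements register a zero, regardless of the true fluctuation
failed = rng.random(N) < p_fail
y[failed] = 0
```

The rounding step is why the data are integers even though the model treats them with a continuous density, which is part of what I’m unsure about.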
Option 2: The “Bi-Directional” Zero-Inflated Poisson
Here I am making the assumption that the process that causes negative fluctuations is different from the process that causes positive fluctuations. (Which is actually plausible for my data.)
Pr(y = 0 \mid p, \nu, \lambda_+, \lambda_-) = p + (1-p)\times\left[\nu\times Pr(0 \mid \lambda_-) + (1-\nu)\times Pr(0 \mid \lambda_+)\right]
Pr(y < 0 \mid p, \nu, \lambda_+, \lambda_-) = (1-p)\times\nu\times Pr(-y \mid \lambda_-)
Pr(y > 0 \mid p, \nu, \lambda_+, \lambda_-) = (1-p)\times(1-\nu)\times Pr(y \mid \lambda_+)
where p is the probability that my measuring mechanism has failed, \nu is the probability that the “negative fluctuation process” will occur, and Pr(y|\lambda) is the Poisson PMF.
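To sanity-check that these three cases define a proper distribution, here’s the per-observation log-likelihood written out in Python (placeholder parameter values; `logsumexp` plays the role of Stan’s `log_mix`):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import poisson

def loglik(y, p, nu, lam_neg, lam_pos):
    """Log-likelihood of one observation under the bi-directional ZIP."""
    if y == 0:
        # A zero can come from either fluctuation process...
        struct_zero = logsumexp([np.log(nu) + poisson.logpmf(0, lam_neg),
                                 np.log1p(-nu) + poisson.logpmf(0, lam_pos)])
        # ...or from the measuring mechanism failing
        return logsumexp([np.log(p), np.log1p(-p) + struct_zero])
    elif y < 0:
        return np.log1p(-p) + np.log(nu) + poisson.logpmf(-y, lam_neg)
    else:
        return np.log1p(-p) + np.log1p(-nu) + poisson.logpmf(y, lam_pos)
```

Summing exp(loglik) over all integers y does come out to 1, so at least the bookkeeping of the three cases is consistent.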
If this is the right direction to go, I’m not sure how to code this in Stan. Can I put a log_mix inside of a log_mix?
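Something like this is what I have in mind for the model block (untested, and `nu`, `lambda_neg`, `lambda_pos` are just my guesses at parameter names):

```
// Untested sketch of the bi-directional ZIP likelihood
for (i in 1:N) {
  if (y[i] == 0)
    target += log_mix(p, 0,
                      log_mix(nu,
                              poisson_lpmf(0 | lambda_neg),
                              poisson_lpmf(0 | lambda_pos)));
  else if (y[i] < 0)
    target += log1m(p) + log(nu) + poisson_lpmf(-y[i] | lambda_neg);
  else
    target += log1m(p) + log1m(nu) + poisson_lpmf(y[i] | lambda_pos);
}
```

Is nesting log_mix like this legitimate, or is there a more idiomatic way?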
Do either of these approaches ring alarm bells? Any gotchas or areas where my assumptions will break down? Potential for spectacular failure?
Any thoughts or advice people have for me are much appreciated.
Thank you!