How to Compute the Expected Value in a Hurdle Model Without Extreme Sensitivity to Large Predictor Values?
Hi everyone,
I’m working with a hurdle regression model where the zero/non-zero process is modeled with a constant probability (`theta`) and the non-zero counts with a zero-truncated Poisson regression. However, I’m running into an issue where my computed expected values (`mu`) are extremely sensitive to large values of `x`, especially for observations where `y = 0`.
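For reference, the quantity I’m after is the standard hurdle expectation (with `theta` as the zero probability, matching the model block below):

$$
\Pr(y = 0) = \theta, \qquad \Pr(y = k) = (1 - \theta)\,\frac{\operatorname{Poisson}(k \mid \lambda)}{1 - e^{-\lambda}} \quad \text{for } k \ge 1,
$$

$$
\operatorname{E}[y] = (1 - \theta)\,\operatorname{E}[y \mid y > 0] = (1 - \theta)\,\frac{\lambda}{1 - e^{-\lambda}}, \qquad \lambda = e^{\beta_1 + \beta_2 x}.
$$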
Here’s a simplified version of my Stan model:
```stan
data {
  int<lower=0> N;
  array[N] int<lower=0> y;
  vector[N] x;
}
parameters {
  real<lower=0, upper=1> theta;  // probability that y is zero
  vector[2] beta;                // intercept and slope of the count part
}
model {
  beta ~ normal(0, 1);
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log(theta);
    } else {
      real alpha = beta[1] + beta[2] * x[n];
      // zero-truncated Poisson: subtract log P(y > 0) = log1m_exp(-lambda)
      target += log1m(theta) + poisson_log_lpmf(y[n] | alpha)
                - log1m_exp(-exp(alpha));
    }
  }
}
generated quantities {
  vector[N] mu;
  vector[N] non_zero_mu;
  for (n in 1:N) {
    real alpha = beta[1] + beta[2] * x[n];
    // E[y | y > 0] = lambda / (1 - exp(-lambda)), kept on the log scale
    non_zero_mu[n] = exp(alpha - log1m_exp(-exp(alpha)));
    // E[y] = P(y > 0) * E[y | y > 0]; theta is the zero probability
    mu[n] = (1 - theta) * non_zero_mu[n];
  }
  real avg_mu = mean(mu);
}
```
Here’s some test data:
```json
{
  "N": 10,
  "y": [0, 0, 0, 0, 0, 1, 1, 1, 2, 2],
  "x": [5, 8, 10, 20, 30, 1, 1, 1, 2, 2]
}
```
And here’s the key part of the output:
```
mu[4]          = 1.3e+11
mu[5]          = 1.6e+19
non_zero_mu[4] = 5.1e+11
non_zero_mu[5] = 7.8e+19
```
The issue is that `mu` is heavily affected by extreme values of `x` when `y = 0`, leading to wildly inflated expectations. I think I can see why: the zero observations contribute only `log(theta)` to the likelihood, so `beta` is informed solely by the non-zero rows, where `x` is 1 or 2; the generated quantities then extrapolate `exp(beta[1] + beta[2] * x[n])` out to `x = 20` and `x = 30`, and the exponential explodes. This makes my model unreliable for predicting out-of-sample data with a wide range of `x` values.
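A quick back-of-envelope check on the output above: $\log(7.8 \times 10^{19}) \approx 45.8$, so at $x = 30$ the fitted linear predictor $\alpha = \beta_1 + \beta_2 x$ is effectively near 46, extrapolated from non-zero data whose $x$ is at most 2.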
Questions

- Am I computing the expected value correctly in this hurdle model?
- How can I make the expected mean (`mu`) more robust to extreme values of `x`?
- Should I modify the mean calculation in `generated quantities`?
- Would a different link function or transformation be more stable?
- Should I be incorporating more information, such as hierarchical structure or priors that regularize extreme cases? (One idea along these lines is sketched below.)
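To make that last bullet concrete, here’s the kind of regularization I had in mind: a minimal sketch (not carefully tested), assuming it’s reasonable to standardize `x` and put a tighter prior on the slope. Since the zeros never inform `beta`, the prior ends up doing all the work at large `x`:

```stan
data {
  int<lower=0> N;
  array[N] int<lower=0> y;
  vector[N] x;
}
transformed data {
  // put x on unit scale so the slope prior acts on a comparable scale
  vector[N] x_std = (x - mean(x)) / sd(x);
}
parameters {
  real<lower=0, upper=1> theta;
  vector[2] beta;
}
model {
  beta[1] ~ normal(0, 1);
  beta[2] ~ normal(0, 0.5);  // tighter slope prior to limit extrapolation
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log(theta);
    } else {
      real alpha = beta[1] + beta[2] * x_std[n];
      target += log1m(theta) + poisson_log_lpmf(y[n] | alpha)
                - log1m_exp(-exp(alpha));
    }
  }
}
generated quantities {
  vector[N] mu;
  for (n in 1:N) {
    real alpha = beta[1] + beta[2] * x_std[n];
    mu[n] = (1 - theta) * exp(alpha - log1m_exp(-exp(alpha)));
  }
}
```

Would something like this be the right direction, or is there a more principled fix?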
I’d appreciate any insights or suggestions on improving robustness. Thanks in advance!