Best way to calculate the mean in a regression hurdle model

How to Compute Expected Value in a Hurdle Model Without Extreme Sensitivity to Outliers?

Hi everyone,

I’m working with a regression hurdle model where the zero/non-zero process is modeled using a constant probability (theta), and the non-zero counts are modeled using a Poisson regression. However, I’m running into an issue where my computed expected values (mu) are extremely sensitive to large values of x, especially when y = 0.

Here’s a simplified version of my Stan model:

data {
    int<lower=0> N;
    array[N] int<lower=0> y;
    vector[N] x;
}
parameters {
    real<lower=0, upper=1> theta;
    vector[2] beta;
}
model {
    beta ~ normal(0, 1);
    for (n in 1:N) {
        if (y[n] == 0) {
            // hurdle not cleared: theta is the probability of a zero
            target += log(theta);
        } else {
            real alpha = beta[1] + beta[2] * x[n];
            // hurdle cleared (prob 1 - theta) times the zero-truncated Poisson lpmf
            target += log1m(theta) + poisson_log_lpmf(y[n] | alpha) - log1m_exp(-exp(alpha));
        }
    }
}
generated quantities {
    vector[N] mu;
    vector[N] non_zero_mu;
    for (n in 1:N) {
        real alpha = beta[1] + beta[2] * x[n];
        // mean of the zero-truncated Poisson: exp(alpha) / (1 - exp(-exp(alpha)))
        non_zero_mu[n] = exp(alpha - log1m_exp(-exp(alpha)));
        mu[n] = theta * non_zero_mu[n];
    }
    real avg_mu = mean(mu);
}

Here’s some test data:

{
    "N": 10,
    "y": [ 0, 0, 0, 0, 0, 1, 1, 1, 2, 2 ],
    "x": [ 5, 8, 10, 20, 30, 1, 1, 1, 2, 2 ]
}

And here’s the key part of the output:

mu[4]  = 1.3e+11
mu[5]  = 1.6e+19
non_zero_mu[4] = 5.1e+11
non_zero_mu[5] = 7.8e+19

The issue is that mu is heavily affected by extreme values of x when y = 0, leading to wildly inflated expectations. This makes my model unreliable for predicting out-of-sample data with a wide range of x values.

Questions

  1. Am I computing the expected value correctly in this hurdle model?
  2. How can I make the expected mean (mu) more robust to extreme values of x?
  • Should I modify the mean calculation in generated quantities?
  • Would a different link function or transformation be more stable?
  • Should I be incorporating more information, such as hierarchical structure or priors that regularize extreme cases?

I’d appreciate any insights or suggestions on improving robustness. Thanks in advance!

I think you are computing the expected value correctly.

I think the problem is that your test data are very unlikely under your assumed model, so you have a data-model mismatch.

The biggest mismatch seems to be that the hurdle probability is independent of x, but in your data only the low-x observations pass the hurdle; under the model, the chance that the 5 observations clearing the hurdle are exactly the 5 with the lowest x is \frac{(5!)^2}{10!} = 1/\binom{10}{5} \simeq 0.004.

Then when the hurdle is cleared, you have y = x, which is once again highly unlikely - there should be more noise across observations with the same x.

This strong correlation between y and x should make the maximum likelihood estimate something close to beta[1] = -0.7, beta[2] = 0.7 (a back-of-the-napkin calculation), leading to pretty high non_zero_mu when x is large (e.g. x = 30 gives non_zero_mu ≈ 6.5e8). Since low values of beta reduce the expectation less than high values of beta increase it, accounting for uncertainty should push the expectation substantially above the maximum likelihood estimate (as you observe).
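Spelling out that back-of-the-napkin step (a rough check that ignores the zero-truncation correction): the non-zero observations have y = x with x \in \{1, 2\}, so matching the log means gives

\log 1 = \beta_1 + \beta_2, \qquad \log 2 = \beta_1 + 2\beta_2 \;\Rightarrow\; \beta_2 = \log 2 \approx 0.7, \; \beta_1 = -\log 2 \approx -0.7,

and then at x = 30 the linear predictor is roughly -0.7 + 0.7 \times 30 = 20.3, so non_zero_mu \approx \exp(20.3) \approx 6.5 \times 10^8.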

Finally your prior is actually pretty wide given the scale of x - you say that a priori having both beta[1] and beta[2] > 1 is quite possible, but that means that the expected value when the hurdle is passed and x = 30 can a priori easily be > \exp(31) \simeq 3\times 10^{13}. So the prior does not regularize your inferences at all.

In other words, the model is doing what it is expected to do - in the absence of data for the case when the hurdle is cleared and x is large, it extrapolates the trend observed when x is low. That extrapolation is terrible because a) extrapolation is always hard and b) extrapolating a linear trend way beyond the observed data is usually not a good idea.

A simple fix would be to avoid making predictions for any x that is way out of range of the observed x for which the hurdle was passed. You could also fit a more flexible relationship than linear between x and the log mean of y (spline, Gaussian process, …). For datasets like the one you showed, this will however only give you ridiculously wide predictive intervals for x where you don’t have good data on y — as it should — so the result will likely be functionally the same as refusing to extrapolate.

Does that make sense?

What do you mean by calculating the mean in a regression hurdle model?

Shouldn’t this just be poisson_log_lpmf(0 | alpha) to remove the probability mass at 0 from the Poisson component? Maybe it works out to the same thing?

mu[n] = theta * non_zero_mu[n];

Shouldn’t that be (1 - theta) to match that theta is the probability of a zero value?

generated quantities {
  real avg_mu;
  {
    vector[N] alpha = beta[1] + beta[2] * x;
    avg_mu = (1 - theta) * mean(exp(alpha - log1m_exp(-exp(alpha))));
  }
}

This is assuming that the expression you gave was the hurdle probability for a value > 0.
This may be numerically unstable—double exponents are nasty. So you might need to do this on the log scale using log_sum_exp.
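Something like this, as a minimal sketch assuming the same data and parameters (theta, beta, x, N) as in the original model; the final exp can still overflow, but the log-scale quantity itself stays finite:

generated quantities {
  real log_avg_mu;
  real avg_mu;
  {
    vector[N] alpha = beta[1] + beta[2] * x;
    // log of each observation's zero-truncated Poisson mean
    vector[N] log_non_zero_mu = alpha - log1m_exp(-exp(alpha));
    // log((1 - theta) * mean(exp(log_non_zero_mu))), computed stably
    log_avg_mu = log1m(theta) + log_sum_exp(log_non_zero_mu) - log(N);
    avg_mu = exp(log_avg_mu);
  }
}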

In this toy example I’d just like to see the expected value when I plug in the predictors from the training data. In general, I’d like to see the expected value for any representative set of predictors.

Maybe it works out to the same thing?

Yes, I think it does
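For the record, the two forms agree because with the log link the Poisson rate is \lambda = \exp(\alpha), so \log \text{Poisson}(0 \mid \lambda) = -\lambda = -\exp(\alpha), and therefore log1m_exp(poisson_log_lpmf(0 | alpha)) and log1m_exp(-exp(alpha)) are both \log(1 - e^{-\exp(\alpha)}).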

Shouldn’t that be (1 - theta) to match that theta is the probability of a zero value?

Yes, good catch.

A simple fix would be to avoid making predictions for any x that is way out of range of the observed x for which the hurdle was passed.

I’m not sure how to do this in general, especially when I try to plug in data outside of the training set.

You could also fit a more flexible relationship than linear between x and the log mean of y (spline, Gaussian process, …). For datasets like the one you showed, this will however only give you ridiculously wide predictive intervals for x where you don’t have good data on y — as it should — so the result will likely be functionally the same as refusing to extrapolate.

Interesting idea. Let me think about this a little while longer.

I think this should really be quite simple. Upon training, you store min(x) and max(x); then, when predicting, you output NA whenever, say, new_x < 0.9 * min(x) or new_x > 1.1 * max(x). Or - if refusing to predict is not an option - you at least flag the output as extremely unreliable, because it will be very unreliable and you cannot really do anything about it. Unless you are dealing with some very well-behaved system with known laws (e.g. planetary orbits), reliably extrapolating much beyond the observed data is close to impossible.
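In Stan terms, a minimal sketch of that idea could look like the following; N_new and x_new are hypothetical prediction inputs not in your original model, and the 0.9/1.1 factors assume x is positive so they widen the range in the right direction:

data {
  int<lower=0> N;
  array[N] int<lower=0> y;
  vector[N] x;
  int<lower=0> N_new;    // hypothetical: number of prediction points
  vector[N_new] x_new;   // hypothetical: predictors to predict for
}
transformed data {
  real min_x = min(x);   // range of x seen during training
  real max_x = max(x);
}
parameters {
  real<lower=0, upper=1> theta;
  vector[2] beta;
}
// model block unchanged from the original post
generated quantities {
  vector[N_new] mu_new;
  for (n in 1:N_new) {
    if (x_new[n] < 0.9 * min_x || x_new[n] > 1.1 * max_x) {
      mu_new[n] = not_a_number();  // refuse to extrapolate far outside the training range
    } else {
      real alpha = beta[1] + beta[2] * x_new[n];
      // (1 - theta) times the zero-truncated Poisson mean
      mu_new[n] = (1 - theta) * exp(alpha - log1m_exp(-exp(alpha)));
    }
  }
}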

Here’s my favourite example of a nicely fitting model producing ridiculous extrapolation:

[Figure: a linear regression fit to the women's 400 m world record progression, extrapolated far into the future (data from Women's 400 metres world record progression - Wikipedia)]

It’s even more ridiculous in that linear regression doesn’t respect the problem bounds—if you keep going you’ll get negative values in 2280 or so.

1 Like

Fair, though the model fit is basically identical on the log scale (and predicting 10 seconds in the 2300s is a little bit better than predicting negative values, but still ridiculous)

Thank you so much for the feedback!

Finally your prior is actually pretty wide given the scale of x - you say that a priori having both beta[1] and beta[2] > 1 is quite possible, but that means that the expected value when the hurdle is passed and x = 30 can a priori easily be > \exp(31) \simeq 3\times 10^{13}. So the prior does not regularize your inferences at all.

It turns out that this was the core issue. I had thought that normalizing my predictors was enough. However, even when your predictors have an L2 norm of 1, they can still contain extreme individual values that interact poorly with N(0, 1) priors.

I solved this by adjusting the scale of the prior based on the L^\infty norm of the predictor.
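Roughly, something like this (a sketch of the idea; abs() applied to a vector assumes a reasonably recent Stan version):

transformed data {
    real x_inf = max(abs(x));  // L-infinity norm of the predictor
}
model {
    beta[1] ~ normal(0, 1);
    beta[2] ~ normal(0, 1 / x_inf);  // keeps beta[2] * x on roughly unit scale a priori
    // likelihood as before
}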

I’m curious how other people solve this problem. We’re using QR decomposition, but I haven’t seen much discussion of this particular issue.