Modelling continuous positive data with strong right skew AND many zeros. Log-link AND Hurdle?

I am trying to build a model for a variable like Sales, which is continuous with a strong right skew, i.e. many sales are small but a few are a couple of orders of magnitude larger. I had thought about using a Gaussian linear regression with a log link, because a simple Gaussian with an identity link might suffer from heteroskedasticity, with much larger residuals for individuals with very high sales. The issue is that of course some individuals have no sales at all (exact zeros), which causes problems: a continuous likelihood puts zero probability on exact zeros, and if I instead log-transform the response, log(0) is not defined. I want to avoid something hacky like adding a small constant to the target variable if I can.
After reading around the docs and forum a little, I found examples of ‘hurdle’ models, where the model first handles the zero vs. non-zero cases with a logistic regression, and then models the non-zero cases with a typical Gaussian linear regression. I’m having a hard time thinking about how to combine the two approaches, but in the interest of growth, I will post my attempt at the model code here; perhaps someone can tell me whether I’m on the right track and whether the model is doing what I think it’s doing. Any guidance would be greatly appreciated, particularly on interpreting the parameters etc. If I understand correctly, gamma[1] and gamma[2] would be the intercept and slope respectively for the zero/non-zero logistic regression portion of the hurdle model, and I don’t need to explicitly code the logit link (as I do with the log link for the linear regression) because it is handled by the call to bernoulli_logit().

The model code:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;                // intercept for the non-zero (log-link) part
  real beta;                 // slope for the non-zero (log-link) part
  vector[2] gamma;           // intercept and slope for the zero/non-zero (logit) part
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu = exp(alpha + x * beta);         // log link for the mean
  vector[N] theta = gamma[1] + gamma[2] * x;    // logit-scale probability of a zero
}
model {
  for (n in 1:N) {
    if (y[n] == 0) {
      1 ~ bernoulli_logit(theta[n]);            // observation is a zero
    } else {
      0 ~ bernoulli_logit(theta[n]);            // observation cleared the hurdle...
      y[n] ~ normal(mu[n], sigma);              // ...so model its value as normal
    }
  }
}
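
To spell out the likelihood I think this encodes (please correct me if I’ve misunderstood): writing pi_n = inv_logit(theta[n]) for the probability that observation n is a zero,

p(y_n) = pi_n                                        if y_n == 0
p(y_n) = (1 - pi_n) * Normal(y_n | mu[n], sigma)     if y_n != 0

with mu[n] = exp(alpha + x[n] * beta). The Bernoulli part accounts for whether the hurdle is cleared, and the normal density handles the actual value.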

Also, I know that there may be other methods for doing this with brms in R etc., but ideally I’d like to write the model in Stan code so that I can transfer to fitting it with PyStan if I need to. On that note, how would I calculate the log likelihood and posterior predictive quantities in a hurdle model? For a typical linear regression I would use normal_lpdf() and normal_rng(), but I’m unsure how to accommodate the hurdle process first.
Thanks!

EDIT: A couple of syntactical edits to the model code so it would compile.

FOLLOW UP:
I figured I would post my progress in case anyone comes across this thread later and finds it helpful. I used the following generated quantities block to try and calculate log_lik:

generated quantities {
  vector[N] log_lik;
  vector[N] y_hat;
  vector[N] resid;

  for (i in 1:N) {
    int zero;

    // pointwise log-likelihood, mirroring the model block
    if (y[i] == 0) {
      log_lik[i] = bernoulli_logit_lpmf(1 | theta[i]);
    } else {
      log_lik[i] = bernoulli_logit_lpmf(0 | theta[i])
                   + normal_lpdf(y[i] | mu[i], sigma);
    }

    // posterior predictive draw: first decide whether this draw is a zero,
    // then draw from the normal only if it clears the hurdle
    zero = bernoulli_logit_rng(theta[i]);
    if (zero == 1) {
      y_hat[i] = 0;
    } else {
      y_hat[i] = normal_rng(mu[i], sigma);
    }
    resid[i] = y[i] - y_hat[i];
  }
}

I’m not 100% sure I’m making the posterior predictive draws (y_hat) correctly, but given that the model block uses theta for the probability of a zero (somewhat confusingly, with 1 as the target for the zero case), it made sense to me to first draw a value with bernoulli_logit_rng() and set y_hat to 0 on a success. On a failure (i.e. the observation is a non-zero number), I replace it with a draw from normal_rng().

It might be better to start with brms here actually, since it allows you to quickly test different link functions and families without needing to code all of the individual models. Once you find/develop a model that fits well, you can export the relevant Stan code and data from brms using make_stancode and make_standata.

Thanks for your reply. Ideally I would write the code myself in Stan, as that way I can learn the syntax and methods better. That aside, I’m not sure how I would do this with brms. I took a look at the docs for the brmsfamily function and couldn’t see a hurdle Gaussian with a log link listed. There is a hurdle_lognormal family available though; do you think I should use that rather than a Gaussian with a log link? Do you mind giving me an example call to the brm function?
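
For what it’s worth, my (possibly wrong) reading of hurdle_lognormal is that it would replace the normal-with-log-link part of my model with a lognormal on the non-zero values, i.e. something like:

// sketch of the non-zero branch under a lognormal, as I understand it;
// the linear predictor is the location of log(y), rather than being
// passed through exp() to give the mean of a normal
y[n] ~ lognormal(alpha + x[n] * beta, sigma);

so the coefficients would act multiplicatively on the median of y (the median of a lognormal is exp of its location), and sigma would be on the log scale. Happy to be corrected on that.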

That would be helpful, but if you are able to give feedback on the model above, I’d really appreciate it.

Sure, an example brm call would be:

brm(y ~ x1 + x2 + x3, data=data, family=hurdle_gamma())

As for your model above, everything looks correct to me. How are the posterior predictive checks?

Oh, so you think I should model these data as a gamma instead?

I’m still performing the posterior predictive checks. I work in a field where there are a lot of ‘immeasurables’, so I don’t expect the model to perform too well (generally, if you can train a linear model with an R-squared above 0.4, you’re doing alright). But if I can get a posterior for y_hat that looks vaguely like the data, I’ll be pleased!

That was more just an example of how to call the hurdle families. But in general, yes, the Gamma distribution tends to fit positively skewed data better than a Gaussian (speaking very generally). That’s why it’s a good idea to try out a few different families/links and compare performance.
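
If you do want to stay in raw Stan, swapping your hurdle to a Gamma likelihood is a small change: only the non-zero branch of the model block needs to move, and you can keep the log link for the mean. A minimal sketch, assuming you add a shape parameter (I’ve called it k below; it isn’t in your current code and would be declared as real<lower=0>):

// replaces y[n] ~ normal(mu[n], sigma); in the non-zero branch.
// Stan's gamma() takes shape and rate, and mean = shape / rate,
// so rate = k / mu[n] keeps the mean at exp(alpha + x[n] * beta)
y[n] ~ gamma(k, k / mu[n]);

The log_lik and y_hat lines in your generated quantities would change the same way, via gamma_lpdf() and gamma_rng().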

Ok Andrew, thanks for the suggestion to try a Gamma model. I’ll give it a go and check the fit.