Modelling continuous positive data with strong right skew AND many zeros. Log-link AND Hurdle?

I am trying to build a model for a variable like Sales, which is continuous with a strong right skew, i.e. many sales are small but a few are a couple of orders of magnitude larger. I had thought about using a Gaussian linear regression with a log link, because a simple Gaussian with an identity link might suffer from heteroskedasticity, with much larger residuals for individuals with very high sales. The issue is that of course some individuals have no sales at all (exact zeros), which causes problems: a continuous likelihood puts zero probability on exact zeros, and if I instead log-transform the response, log(0) is not defined. I want to avoid something hacky like adding a small constant to the target variable if I can.
After reading around the docs and forum a little, I found examples of ‘hurdle’ models, where the model first handles the zero vs. non-zero cases with a logistic regression, and then models the non-zero cases with a typical Gaussian linear regression. I’m having a hard time thinking about how to combine the two approaches, but in the interest of growth, I will post my attempt at the model code here; perhaps someone can tell me whether I’m on the right track and whether the model is doing what I think it’s doing. Any guidance would be greatly appreciated, particularly on interpreting the parameters etc. If I understand correctly, gamma[1] and gamma[2] would be the intercept and slope respectively for the zero/non-zero logistic regression portion of the hurdle model, and I don’t need to explicitly code the logit link (as I do with the log link for the linear regression) because it is handled by the call to bernoulli_logit().

The model code:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;                // intercept for the non-zero (log-link) part
  real beta;                 // slope for the non-zero (log-link) part
  vector[2] gamma;           // intercept and slope for the zero/non-zero (logit) part
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu = exp(alpha + x * beta);         // log link for the mean
  vector[N] theta = gamma[1] + gamma[2] * x;    // logit-scale probability of a zero
}
model {
  for (n in 1:N) {
    if (y[n] == 0) {
      1 ~ bernoulli_logit(theta[n]);            // observation is a zero
    } else {
      0 ~ bernoulli_logit(theta[n]);            // observation cleared the hurdle...
      y[n] ~ normal(mu[n], sigma);              // ...so model its value as normal
    }
  }
}
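
To spell out the likelihood I think this encodes (please correct me if I’ve misunderstood): writing pi_n = inv_logit(theta[n]) for the probability that observation n is a zero,

p(y_n) = pi_n                                        if y_n == 0
p(y_n) = (1 - pi_n) * Normal(y_n | mu[n], sigma)     if y_n != 0

with mu[n] = exp(alpha + x[n] * beta). The Bernoulli part accounts for whether the hurdle is cleared, and the normal density handles the actual value.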

Also, I know that there may be other methods for doing this with brms in R etc., but ideally I’d like to write the model in Stan code so that I can transfer to fitting it with PyStan if I need to. On that note, how would I calculate the log likelihood and posterior predictive quantities in a hurdle model? For a typical linear regression I would use normal_lpdf() and normal_rng(), but I’m unsure how to accommodate the hurdle process first.
Thanks!

EDIT: A couple of syntactical edits to the model code so it would compile.

FOLLOW UP:
I figured I would post my progress in case anyone comes across this thread later and finds it helpful. I used the following generated quantities block to try and calculate log_lik:

generated quantities {
  vector[N] log_lik;
  vector[N] y_hat;
  vector[N] resid;

  for (i in 1:N) {
    int zero;

    // pointwise log-likelihood, mirroring the model block
    if (y[i] == 0) {
      log_lik[i] = bernoulli_logit_lpmf(1 | theta[i]);
    } else {
      log_lik[i] = bernoulli_logit_lpmf(0 | theta[i])
                   + normal_lpdf(y[i] | mu[i], sigma);
    }

    // posterior predictive draw: first decide whether this draw is a zero,
    // then draw from the normal only if it clears the hurdle
    zero = bernoulli_logit_rng(theta[i]);
    if (zero == 1) {
      y_hat[i] = 0;
    } else {
      y_hat[i] = normal_rng(mu[i], sigma);
    }
    resid[i] = y[i] - y_hat[i];
  }
}

I’m not 100% sure I’m making the posterior predictive draws (y_hat) correctly, but given that the model block uses theta for the probability of a zero (somewhat confusingly, with 1 as the target for the zero case), it made sense to me to first draw a value with bernoulli_logit_rng() and set y_hat to 0 on a success. On a failure (i.e. the observation is a non-zero number), I replace it with a draw from normal_rng().

It might be better to start with brms here actually, since it allows you to quickly test different link functions and families without needing to code all of the individual models. Once you find/develop a model that fits well, you can export the relevant Stan code and data from brms using make_stancode and make_standata.

Thanks for your reply. Ideally I would write the code myself in Stan, as that way I can learn the syntax and methods better. That aside, I’m not sure how I would do this with brms. I took a look at the docs for the brmsfamily function and couldn’t see a hurdle Gaussian with a log link listed. There is a hurdle_lognormal family available though; do you think I should use that rather than a Gaussian with a log link? Do you mind giving me an example call to the brm function?
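
For what it’s worth, my (possibly wrong) reading of hurdle_lognormal is that it would replace the normal-with-log-link part of my model with a lognormal on the non-zero values, i.e. something like:

// sketch of the non-zero branch under a lognormal, as I understand it;
// the linear predictor is the location of log(y), rather than being
// passed through exp() to give the mean of a normal
y[n] ~ lognormal(alpha + x[n] * beta, sigma);

so the coefficients would act multiplicatively on the median of y (the median of a lognormal is exp of its location), and sigma would be on the log scale. Happy to be corrected on that.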

That would be helpful, but if you are able to give feedback on the model above, I’d really appreciate it.

Sure, an example brm call would be:

brm(y ~ x1 + x2 + x3, data=data, family=hurdle_gamma())

As for your model above, everything looks correct to me. How are the posterior predictive checks?

Oh, so you think I should model these data as a gamma instead?

I’m still performing the posterior predictive checks. I work in a field where there are a lot of ‘immeasurables’, so I don’t expect the model to perform too well (generally, if you can train a linear model with an R-squared above 0.4, you’re doing alright). But if I can get a posterior for y_hat that looks vaguely like the data, I’ll be pleased!

That was more just an example of how to call the hurdle families. But in general, yes, the Gamma distribution tends to fit positively skewed data better than a Gaussian (speaking very generally). That’s why it’s a good idea to try out a few different families/links and compare performance.
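
If you do want to stay in raw Stan, swapping your hurdle to a Gamma likelihood is a small change: only the non-zero branch of the model block needs to move, and you can keep the log link for the mean. A minimal sketch, assuming you add a shape parameter (I’ve called it k below; it isn’t in your current code and would be declared as real<lower=0>):

// replaces y[n] ~ normal(mu[n], sigma); in the non-zero branch.
// Stan's gamma() takes shape and rate, and mean = shape / rate,
// so rate = k / mu[n] keeps the mean at exp(alpha + x[n] * beta)
y[n] ~ gamma(k, k / mu[n]);

The log_lik and y_hat lines in your generated quantities would change the same way, via gamma_lpdf() and gamma_rng().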

Ok Andrew, thanks for the suggestion to try a Gamma model. I’ll give it a go and check the fit.