I am trying to build a model for a variable like Sales, which is continuous with a strong right skew: many sales are small, but a few are a couple of orders of magnitude larger. I had thought about using a Gaussian linear regression with a log link, because a simple Gaussian with an identity link might suffer from heteroskedasticity, with residuals much larger for individuals with very high sales. The issue is that some individuals have no sales at all (zero values), which causes problems because log(0) is undefined. I'd like to avoid something hacky like adding a small constant to the target variable if I can.
After reading around the docs and forum a little, I found examples of 'hurdle' models, where the model first handles the zero vs. non-zero distinction, and then models the non-zero cases with a typical Gaussian linear regression. I'm having a hard time thinking about how to combine the two approaches, so in the interest of growth I'll post my attempt at the model code here; perhaps someone can tell me whether I'm on the right track and whether the model is doing what I think it's doing. Any guidance would be greatly appreciated, particularly on interpreting the parameters etc. If I understand correctly, gamma[1] and gamma[2] are the intercept and slope respectively for the zero/non-zero logistic regression portion of the hurdle model, and I don't need to code the logit link explicitly (as I do with the log link for the linear regression) because it is handled by the call to bernoulli_logit().
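For my own clarity, writing out the likelihood I think this hurdle model implies (please correct me if the mixture weights are wrong):

```latex
p(y_n \mid x_n) =
\begin{cases}
\pi_n & \text{if } y_n = 0 \\
(1 - \pi_n)\,\mathcal{N}(y_n \mid \mu_n, \sigma) & \text{if } y_n \neq 0
\end{cases}
\qquad
\pi_n = \operatorname{logit}^{-1}(\gamma_1 + \gamma_2 x_n),
\quad
\mu_n = \exp(\alpha + \beta x_n)
```

so pi_n is the probability of a zero and the Gaussian part only ever sees the non-zero observations.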
The model code:
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;               // intercept for the non-zero (log-link) part
  real beta;                // slope for the non-zero part
  vector[2] gamma;          // intercept and slope for the zero/non-zero part
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu;
  vector[N] theta;
  mu = exp(alpha + x * beta);        // log link for the Gaussian mean
  theta = gamma[1] + gamma[2] * x;   // logit of Pr(y == 0)
}
model {
  for (n in 1:N) {
    if (y[n] == 0) {
      1 ~ bernoulli_logit(theta[n]);   // observation is a zero
    } else {
      0 ~ bernoulli_logit(theta[n]);   // observation cleared the hurdle
      y[n] ~ normal(mu[n], sigma);
    }
  }
}
Also, I know there may be other ways of doing this with brms in R etc., but ideally I'd like to write the model in Stan code so that I can transfer to fitting it with PyStan if I need to. And how would I calculate the log-likelihood and posterior predictive quantities for a hurdle model? For a typical linear regression it would be normal_lpdf() and normal_rng(), but I'm unsure how to accommodate the hurdle process first.
Thanks!
EDIT: A couple of syntactical edits to the model code so it would compile.
FOLLOW-UP:
I figured I would post my progress in case anyone comes across this thread later and finds it helpful. I used the following generated quantities block to calculate log_lik:
generated quantities {
  vector[N] log_lik;
  vector[N] y_hat;
  vector[N] resid;
  for (i in 1:N) {
    if (y[i] == 0) {
      log_lik[i] = bernoulli_logit_lpmf(1 | theta[i]);
    } else {
      log_lik[i] = bernoulli_logit_lpmf(0 | theta[i])
                   + normal_lpdf(y[i] | mu[i], sigma);
    }
  }
  for (i in 1:N) {
    // first draw whether this replicate is a zero ...
    if (bernoulli_logit_rng(theta[i]) == 1) {
      y_hat[i] = 0;
    } else {
      // ... and only draw from the Gaussian part if it is not
      y_hat[i] = normal_rng(mu[i], sigma);
    }
    resid[i] = y[i] - y_hat[i];
  }
}
I'm not 100% sure I'm making the posterior predictive draws (y_hat) correctly, but given that the model block uses theta as the (logit-scale) probability of a zero (somewhat confusingly, coded with 1 as the target), it made sense to me to first draw from bernoulli_logit_rng(), set y_hat to 0 on a 'success' (i.e. the observation is a zero), and otherwise replace it with a draw from normal_rng().
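As a sanity check outside Stan, here's the same draw logic in plain NumPy, with made-up parameter values standing in for a single posterior draw of theta/mu/sigma (the function name and values are mine, just for illustration). It convinced me the zeros come out at roughly the rate the logit implies:

```python
import numpy as np

rng = np.random.default_rng(0)

def hurdle_draw(theta, mu, sigma, rng):
    """One posterior predictive draw per observation.

    theta: logit of the probability that y is exactly zero
    mu, sigma: mean and sd of the non-zero (Gaussian) part
    """
    p_zero = 1.0 / (1.0 + np.exp(-theta))      # inverse logit
    is_zero = rng.random(p_zero.shape) < p_zero
    y = rng.normal(mu, sigma)                  # Gaussian part for everyone ...
    y[is_zero] = 0.0                           # ... then overwrite the zeros
    return y

# made-up values: logit(0.5) = 0, so about half the draws should be zero
theta = np.full(100_000, 0.0)
mu = np.full(100_000, 10.0)
sigma = 2.0
y_hat = hurdle_draw(theta, mu, sigma, rng)
print(round((y_hat == 0).mean(), 2))  # should be close to 0.5
```

The non-zero draws should also average near mu, which matches what I see from the Stan generated quantities.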