Fitting a hurdle model

I am encountering a problem when fitting a hurdle model. I use simulated data and the sign of the coefficient are switched. Any hints about what I am doing wrong are appreciated.

I use the following R code to simulate the data:

set.seed(235)

# Truncated rpois
rtpois <- function(n, l) {
  qpois(runif(n, dpois(0, l), 1), l) 
}

n <- 100
beta <- c(0.5, 0.5, 0.7) # count
gamma <- c(0, 3) # zero

X <- cbind(rep(1,n), scale(runif(n,0,100)), scale(runif(n,0,10))) 
Xb <- X %*% beta 
Zg <- X[, 1:2] %*% gamma 

theta <- 1/(1 + exp(-Zg)) 
lambda <- exp(Xb) # poisson means for modeling counts

# zeros
y <- rbinom(n, 1, theta) # generate 0/1 binomial data
which_counts <- which(y == 1)

# count
y[which_counts] <- rtpois(length(which_counts), lambda[which_counts])

dat_hurdle <- data.frame(cbind(y, X))
colnames(dat_hurdle) <- c("y", "x1", "x2", "x3")

One simulated data set is available here: test-dat.txt (3.9 KB).

Estimating parameters with the pscl package in R gives expected parameter estimates:

m1 <- hurdle(y ~ x2 + x3 | x2, data = dat_hurdle, 
                     dist = "poisson", zero.dist = "binomial")
coef(m1)
## count_(Intercept)          count_x2          count_x3  zero_(Intercept)           zero_x2 
##        0.4319570         0.5509842         0.5477326        -0.2813459         2.7722614 

I then tried to implement a hurdle model with Stan (hurdle.stan (388 Bytes)).

data {
  int<lower=0> N;     // number of observations
  int<lower=0> y[N];  // response
  matrix[N, 2] X;     // covariates
}

parameters {
  vector[3] beta;
  vector[2] gamma;
}

model {
  for (n in 1:N) {
    (y[n] == 0) ~ bernoulli_logit(gamma[1] + gamma[2] * X[n, 1]);
    if (y[n] > 0) {
      y[n] ~ poisson(exp(beta[1] + beta[2] * X[n, 1] + beta[3] * X[n, 2])) T[1,]; 
    }
  }
}

Fitting the model using rstan I get

library(rstan)
fit1 <- sampling(stan_m1, chains = 2, iter = 1.5e3, 
                 data = list(y = y, X = X[, -1], N = n))
get_posterior_mean(fit1)
##            mean-chain:1 mean-chain:2 mean-all chains
## beta[1]     0.3876424    0.3987130       0.3931777
## beta[2]     0.5759098    0.5639956       0.5699527
## beta[3]     0.5559836    0.5543759       0.5551798
## gamma[1]    0.3063394    0.3024723       0.3044059
## gamma[2]   -2.9257496   -2.9057399      -2.9157447
## lp__       14.6313376   14.9631837      14.7972606

Estimates for beta[1] to beta[3] (the count part of the models) appear reasonable. For the zero part of the model parameter estimates also seem reasonable, but with switched signs (most obviously for gamma[2], i.e. the estimate fro gamma[2] is negative when it should be positive). My guess is, that I misspecified the Stan model.

Depending on the parameterization, the hurdle parameters either indicate the probability of getting a zero or of not getting a zero. Accordingly, the signs may be switched if two packages use different parameterizations. You may also fit bayesian hurdle models with the brms R package, which uses Stan as the backend.

2 Likes

Hi Paul,

many thanks for your answer and pointing me towards the brms-package, which does everything I need.

All the best,
Johannes