Coding hurdle model with censored data

I’m trying to fit left-censored data with many zeros. My plan was to use a log-normal hurdle model, and in the log-normal process, I was hoping I could impute the censored observations following the manual here. No luck however in recovering the right parameters as I’m not sure how to index the hurdle part in the model block.

To test out my model, I generated positive values from a lognormal distribution. A portion of these were then removed at probability = pZero, i.e. turned into zeros. Then I censored all present values below a threshold (L).

# simulate data
N <- 1000
mu <- 0
sigma <- 1
y_raw <- rlnorm(N, mu, sd = sigma)
dens(y_raw)

# add in other processes
L <- 1 #censor cutoff--below this value, shows up as zero. 
pZero <- .5 #probability of actually being absent. 
present <- rbinom(N, 1, pZero)
y_obs <- y_raw[y_raw > L & present == 1]

# make data list for stan
dataList <- list(
  N = N, # total observations
  N_obs = length(y_obs), 
  N_cens = sum(present) - length(y_obs),
  present = present,
  y_obs = y_obs
)
str(dataList)

The Stan code I’m using is below:

data {
  int N;
  int N_obs;
  int N_cens;
  array[N] int present;
  array[N_obs] real<lower=0> y_obs;

}
parameters {
  real mu;
  real<lower=0> sigma;
  real theta;
  array[N_cens] real<lower=0> y_cens;
}
model {
  // priors
  theta ~ normal(0, 1);
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // likelihood
  for(i in 1:N){
    if(present[i] == 0){
      target += log(inv_logit(theta));
    }else{
      target += log1m(inv_logit(theta));
      y_cens ~ lognormal(mu, sigma);
      y_obs ~ lognormal(mu, sigma);
    }
  }
}

The model does run, but the posteriors for mu and sigma are really tight and not anywhere near 0 and 1 (90%CI 0.223-0.229 and 0.692-0.696 on my latest run).

Any tips on how to properly code the model block? Without censored data, I wouldn’t use the data variable present, but I couldn’t figure out a different way. Clearly mine isn’t working.

I found out that brms has the capabilities of dealing with censored data in a hurdle model, e.g. here they ran a censored gamma hurdle model. I tried it out with some toy lognormal data (see below), but I still can’t recover the parameters. Anyone have insight?


rm(list = ls())
n <- 200
mu <- 0
sigma <- 1
p <- .2 #probability of zero. 
L <- .5 #lower bound
data <- tibble(y_uncens = ifelse(rbinom(n, 1, 1-p) == 1, 
       rlnorm(n, mu, sigma),
       0)) %>% 
  mutate(y_cens = ifelse(y_uncens < L & y_uncens != 0, L, y_uncens), 
         censored = ifelse(y_uncens < L & y_uncens != 0, 'left', 'none'))
print(data, n = 20) #y_uncens = before censoring, y_cens = after censoring


# model it? 
# 1) make sure the censored model is working: non-zeros, censored data
m1 <- brm(y_cens | cens(censored) ~ 1, 
          family = lognormal(), 
          data = filter(data, y_cens > 0)) 
m1; mu; sigma # looks good


# 2) make sure hurdle model is working: hurdle model, uncensored data
m2 <- brm(y_uncens ~ 1, family = hurdle_lognormal(), data = data) 
m2; mu; sigma; p # looks good


# 3) hurdle model with censored data
m3 <- brm(y_cens | cens(censored) ~ 1, family = hurdle_lognormal(), data = data) 
m3; mu; sigma; p # doesn't recover parameters :(
# Population-Level Effects: 
#   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# Intercept     0.34      0.08     0.18     0.50 1.00     3284     2939
# 
# Family Specific Parameters: 
#   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma     0.87      0.06     0.76     1.00 1.00     3580     2839
# hu        0.30      0.04     0.24     0.38 1.00     3605     2767


EDIT: Apparently the brms model works with right-censored data, but not left-censored data. See below. @paul.buerkner any thoughts?


# lognormal, right-censored hurdle model -------------------------------

n <- 500
mu <- 0
sigma <- 1
p <- .2 #probability of zero. 
U <- 3 #lower bound
data <- tibble(y_uncens = ifelse(rbinom(n, 1, 1-p) == 1, 
                                 rlnorm(n, mu, sigma),
                                 0)) %>% 
  mutate(y_cens = ifelse(y_uncens > U, U, y_uncens), 
         censored = ifelse(y_uncens > U, 'right', 'none'))
print(data, n = 20) #y_uncens = before censoring, y_cens = after censoring

hist(data$y_uncens, breaks = n/10)
abline(v= U, col = 'red', lwd = 2)


# hurdle model with right-censored data is ok
m4 <- brm(y_cens | cens(censored) ~ 1, family = hurdle_lognormal(), data = data) 
m4; mu; sigma; p 
# Population-Level Effects: 
#   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# Intercept    -0.04      0.05    -0.14     0.06 1.00     4162     2436
# 
# Family Specific Parameters: 
#   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma     0.96      0.04     0.89     1.04 1.00     3851     2819
# hu        0.22      0.02     0.18     0.26 1.00     4193     3168
# 
# Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
# and Tail_ESS are effective sample size measures, and Rhat is the potential
# scale reduction factor on split chains (at convergence, Rhat = 1).
# [1] 0
# [1] 1
# [1] 0.2

EDIT: I played with the function below and was able to get the left-censored hurdle model to run properly. @paul.buerkner I can open up an issue on the brms github if you want me to.

real hurdle_lognormal_lcdf(real y, real mu, real sigma, real hu) {
    return bernoulli_lpmf(0 | hu) + lognormal_lcdf(y | mu, sigma);
  }

Thanks! Yes, please open an issue to suggest a fix for the issue you found.