Specifying hurdle model with generated & derived quantities

Hi. It’s my first time posting here, and I’m just getting started using Stan. If someone can help me figure this out it would be a big help. I’m working on converting NIMBLE code to Stan.

I have a fairly simple two-step hurdle model of bird abundance. Model 1 models the probability that the species occurred during that observation, so it predicts an occurrence probability for each observation which I (stochastically) convert to a binary (0,1) estimate using occurred ~ dbern(p1). Then, Model 2 models abundance given occurrence, where count ~ dpois(lambda * occurred). So the models are connected in that way. Site-level variation is the biggest source of variation, so I include site-level random effects.

I can figure out converting this to Stan except for two things:

  1. The occurred ~ dbern(p1) piece that I mention. Occurred ends up being a generated quantity (I think? Or something like that?) and so I don’t know how to code it in a way for Stan to understand it’s generated but also intermediate between the two sub-models.

  2. My primary inferential objective is the derived quantity, which is at the bottom, and is the annual estimate of typical abundance, which is basically avg_abd[t] = lambda2[t] * p1[t]. I think this would go under transformed parameters? Is that correct? Or is this also a generated quantity?

I also am unsure how sum-to-zero random effects are handled in Stan.

Many thanks for any and all advice and guidance. Again, this would be a really big help.

# ---- Data ----

dataList <- list(
  year = df1$year,
  occurrence = df1$occurrence,
  count = df1$count

# ---- Constants ----

constList <- list(
  N1 = 8436,
  S = 30,
  nyears = 33,
  years = sort(unique(df1$year)),
  site = df1$site

# ---- Model code ----

code <- nimbleCode({
  # ---- Priors ----
  # Model 1: Occurrence 
  int_m1 ~ dnorm(0, 0.01)
  beta_m1_year ~ dnorm(0, 0.01)
  # random site intercepts (sum-to-zero)
  sigma_m1 ~ dunif(0, 10)
  for (s in 1:(S-1)) {
    eps_site_m1[s] ~ dnorm(0, sd = sigma_m1)
  eps_site_m1[S] <- -sum(eps_site_m1[1:(S-1)])
  # Model 2: Count
  int_m2 ~ dnorm(0, 0.001)
  beta_m2_year ~ dnorm(0, 0.001)
  # random site intercepts (sum-to-zero)
  sigma_m2 ~ dunif(0, 10)
  for (s in 1:(S-1)) {
    eps_site_m2[s] ~ dnorm(0, sd = sigma_m2)
  eps_site_m2[S] <- -sum(eps_site_m2[1:(S-1)])
  # ---- Likelihoods ----
  for(i in 1:N1) {
    # ---- MODEL 1 ---- 
    # Likelihood
    occurrence[i] ~ dbern(p1[i])
    logit(p1[i]) <- int_m1 + (beta_m1_year * year[i]) + eps_site_m1[site[i]]
    # Prediction from Model 1 to use in count model - this is a "generated quantity"?
    occurred[i] ~ dbern(p1[i])
    # ---- MODEL 2 ---- 
    # Likelihood
    count[i] ~ dpois(lambda2[i] * occurred[i])
    log(lambda2[i]) <- int_m2 + (beta_m2_year * year[i]) + eps_site_m2[site[i]]
  # ---- Derived quantities ----
  # Typical annual abundance
  for(t in 1:nyears){
    logit(p[t]) <- int_m1 + (beta_m1_year * years[t])
    log(lambda[t]) <- int_m2 + (beta_m2_year * years[t])
    avg_abd[t] <- p[t] * lambda[t]

This sounds more like a zero-inflated model (where a count of zero is possible even if the occurrence is 1) than a hurdle (where the count must be nonzero if occurrence is 1). In either case, you should look at the brms package, which can fit these sorts of models relatively easily. Even if you want to do it directly in Stan, I’d still recommend playing with brms to see the Stan code it generates.

Here is a vignette that has a zero inflated poisson; the hurdle syntax is basically the same.

Thanks for this! I know it’s pretty similar to a zero-inflated model. I would do this in brms, but I want to also have a third model that predicts the number of species on the checklist and then uses the predicted - observed length of the checklist as a covariate in both the occurrence and abundance models. And that’s not something I could do in brms but I could do in Stan, which is why I’m asking about how to get this going. I hope that extra explanation helps!

In that case, I’d start by setting up the first/second hurdle models in brms, generate the data & stan code from them, then modify it to get the third. The naming of variables in the Stan code that brms makes can be a bit weird, but figuring out what is what is relatively straightforward. I’d recommend defining your predicted - observed length in the transformed_parameters block (if you want to save it outside the model) or in the model block (if you don’t want to save it). It may be possible to work with the transformation directly in brms, but the syntax is tricky enough that it’s probably just easier to modify the Stan code directly.

Re: sum to zero: There have been some interesting discussions here about the implications of using this kind of transformation for random effects. It may be worth considering a soft centering approach instead of a hard sum-to-zero. If you want to go with it, there’s an example of the transform in the second code block here.

1 Like