Accounting for selection bias in a survival model


The main goal of this study is to analyse the factors that influence the survival of oak seedlings.

We planted 900 acorns in 40 plots. The acorns were all collected from oak woodlands. Each planted acorn was considered healthy and viable during lab tests. Acorns were randomly assigned to each plot. We visited the plots periodically and recorded which acorns had germinated and followed the seedlings over time recording the time of death. We also collected several environmental covariates.

Out of 900 acorns, 400 never germinated. Low germination rates are common, but in this case the acorns that did not germinate seem to be concentrated in 5-7 plots. This seems to suggest that some environmental factors in these 5-7 plots may be reducing the probability of germination.

The initial goal of the study didn’t include assessing germination rates, but I think that we ignore it we will be introducing selection bias in our study. That is, I think we need to condition survival on germination, but I’m not sure how to proceed.

I guess the simplest option is to fit two separate models, a germination model and a survival model.

However, I wondering if it would make sense to stitch both models together like this:

Germination ~ Bernouli(pG)

pG ← logit(a+b1x1…+bn.xn)

Survival ~ exp(lambda * G)

If G = 0, then S = 0

Assuming the above model makes sense, what would we gain by fitting it like this compared to fitting two separate models?



This seems to me like a good use case for a hurdle model. The Stan documentation has a really nice explanation here.

Hurdle models are among the many useful models that can be fit using #interfaces:brms . Here is some quick simulation code to create data like yours and fit a straightforward hurdle model to it:

n_plots <- 40
n_acorns_per_plot <- 25

p_plot_logit <- tidybayes::rstudent_t(n_plots, df = 4, mu = 1, sigma = 1.7)



acorn_plots <- expand.grid(acorn_num = 1:n_acorns_per_plot,
            plot_num = 1:n_plots)

acorn_prob <- transform(acorn_plots, 
          plot_id = paste0("plot",plot_num),
          plot_germ = plogis(p_plot_logit[plot_num]))

germination <- transform(acorn_prob,
          germinated = rbinom(n_plots * n_acorns_per_plot, 1, plot_germ))

tapply(germination$germinated, germination$plot_id, mean)


days_lived <- rpois(n_plots * n_acorns_per_plot, 7)

survival_data <- transform(germination, life = germinated * days_lived)


survival_hurdle_poisson <- brm(bf(life ~ (1 | plot_id),
                               hu ~ (1 | plot_id),
                               family = hurdle_poisson(link = "log")),
                               data = survival_data)


posterior_germination <- tidybayes::spread_draws(survival_hurdle_poisson, b_hu_Intercept, r_plot_id__hu[plot_id,])


posterior_germination %>% 
  ungroup %>% 
  mutate(plot_id = forcats::fct_reorder(plot_id, r_plot_id__hu)) %>% 
  ggplot(aes(y = plot_id, x = r_plot_id__hu + b_hu_Intercept)) + stat_halfeye() + 
  geom_point(aes(y = plot_id, x = qlogis(1 - plot_germ)), colour = "red", data = survival_data)

comments / ideas

In order to get a few plots with very low germination, I used a Student T distribution for each plot’s baseline germination rate. You can see that the model regularizes the posterior estimates for each plot towards the average.

you might want to correlate the hurdle part (germination success or not) from the survival part (how many days the acorns stay alive) using the syntax (1 | id | plot_id)

I’m no expert in survival analysis! I feel like there’s something about the error generated by reducing a continuous lifespan into discrete days, but I’m not sure what.

If some of your seedlings were still alive when you ended the study, then you also need to treat your data as censored (again, something you can do in brms)

You can model variation in survivorship and germination with the same or different predictor variables.

Hope that helps!


Thanks. I was wondering what’s the advantage of having a hurdle model over having two separate models (a germination model and a survival model).

The approach with two separate models is a special case of the hurdle model when there is no crosstalk between the two models. In a hurdle model, however, you could allow for crosstalk. For example, maybe the random effect of plot on gemination is correlated with the random effect of plot on survival.

There are a couple of additional subtleties. If you don’t care about germination and just want inference on probabilities of survival conditional on germination, you might not need the germination model at all. You could just limit your data to the subset of seeds that germinated, and work exclusively with the survival model. Here, you potentially lose some information if (but only if) you the germination and survival probabilities are correlated, so that the germination data can usefully inform inference about the survival probabilities. At first brush, it seems plausible to me that germination and survival might be correlated (if they respond to similar environmental factors), uncorrelated (if for example germination failure is due to seed predators that don’t attack seedlings), or anticorrelated (if environmental conditions that impose stronger filters at the germination step select for individuals that are more vigorous and are likely to survive better once they’ve made it through the germination bottleneck).

1 Like

Thanks, that was very helpful. I definitely think that germination and survival are not independent, however this relationship can plausibly take many forms. I will have to keep thinking about this and do some more data exploration.