Binomial regression when number of trials is right censored

wpetry · May 6, 2022, 10:01pm

Suppose you sow a known number of seeds—0 to 100—onto patches of ground and observe how many seedlings germinate. Simultaneously, additional seeds blow onto your study patches, causing the number of germinants to exceed the number of seeds sown in some cases. If we assume that the rate of background seed addition and germination rates are constant across patches, I think it ought to be possible to recover the germination rate, especially if background seed addition is fairly low compared to the range of seeds added. Is it possible to fit such a model with brms? If so, what is the correct use of aterms needed to specify it?

Relates to Estimating the binomial rate when number of trials is uncertain, but seeks a brms solution.

Synthetic data and attempted solution

I’m just spitballing here…

# set parameters
pr_germ <- 0.3
mu_bkgd <- 5
disp_bkgd <- 3

# simulate data
set.seed(356456)
dat <- data.frame(seeds_added = rep(c(0L, 5L, 10L, 50L, 100L), each = 5), # sown seeds
                  seeds_bkgd = rnbinom(n = 25, mu = mu_bkgd, size = disp_bkgd)) # unknown seeds
dat$seeds_total <- dat$seeds_added + dat$seeds_bkgd
dat$n_germ <- rbinom(n = 25, size = dat$seeds_total, prob = pr_germ)  # number of germinants
dat$seeds_est <- dat$seeds_added  # lower bound of seeds_total

obs_dat <- dat[, c("seeds_added", "n_germ", "seeds_est")]  # we observe this

# fit model
mod <- brm(bf(n_germ | trials(seeds_est) ~ 1, family = binomial())+
            bf(seeds_est | cens("right") ~ 1 + seeds_added, family = negbinomial())+
            set_rescor(FALSE),
            # b_seedsest_seeds_added is fixed to 1
            # (i.e., each seed added increments the total seed by 1)
            prior = prior(constant(1), class = "b", coef = "seeds_added", resp = "seedsest"),
            data = obs_dat)

This produces an error because the number of successes (germinants) exceeds the number of trials in some replicates.
Error: Number of trials is smaller than the number of events.

Operating System: MacOS 12.3
brms Version: 2.17.0

jsocolar · May 7, 2022, 8:11am

I might have thought it would be reasonable to assume that the number of background seeds coming in is Poisson distributed, but your negative binomial approach should be similar in implementation. Your example also assumes that the germination rate is constant. I think you might be able to relax this assumption eventually if desired, especially if you’re willing to assume that the variation in the germination of the “background” seeds mirrors variation in the germination rate of the added seeds. In general, both a higher germination rate and a higher rate of background seed deposition will increase the observed counts. Fortunately, two features of the data should allow you to distinguish between the two. First, the binomial sampling variation might be distinguishable from Poisson (or negative binomial) variation. (But caution: when the number of trials gets large binomial distributions converge towards Poisson distributions.) But more powerfully, you have known variation in the number of binomial trials which we expect to be independent of any variation in the background seed deposition rate. Thus, the model is y = a + b where a \sim binomial(p, k) and b \sim Poisson(\lambda). (I’ll use the Poisson in this example, but it can be replaced with negbin).
On the Poisson side of the model the germination rate is completely non-identified from the background deposition rate.

What we need to do to fit this model exactly is to marginalize over all of the possible combinations of a and b. The number of such combinations is y + 1, ranging from (0, y) to (y, 0), which probably isn’t prohibitive for marginalization. I don’t think you can achieve this in brms except via a custom family, so it’ll require a bit of Stan code to get this running. It’s possible that you can get some kind of approximation solution in brms, but to fit this model exactly using Stan I think you’ll need to resort to explicit hand-coded marginalization.

wpetry · May 10, 2022, 9:18pm

Thanks @jsocolar! Searching based on the ideas in your response, I found a CrossValidated post that addresses this issue: probability - Marginalizing a Poisson-distributed count parameter in a Binomial Distribution - Cross Validated

The marginalization for the Poisson case is very elegant. Here’s some Stan code that works for simulated data:

data {
  int<lower=0> N;  // number of observations
  int<lower=0> nsow[N];
  int<lower=0> ngerm[N];
  real<lower=0> theta_a;
  real<lower=0> theta_b;
  real<lower=0> lambda_mu;
  real<lower=0> lambda_sd;
}

parameters {
  real<lower=0,upper=1> theta;   // probability of germination
  real<lower=0> lambda;          // rate of background seed addition
}

model {
  // priors
  theta ~ beta(theta_a, theta_b);
  lambda ~ normal(lambda_mu, lambda_sd);
  // model
  for(i in 1:N) {
   ngerm[i] ~ poisson((lambda + nsow[i]) .* theta); 
  }
}

I’m still working on the negative binomial case…

jsocolar · May 10, 2022, 10:17pm

Caution: a Poisson-distributed count parameter in a binomial distribution is not the same thing as the sum of a Poisson plus a binomial distribution! The former marginalizes to Poisson; the latter does not! To see intuitively that this is the case, suppose that \lambda is very small. Then the sum is approximately equal to the binomial term.

maxbiostat · May 10, 2022, 10:31pm

Negative binomial is the same.
If X is the observation and Y is the negative binomial latent random variable, here’s the maths:

\begin{align} \nonumber \operatorname{Pr}(X_i = x \mid \mu, \phi, p) &= \sum_{n = x}^\infty \operatorname{Pr}(X_i \mid Y_i = n, p)\operatorname{Pr}(Y_i = n \mid \mu, \phi), \\ \nonumber &= \sum_{n = x}^\infty \left(\frac{\phi}{\mu + \phi}\right)^\phi \binom{n + \phi -1}{n} \left( \frac{\mu}{\mu + \phi}\right)^n \binom{n}{x} p^x (1-p)^{n-x},\\ \nonumber &= \left(\frac{\phi}{\mu + \phi}\right)^\phi \left(\frac{p}{1-p}\right)^x \sum_{n = x}^\infty \binom{n}{x}\binom{n + \phi -1}{n} \left( \frac{\mu (1-p)}{\mu + \phi}\right)^n, \\ \nonumber &= \binom{x + \phi -1}{x} \left(\frac{\phi}{\mu + \phi}\right)^\phi \left( \frac{p}{1-p} \right)^x \left( \frac{\mu(1-p)}{\mu + \phi} \right)^x \left( \frac{p\mu +\phi}{\mu + \phi} \right)^{-(\phi + x)}, \\ &= \binom{x + \phi -1}{x} \left(\frac{\phi}{p\mu + \phi}\right)^\phi \left( \frac{p\mu}{p\mu + \phi} \right)^x. \end{align}

Thus the marginal distribution is a negative binomial with parameters p\mu and \phi.

Topic		Replies	Views
Very large credible intervals for odds ratio for logistic regression in brms Modeling fitting-issues	3	2755	August 22, 2019
Modeling with bernoulli or binomial distribution General brms	3	785	September 16, 2022
Settin priors for unknown data in zero-one inflated beta Modeling rstan , fitting-issues , specification , brms	6	356	February 27, 2024
'Trials ' missing in model description with brm brms	3	1929	September 14, 2023
Brms -- Zero inflated binomial -- trials -- Presence absence data Modeling brms	3	897	October 28, 2022

Binomial regression when number of trials is right censored

Synthetic data and attempted solution

Related topics