Binomial model with number of trials as a variable

Hi,

Apologies if I don’t use the correct terminology. I’m trying to build a multi-level model where the number of trials is itself a variable. It’s probably easier to explain with the formulas:

Y_i \sim \textrm{Binomial}(n_i, p_i)
n_i \sim \textrm{Poisson}(\lambda_i)
\log(\lambda_i) = \beta_i + \textrm{offset}(\log t_i)
\textrm{logit}(p_i) = \alpha_i
\alpha_i \sim \textrm{Normal}(0, 1)
\beta_i \sim \textrm{Normal}(0, 1)

Code to create test data:

fdata = data.frame(t = runif(100, min = 0, max = 15))          # exposure times
fdata$log_t = log(fdata$t)                                     # log-time offset
fdata$n = rpois(100, fdata$t * rnorm(100, mean = 3, sd = 1))   # number of trials
fdata$y = rbinom(n = 100, size = fdata$n, prob = 0.8)          # observed successes

I’ve tried this code:


b_mod <-
  brm(data = fdata,
      family = binomial,
      bf(y | trials(n) ~ 1,
         n ~ 1 + offset(log_t), nl = TRUE),
      prior = c(prior(normal(0, 1), class = Intercept),
                prior(normal(0, 1), nlpar = b)),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11)

But I get the following error:

“Error: The parameter ‘n’ is not a valid distributional or non-linear parameter. Did you forget to set ‘nl = TRUE’?”
which makes me think that I can’t have the number of trials as part of the model?

Stan does not support sampling discrete parameters, but in many cases there is a workaround: you can still fit the model in Stan by marginalizing out the discrete parameters. In your case, this would mean marginalizing out n in your full probability model, leaving only continuous parameters, which Stan can handle.

Fortunately, marginalizing a Poisson-distributed number of trials out of a binomial model is fairly straightforward; see this Cross Validated question for more details.

To sum it up, the model goes from:

Y_i \sim \textrm{Binomial}(n_i, p_i)
n_i \sim \textrm{Poisson}(\lambda_i)

To, after marginalization:

Y_i \sim \textrm{Poisson}(\lambda_i p_i)
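
For completeness, the marginalization itself is a standard Poisson-thinning calculation (sketched here in case the intermediate step is useful):

P(Y_i = y) = \sum_{n=y}^{\infty} \binom{n}{y} p_i^y (1-p_i)^{n-y} \frac{\lambda_i^n e^{-\lambda_i}}{n!}
= \frac{(\lambda_i p_i)^y}{y!} e^{-\lambda_i} \sum_{m=0}^{\infty} \frac{\big(\lambda_i (1-p_i)\big)^m}{m!}
= \frac{(\lambda_i p_i)^y e^{-\lambda_i p_i}}{y!}

which is exactly the \textrm{Poisson}(\lambda_i p_i) pmf.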

Updating your model with this change will allow you to use Stan to sample it, and then you should be able to recover posterior expectations on n_i by sampling from the posterior of \lambda_i that you’d get from the updated model.
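
For instance, here is a minimal sketch of the marginalized fit in brms, reusing your simulated fdata from above (note that the intercept of this fit estimates the log of the product of the per-unit-time rate and p, not either quantity separately):

library(brms)

# marginalized model: y ~ Poisson(lambda_i * p_i), with the log-time offset
fit_marg <- brm(y ~ 1 + offset(log_t),
                family = poisson(),
                data = fdata,
                prior = prior(normal(0, 1), class = Intercept),
                iter = 2000, warmup = 1000, chains = 4, cores = 4,
                seed = 11)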


Wow, what a quick answer, and it exactly solves my problem. I’m embarrassed :< Thank you.


Thought I would share my solution. I couldn’t seem to get it to work in brms, so I just wrote the Stan program instead. Here’s what I came up with:


data {
  int<lower=0> N;                        // number of observations
  int<lower=0> n[N];                     // observed number of trials per observation
  vector[N] offsets;                     // time offset (log scale)
  int<lower=0> Y[N];                     // observed number of successes
}

parameters {
  real<lower=0> lambda;                  // intercept of the Poisson log-rate
  real<lower=0, upper=1> prob_success;   // binomial success probability
}

model {
  // per-observation log-rate: intercept plus log-time offset
  vector[N] lambda_offset = lambda + offsets;

  // binomial likelihood for Y given the observed trials n
  target += binomial_lpmf(Y | n, prob_success);

  // marginalized Poisson likelihood for Y: rate = exp(lambda_offset) * prob_success
  target += poisson_log_lpmf(Y | lambda_offset + log(prob_success));
}

generated quantities {
  // population-level intercept of the marginalized Poisson rate (log scale)
  real final_pois = lambda + log(prob_success);
}
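
To run it, I pass in the simulated data from above, roughly like this (the file name my_model.stan is just a placeholder for wherever the script is saved):

library(rstan)

stan_data <- list(
  N = nrow(fdata),
  n = fdata$n,
  offsets = fdata$log_t,  # the time offset enters on the log scale
  Y = fdata$y
)

fit <- stan(file = "my_model.stan",  # placeholder path to the Stan script above
            data = stan_data,
            iter = 2000, warmup = 1000, chains = 4, cores = 4,
            seed = 11)
print(fit, pars = c("lambda", "prob_success", "final_pois"))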


I think there are a couple of subtleties here:

@svonpfefer, if I understand correctly, you want to treat n as observed. At this point, your model breaks apart into two completely separate models, a Poisson model for n and a binomial model for y. You can fit them separately in two calls to brms. The only reason to fit them together would be if the binomial probabilities p and Poisson rates lambda are not modeled as independent, but rather as correlated (for example via a random effect that is multivariate normal on the link scales). But that doesn’t seem to be what you’re up to (if you want, though, I believe it can be achieved through the multivariate response syntax of brms).
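
Roughly something like this sketch, reusing the simulated fdata and the normal(0, 1) intercept priors from the original post:

library(brms)

# Poisson model for the observed number of trials n, with the log-time offset
fit_n <- brm(n ~ 1 + offset(log_t),
             family = poisson(),
             data = fdata,
             prior = prior(normal(0, 1), class = Intercept),
             chains = 4, cores = 4, seed = 11)

# binomial model for y given the observed trials n
fit_y <- brm(y | trials(n) ~ 1,
             family = binomial(),
             data = fdata,
             prior = prior(normal(0, 1), class = Intercept),
             chains = 4, cores = 4, seed = 11)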

This might be beside the point now, but this approach generally won’t yield inference about lambda and p individually unless lambda and/or p have very strong priors. Otherwise lambda and p are not separately identified: you will get good inference on their product, but you won’t be able to recover useful inference on either quantity in isolation. (Technically we can still get separate inference if we have tons of data and use different sets of covariates on p and lambda, or if we have tons and tons of data, specify different link functions (e.g. logit and log), and feel very, very confident that our link functions capture the true generative process, but this is not the common applied modeling scenario.)
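
As a quick toy illustration (made-up numbers, not from the post): two different (lambda, p) pairs with the same product give exactly the same marginal likelihood for y, so y alone cannot distinguish them.

set.seed(1)
y_sim <- rpois(1000, lambda = 6 * 0.5)           # data generated with lambda = 6, p = 0.5
sum(dpois(y_sim, lambda = 6 * 0.5, log = TRUE))  # log-likelihood at (lambda = 6, p = 0.5)
sum(dpois(y_sim, lambda = 3 * 1.0, log = TRUE))  # identical value at (lambda = 3, p = 1.0)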

In case it’s of interest, there is a class of models known in ecology as N-mixture models that get separately identified inference on lambda and p by collecting multiple measurements of y that are known a priori to derive from the same value of n. However, in that case we cannot leverage the fact that the Poisson-binomial mixture marginalizes to a Poisson; instead we have to marginalize over n explicitly, summing up to some upper bound chosen to be large enough to include essentially all of the prior probability mass.
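
To make the explicit sum concrete, here is a rough R sketch (a hypothetical helper, not from any package) of the per-site likelihood when replicate counts y_rep share one latent n, summing n from max(y_rep) up to an upper bound K:

# hypothetical helper for illustration only
nmix_site_loglik <- function(y_rep, lambda, p, K) {
  n_vals <- max(y_rep):K  # candidate values of the latent n
  # log P(n | lambda) + sum over replicates of log P(y | n, p), for each candidate n
  log_terms <- dpois(n_vals, lambda, log = TRUE) +
    sapply(n_vals, function(n) sum(dbinom(y_rep, size = n, prob = p, log = TRUE)))
  # log-sum-exp over the candidate n values
  m <- max(log_terms)
  m + log(sum(exp(log_terms - m)))
}

# example: three repeat counts assumed to share the same latent n, upper bound K = 50
nmix_site_loglik(c(3, 5, 4), lambda = 6, p = 0.7, K = 50)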


Hi @jsocolar, thank you for your interest in my problem. I’m actually building up to a case where p and \lambda might be correlated, but for the moment I’m treating them as independent.

I might have a look at N-mixture models; fortunately, for my problem the value of n has a finite upper bound (< 5), so the explicit sum over n shouldn’t be a problem.