Predict beta part of beta-binomial model for new groups without number of trials


I am fitting a beta-Binomial model to describe a probability of success p given n successes from N trials as a function of explanatory variable sets X and Z describing the mean mu and variance phi of a beta distribution as follows (in JAGS):

  for (i in 1:nobs) {
    n[i] ~ dbinom(p[i], N[i])
    p[i] ~ dbeta(alpha[i], beta[i])
    alpha[i] <- mu[i] * phi[i]
    beta[i] <- (1 - mu[i]) * phi[i]
    mu[i] <- atan(mu_[i]) / pi + (1/2)
    mu_[i] <- mu_a_mn + mu_a_re[group[i]] + inprod(X[i,], mu_b)
    log(phi[i]) <- phi_[i]
    phi_[i] <- phi_a_mn + phi_a_re[group[i]] + inprod(Z[i, ], phi_b)

There are terms to allow for group-level effects in mu and phi and mu is Cauchit-transformed rather than the conventional logit-transformed to allow more flexibility to fit the perceived highly variable p.

Simulations suggest I can recover ok generating parameter estimates when fit in brms using the beta_binomial2 custom family (Define Custom Response Distributions with brms) as follows (in R):

  # data structure
  nyears <- 25
  ngroups <- 9
  nobs <- nyears * ngroups
  nx <- 3
  nz <- 1

  # parameters
  ## mu
  mu_a_mn <- 2
  mu_a_sd <- 1
  mu_a_re <- rnorm(ngroups, 0, mu_a_sd)
  mu_b <- c(1.11, -1.56, 1.2)
  ## phi
  phi_a_mn <- 3
  phi_a_sd <- 2
  phi_a_re <- rnorm(ngroups, 0, phi_a_sd)
  phi_b <- -1.5
  # simulate data
  groups <- rep(1:ngroups, each = nyears)
  years <- rep(1:nyears, times = ngroups)
  ## mu
  X <- sapply(rep(0, nx), rnorm, n = nobs, sd = 1)
  mu_ <- mu_a_mn + mu_a_re[groups] + X %*% mu_b
  mu <- as.numeric(atan(mu_) / pi + (1/2))
  ## phi
  Z <- sapply(rep(0, nz), rnorm, n = nobs, sd = 1)
  phi_ <- phi_a_mn + phi_a_re[groups] + Z %*% phi_b
  phi <- exp(as.numeric(phi_))
  ## Beta-distributed probability of success (p)
  alpha <- mu * phi
  beta <- (1 - mu) * phi
  p <- rbeta(nobs, alpha, beta)
  ## trial sizes (N)
  Nhat <- sample(x = seq(50, 2000, 25), size = ngroups)
  Nmat <- sapply(1:ngroups, function(v) rpois(nyears, Nhat[v]))
  N <- as.numeric(Nmat)
  ## number of successes (n)
  n <- rbinom(nobs, N, p)

  # fit using brms
  ## store data in data frame
  d <- data.frame(n, N, p, X, Z, groups, years)
  ## custom family: beta binomial mean and var with cauchit mu link
  beta_binomial2 <- custom_family('beta_binomial2',
                                  dpars = c('mu', 'phi'),
                                  links = c('cauchit', 'log'),
                                  lb = c(NA, 0),
                                  type = 'int', vars = 'vint1[n]')
  ## corresponding *lpmf and *rng based on beta_binomial in stan
  stan_funs <- '
    real beta_binomial2_lpmf(int y, real mu, real phi, int T) {
      return beta_binomial_lpmf(y | T, mu * phi, (1 - mu) * phi);
    int beta_binomial2_rng(real mu, real phi, int T) {
      return beta_binomial_rng(T, mu * phi, (1 - mu) * phi);
  ## put the stan functions in the functions block and add trials
  stanvars <- stanvar(scode = stan_funs, block = 'functions')

  ## fit
  brm_fit <- brm(bf(n | vint(N) ~ 1 + X1 + X2 + X3 + (1|groups),
                    phi ~ 1 + Z + (1|groups)),
                 data = d,
                 family = beta_binomial2, 
                 stanvars = stanvars,
                 chains = 3, cores = 3)
  ## print a summary

Now, I would to use this model to predict p for new groups where I do not have N (I do have n), i.e.:

  # prediction functions
  posterior_predict_beta_binomial2 <- function(i, prep, ...) {
    mu <- prep$dpars$mu[, i]
    phi <- prep$dpars$phi[, i]
    trials <- prep$data$vint1[i]
    beta_binomial2_rng(mu, phi, trials)

  # predict to new groups without N
  d1 <- d
  d1$groups <- factor(as.integer(d$groups) + ngroups)
  d1$N <- NA
  predict(brm_fit, newdata = d1, allow_new_levels = TRUE)
  # Error: 'vint' requires whole numbers as input.

Of course it breaks because it is predicting the Binomial part too and it is looking for integers in N and not the NAs I have given.

Is it possible to predict the beta part only using brms?
If not, is there any reason why I could not draw samples from the relevant posteriors to generate predictions manually?

Many thanks for any guidance,

For the posterior predict function, you could probably add some lines that use the beta distribution RNG if trials is NA.

if you want to predict the number of trials, I don’t think a (beta) binomial model is the right choice. The binomial or beta-binomial is the distribution of successes when trials are known. You might look into the negative binomial distribution, which (among other definitions) gives the number of trials until n successes.

1 Like

Hi Christopher-Peterson,

Thanks for your reply.

Please excuse my ignorance, but how might I use the beta distribution random number generator in the posterior predict function?

From my reading, it would seem that I could coded predictions in the generated quantities block of a Stan model, and JAGS code for this part might be:

  for (g in 1:npredgroups) {
    mu_a_re_tilde[g] ~ dnorm(0, mu_a_tau)
    phi_a_re_tilde[g] ~ dnorm(0, phi_a_tau)
  for (j in 1:npreds) {
    p_tilde[j] ~ dbeta(alpha_tilde[j], beta_tilde[j])
    alpha_tilde[j] <- mu_tilde[j] * phi_tilde[j]
    beta_tilde[j] <- (1 - mu_tilde[j]) * phi_tilde[j]
    mu_tilde[j] <- atan(mu_tilde_[j]) / pi + (1/2)
    mu_tilde_[j] <- mu_a_mn + mu_a_re_tilde[group_tilde[j]] + inprod(X_tilde[j,], mu_b)
    log(phi_tilde[j]) <- phi_tilde_[j]
    phi_tilde_[j] <- phi_a_mn + phi_a_re_tilde[group_tilde[j[] + inprod(Z[j, ], phi_b);

where npredrgroups is the number of groups in the prediction dataset, npreds is the number of observations in the prediction dataset, and X_tilde and Z_tilde are the explanatory variables for the predictions in the prediction dataset.

This all seems quite feasible (although I’d have to work on coding it up in Stan), but would it be possible in brms?

Thanks again,

Hi aammd,

You’re right that a binomial isn’t the right choice of model to predict a number. But, do you think it could be a reasonable approach to my specific problem? Further description: I have two datasets - one with the number of trials (“complete”) and one without the number of trials (“incomplete”) - and I plan to use the binomial model to estimate p as a function of explanatory variables in the complete dataset and use those functions to estimate p for the same explanatory variables in the incomplete dataset.

I will think about the negative binomial more.

Thanks for your thoughts,

It seems to me that this is a missing data problem! Rather than two datasets, do you perhaps have just one large dataset, where some of the trials are missing? If so, it should be possible to get a posterior distribution for both p and also the missing values of N_i, the number of trials for each response y_i. However I don’t know if this can be written with the mi() helper function in brms

I agree with @aammd that this looks more like a missing data problem. Other than completeness/incompleteness, do you have a good reason to expect p to be different between your datasets?

Let’s say that your complete trial numbers are N_{(c)}, complete counts are Y_{(c)}, and incomplete counts are Y_{(u)}. If you use a Poisson distribution to model N, then you could marginalize over the unknown N_{(u)} for a model like this:

\begin{aligned}N_{(c)} &\sim \text{Poisson}(\lambda)\\ p &\sim \text{Beta}(\mu, \sigma)\\ Y_{(c)} &\sim \text{Binom}(p, N_{(c)})\\ Y_{(u)} &\sim \text{Poisson}(p \lambda)\\ \end{aligned}

The \mu, \sigma, and \lambda parameters could all be functions of regression coefficients and the data, of course. If you want to estimate N_{(u)}, then you could use the Poisson rng in the generated quantities (without multiplying by p). I think this may need to be written directly in Stan, however.

I’m not sure if that marginalization trick works with a negative binomial instead of a Poisson, though.

Thanks, aammd. In fact, it is more that I would like to parameterise a model with the complete dataset, which is assumed to be a representative sample of the incomplete dataset, and then use that model to predict the number of trials for the incomplete dataset.

If the data were missing, then your proposed solution using the mi() helper function could be very interesting.

Thanks, Christopher-Peterson. RE missing data: please see my response to @aammd. I don’t expect p to be especially different between the complete and incomplete datasets (the complete cases are assumed to be representative of the incomplete cases), but the reason for treating this as a binomial problem was because I have covariates for all complete and incomplete cases that I will use to describe p.

I like very much your proposed solution to use a binomial-Poisson coupling, and I was originally using this approach. However, I have since moved to a beta-binomial, in part because I didn’t feel confident expressing \lambda as a function of covariates (not that I need to).

Update: I followed @aammd’s suggestion to look into the negative binomial as way of predicting the number of trials directly: all the parameterisations of the negative binomial I could find available in stan (and so brms) would require me to rethink my problem in terms of counts (albeit it mean counts with overdispersion). As I mentioned before, my reason for tackling this as a binomial problem was to allow me to describe p from covariates. In my reading, I found reference to a “beta-negative binomial” that appears to be a compound of a beta and a negative binomial distribution. See Beta negative binomial distribution - Wikipedia and extraDistr/beta-negative-binomial-distribution.R at master · twolodzko/extraDistr · GitHub. While I like the look of this option, I have rarely seen it used… I will continue my reading and thinking.

Many thanks again for your replies.