Weighted Beta-Binomial Bayesian model

I have been trying to fit a weighted beta-binomial Bayesian model, that is, one that incorporates a survey weight for each individual into the likelihood and combines it with the prior to produce the posterior distribution. I have searched high and low on the internet but have not found an existing R function that implements this directly. I managed to put together a draft in R, and in my tests it seems to work, but I am not sure the code is 100% correct. Could someone take a look at the following example code? That would be much appreciated. :D


library(rstan)

# generate dataset
set.seed(123)
disease_status <- c(rep(1, 300), rep(0, 700))
survey_weights <- sample(1:50, 1000, replace = TRUE)
data <- data.frame( disease_status, survey_weights )

Beta_Binomial_model  <- "
data {
  int<lower=0> n; // number of observations
  int<lower=0, upper=1> y[n]; // outcomes
  vector<lower=0>[n] w; // weights
  real<lower=0> alpha; // prior hyperparameter
  real<lower=0> beta; // prior hyperparameter
}

parameters {
  real<lower=0, upper=1> pi; // probability of success
}

model {
  pi ~ beta(alpha, beta); // prior
  
  for (i in 1:n) {
    target += w[i] * (y[i] * log(pi) + (1 - y[i]) * log(1 - pi));
  }
}
"

# data list to pass to Stan
data_list <- list(
  n = nrow(data),
  y = disease_status,
  w = survey_weights,
  alpha = 1, # specify priors
  beta = 1 # specify priors
)

# run the Stan model
fit <- stan(
  model_code = Beta_Binomial_model,
  data = data_list,
  iter = 2000,
  chains = 4
)

Could someone help? Thanks in advance!

There’s a good chance that somebody will be able to help, but you should probably expect it to take several days before the right people on this forum for any particular question have a chance to take a look and write a response.

Your implementation looks like a correct implementation of a weighted log-likelihood, with two caveats:

  • Weights in a GLM context are not really Bayesian (unless they are non-negative integers representing identical samples, in which case the “weighting” is really just an implementation of a binomial sufficient statistic). If the weights are non-integer, this Stan program does indeed yield a weighted likelihood multiplied by a prior, though not necessarily a valid Bayesian posterior. Some discussion here: https://groups.google.com/g/stan-dev/c/5pJdH72hoM8

  • This is not the model commonly referred to as beta-binomial. This is binomial (Bernoulli) regression with an identity link and a beta prior. In what is commonly referred to as the beta-binomial, the linear predictor (in this case the intercept) predicts the mean of a beta distribution, the scale of the beta is a fitted parameter, and the data are sampled from a beta-binomial mixture parameterized by the mean and scale of the beta.
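
For concreteness, here is a minimal sketch of that parameterization, written as a model string in the same style as your code. This is not your model: it assumes you observe y successes out of N trials per row, and the priors on the mean and scale are placeholders only.

Beta_Binomial_mixture_model <- "
data {
  int<lower=0> n;                // number of observations
  int<lower=0> N[n];             // number of trials per observation
  int<lower=0> y[n];             // number of successes per observation
}

parameters {
  real<lower=0, upper=1> mu;     // mean of the beta mixing distribution
  real<lower=0> kappa;           // scale (precision) of the beta, a fitted parameter
}

model {
  mu ~ beta(1, 1);               // placeholder prior on the mean
  kappa ~ exponential(0.1);      // placeholder prior on the scale
  y ~ beta_binomial(N, mu * kappa, (1 - mu) * kappa);  // beta-binomial mixture
}
"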

Note that you can write (y[i] * log(pi) + (1 - y[i]) * log(1 - pi)) as bernoulli_lpmf(y[i] | pi).


I’m not really familiar with incorporating survey weights in analyses, but I don’t see a reason why this particular model couldn’t be implemented in brms, which would save you from having to wonder whether the Stan code you come up with is correct. It seems like this would just be:

fit <- brm(y  | weights(weights) + trials(...) ~ 1,
           family = beta_binomial(link = "logit", link_phi = "log"),
           ...)

Also, just as another discussion of survey weights in Stan, this thread from the forum may have useful information for you: Survey weights in brms/stan - Simulation based on design effect, feedback sought!


This brms version is what is commonly called the beta-binomial model. The OP’s model is a regular binomial regression that is intercept-only and has a beta prior on the identity-scale intercept. The OP’s model could also be implemented in brms, but it would require specifying a binomial likelihood and then a bit of non-standard tinkering to get the identity-scale beta prior (it could be done by specifying prior bounds and an identity link; it could also be done on the logit scale by passing a logistic(0,1) prior as the Jacobian adjustment and then passing the desired beta prior via stanvars).

Ah, I see. I should’ve looked at the model code more closely before commenting. So, the model is closer to:

fit <- brm(y | weights(weights) ~ 1,
           family = bernoulli(link = "identity"),
           prior = prior("beta(2, 2)", class = "Intercept"),
           ...)

Maybe this is a reflection of the modeling problem being outside my usual area of work, but I don’t see why this isn’t just a logistic regression with sampling weights?

Was the beta prior the reason for estimating the model on the identity scale instead of the logit?

Thank you very much, @jsocolar and @wgoette, for your helpful insights. :) Apologies for not specifying the goal of the analysis in the previous hypothetical example: I aim to estimate the proportion of disease in the sample, taking the survey weights into account.

I have read through the email conversations suggested by @jsocolar and the discussion suggested by @wgoette, as well as several other discussions about survey weights, linked here: What are the "weights" in rstanarm; Survey weighted regression. It now seems clear to me that including survey weights can produce overconfident “posterior distributions” that are not actually based on the observed data. Below, I summarise three slightly different Bayesian approaches to estimating the proportion of disease in the sample; they all produce very similarly overconfident results.

library(rstan)
library(rstanarm)

# generate dataset
{
  set.seed(123)
  disease_status <- c(rep(1, 300), rep(0, 700))
  survey_weights <- sample(1:50, 1000, replace = TRUE) # original survey weights
  data <- data.frame( disease_status, survey_weights )
}

nrow(data) # sample size = 1,000
sum(data$survey_weights) # implied population size = 25,286

# Notes: in this example, each survey weight indicates the number of individuals in the population that
# the corresponding sampled individual represents; however, this does NOT mean that the individuals represented by a given
# sample unit would have the same outcome (i.e., disease status) as that unit;
# in fact, their outcomes are unknown because they are NOT observed.
# (sorry! :( English is not my native language, but I hope this is well articulated)

# In the following, I am going to try three different Bayesian methods to estimate the proportion of disease in the sample,
# while taking into account survey weights. I will try to illustrate the problems with survey weights in Bayesian analysis.


########## METHOD 1 ##########

# Closed-form conjugate update: binomial likelihood with a flat Beta(1, 1) prior,
# directly incorporating the survey weights as weighted counts of successes and trials

weighted_sum_disease <- sum(data$disease_status * data$survey_weights)
total_weighted_population <- sum(data$survey_weights)

alpha_posterior <- 1 + weighted_sum_disease
beta_posterior <- 1 + total_weighted_population - weighted_sum_disease

mean_posterior <- alpha_posterior / (alpha_posterior + beta_posterior)
lower_bound <- qbeta(0.025, alpha_posterior, beta_posterior)
upper_bound <- qbeta(0.975, alpha_posterior, beta_posterior)

first_method <- round(c(mean_posterior, lower_bound, upper_bound), 6)
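
# (for comparison: without the weights, the same conjugate update gives
#  Beta(1 + 300, 1 + 700), whose 95% interval is much wider than the weighted one above)
unweighted_interval <- qbeta(c(0.025, 0.975),
                             1 + sum(data$disease_status),
                             1 + nrow(data) - sum(data$disease_status))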


########## METHOD 2 ##########

# Weighted Bernoulli likelihood fitted in Stan, with a flat Beta(1, 1) prior;
# the survey weights directly multiply the per-observation log-likelihood contributions

Binomial_beta_model  <-
"data {
  int<lower=0> n; // number of observations
  int<lower=0, upper=1> y[n]; // outcomes
  vector<lower=0>[n] w; // weights
  real<lower=0> alpha; // prior hyperparameter
  real<lower=0> beta; // prior hyperparameter
}

parameters {
  real<lower=0, upper=1> pi; // probability of success
}

model {
  pi ~ beta(alpha, beta); // prior
  
  for (i in 1:n) {
    target += w[i] * bernoulli_lpmf(y[i] | pi);
  }
}
"

# data list to pass to Stan
data_list <- list(
  n = nrow(data),
  y = data$disease_status,
  w = data$survey_weights,
  alpha = 1, # specify flat prior
  beta = 1 # specify flat prior
)

# run the Stan model
fit1 <- stan(
  model_code = Binomial_beta_model,
  data = data_list,
  iter = 2000,
  chains = 4
)

# summary of the fit
print( fit1 )

# extract posterior samples for the parameter 'pi'
posterior_samples_fit1 <- rstan::extract( fit1, pars = "pi" )$pi

second_method <- round( c(mean(posterior_samples_fit1),
                          quantile(posterior_samples_fit1, probs = 0.025),
                          quantile(posterior_samples_fit1, probs = 0.975)), 6)


########## METHOD 3 ##########

# intercept-only logistic regression fitted with stan_glm (rstanarm),
# with survey weights passed through the weights argument
fit2 <- stan_glm( disease_status ~ 1, 
                  family = binomial(link = "logit"),
                  weights = survey_weights,
                  prior_intercept = NULL,
                  iter = 3000,
                  data = data,
                  seed = 123 )

print(summary(fit2))

posterior_samples_fit2 <- as.matrix(fit2)
mean_posterior_log_odds <- mean(posterior_samples_fit2)
credible_interval_log_odds <- quantile(posterior_samples_fit2, c(0.025, 0.975))

mean_probability <- exp(mean_posterior_log_odds) / (1 + exp(mean_posterior_log_odds))
lower_probability <- exp(credible_interval_log_odds[1]) / (1 + exp(credible_interval_log_odds[1]))
upper_probability <- exp(credible_interval_log_odds[2]) / (1 + exp(credible_interval_log_odds[2]))

third_method <- round(c(mean_probability, lower_probability, upper_probability), 6)
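
# (note: because the inverse logit is monotone, the interval endpoints can be
#  transformed directly as above; the posterior mean on the probability scale
#  could instead be computed by transforming every draw first, e.g.
#  mean(plogis(posterior_samples_fit2)), which can differ slightly from
#  transforming the mean log-odds)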


########## Summarize results across methods ########## 

combined_matrix <- rbind(first_method, second_method, third_method)
rownames(combined_matrix) <- c("First Method", "Second Method", "Third Method")
colnames(combined_matrix) <- c("Mean", "Lower 95% PI", "Upper 95% PI")

print(combined_matrix)
# (a screenshot of the output is attached below, in case you don't want to run the code)

# as shown in the results, the 95% central posterior intervals are extremely narrow;
# this seems to be because the calculation is based on the weighted data,
# the majority of which is NOT actually observed

I have two follow-up questions: are the three methods valid for estimating the proportion of disease in this case? And if survey weights are not included, which method is the best way to estimate the proportion of disease? It seems to me that all three methods are equivalent.

(I am not a statistician and I mainly learn statistics by myself from reading and discussion; so sometimes I can ask extremely “stupid” questions :) )

From the discussions linked above, many people suggested using multilevel regression and post-stratification (MRP). I will update the results using this method later here.

Your sample size is 1,000, but it looks like you’ve calculated the weights such that your implied sample size is 25,286. The first level of overconfidence should be reduced if you rescale the weights so they keep the same proportions relative to one another but sum to 1,000 (your actual sample size). A second level of overconfidence could be reduced by my suggested approach, which is to calculate the design effect for your weights and then divide the weights by the design effect (sketched in code below).

None of this gets around the overarching critique that weights are not ‘Bayesian’, but at a practical level it can produce some relatively valid-seeming, rough-and-ready estimates.
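
A rough sketch of that rescaling (using Kish's approximation for the design effect; survey_weights is the weight column from the example above):

w <- data$survey_weights

# 1) rescale so the weights keep their relative sizes but sum to the actual sample size
w_norm <- w * nrow(data) / sum(w)

# 2) Kish's approximate design effect for unequal weighting
deff <- nrow(data) * sum(w_norm^2) / sum(w_norm)^2

# 3) divide by the design effect; the adjusted weights now sum to the
#    effective sample size (n / deff), which further widens the posterior intervals
w_adj <- w_norm / deff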


Hi, in your previous post you mentioned using MRP. Do you know how MRP could be implemented in this hypothetical example, where the aim is to estimate the prevalence of a disease with a survey weight for each sampled individual? Thanks in advance for your help.

Yes, you’re right about the implied sample size of 25,286. I used the weights this way to illustrate the overconfident posterior intervals produced by the weighted Bayesian approaches.