Thank you very much, @jsocolar and @wgoette, for your helpful insights. :) Apologies for not specifying the goal of the analysis in the previous hypothetical example: I aimed to estimate the proportion of disease in the sample while taking the survey weights into account.
I have read through the email conversations suggested by @jsocolar and the discussion suggested by @wgoette, as well as several other discussions about survey weights, linked here: What are the "weights" in rstanarm; Survey weighted regression. It now seems clear to me that including survey weights in this way can produce an overconfident “posterior distribution” that is not actually based only on the observed data. Below, I summarise three slightly different Bayesian approaches to estimating the proportion of disease in the sample; all of them produce very similar, overconfident results.
library(rstan)
library(rstanarm)
# generate dataset
{
set.seed(123)
disease_status <- c(rep(1, 300), rep(0, 700))
survey_weights <- sample(1:50, 1000, replace = TRUE) # original survey weights
data <- data.frame( disease_status, survey_weights )
}
nrow(data) # sample size = 1,000
sum(data$survey_weights) # population size = 25,286
# Notes: in this example, the survey weight indicates the number of individuals in the population
# that each individual in the sample represents; however, this does NOT mean that the individuals
# represented by a given sample unit would have the same outcome (i.e., disease status) as that sample unit;
# in fact, their outcomes are unknown because they are NOT observed.
# (sorry! :( English is not my native language, but I hope this is well articulated)
# In the following, I am going to try three different Bayesian methods to estimate the proportion of disease in the sample,
# while taking into account survey weights. I will try to illustrate the problems with survey weights in Bayesian analysis.
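# For reference, the unweighted and the design-weighted point estimates of the proportion,
# which the posterior means from the methods below should roughly reproduce:
mean(data$disease_status)                               # unweighted sample proportion = 0.3
weighted.mean(data$disease_status, data$survey_weights) # survey-weighted point estimate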
########## METHOD 1 ##########
# Binomial model + beta prior (flat prior)
# directly incorporate survey weights in the likelihood estimation
weighted_sum_disease <- sum(data$disease_status * data$survey_weights)
total_weighted_population <- sum(data$survey_weights)
alpha_posterior <- 1 + weighted_sum_disease
beta_posterior <- 1 + total_weighted_population - weighted_sum_disease
mean_posterior <- alpha_posterior / (alpha_posterior + beta_posterior)
lower_bound <- qbeta(0.025, alpha_posterior, beta_posterior)
upper_bound <- qbeta(0.975, alpha_posterior, beta_posterior)
first_method <- round(c(mean_posterior, lower_bound, upper_bound), 6)
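# For comparison (not one of the three methods): the same conjugate update WITHOUT the weights,
# i.e. a flat Beta(1, 1) prior combined with the 300 cases among the 1,000 observed units.
# Its interval is far wider than the weighted one above, which already shows where the
# extra "precision" is coming from.
alpha_unweighted <- 1 + sum(data$disease_status)              # = 301
beta_unweighted  <- 1 + nrow(data) - sum(data$disease_status) # = 701
unweighted_reference <- round(c(alpha_unweighted / (alpha_unweighted + beta_unweighted),
                                qbeta(0.025, alpha_unweighted, beta_unweighted),
                                qbeta(0.975, alpha_unweighted, beta_unweighted)), 6)
unweighted_reference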
########## METHOD 2 ##########
# Bernoulli model + a flat Beta(1, 1) prior
# directly incorporate survey weights as multipliers on the per-observation log-likelihood
Binomial_beta_model <-
"data {
  int<lower=0> n;              // number of observations
  int<lower=0, upper=1> y[n];  // outcomes
  vector<lower=0>[n] w;        // weights
  real<lower=0> alpha;         // prior hyperparameter
  real<lower=0> beta;          // prior hyperparameter
}
parameters {
  real<lower=0, upper=1> pi;   // probability of success
}
model {
  pi ~ beta(alpha, beta);      // prior
  for (i in 1:n) {
    // each log-likelihood term is multiplied by its weight,
    // i.e. as if observation i had been observed w[i] times
    target += w[i] * bernoulli_lpmf(y[i] | pi);
  }
}
"
# data list to pass to Stan
data_list <- list(
  n = nrow(data),
  y = data$disease_status,
  w = data$survey_weights,
  alpha = 1, # specify flat prior
  beta = 1   # specify flat prior
)
# run the Stan model
fit1 <- stan(
  model_code = Binomial_beta_model,
  data = data_list,
  iter = 2000,
  chains = 4
)
# summary of the fit
print( fit1 )
# extract posterior samples for the parameter 'pi'
posterior_samples_fit1 <- rstan::extract( fit1, pars = "pi" )$pi
second_method <- round(c(mean(posterior_samples_fit1),
                         quantile(posterior_samples_fit1, probs = 0.025),
                         quantile(posterior_samples_fit1, probs = 0.975)), 6)
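# A variant, just to illustrate that the overconfidence comes from the SCALE of the weights
# rather than from their relative values: rescale the weights so they sum to the sample size
# (n = 1,000) instead of to the population total. The point estimate stays roughly the same,
# but the posterior spread is now governed by the 1,000 observed units. Note that this is
# still a pseudo-likelihood trick, not a fully justified treatment of design weights.
data_list_scaled <- data_list
data_list_scaled$w <- data$survey_weights * nrow(data) / sum(data$survey_weights)
fit1_scaled <- stan(
  model_code = Binomial_beta_model,
  data = data_list_scaled,
  iter = 2000,
  chains = 4
)
posterior_pi_scaled <- rstan::extract(fit1_scaled, pars = "pi")$pi
round(c(mean(posterior_pi_scaled),
        quantile(posterior_pi_scaled, probs = c(0.025, 0.975))), 6)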
########## METHOD 3 ##########
# intercept-only logistic regression
# incorporate survey weights via the weights argument
# (as far as I understand, rstanarm treats these like frequency/case weights,
#  i.e. each observation's log-likelihood is multiplied by its weight)
fit2 <- stan_glm(disease_status ~ 1,
                 family = binomial(link = "logit"),
                 weights = survey_weights,
                 prior_intercept = NULL, # flat prior on the intercept
                 iter = 3000,
                 data = data,
                 seed = 123)
print(summary(fit2))
posterior_samples_fit2 <- as.matrix(fit2)
mean_posterior_log_odds <- mean(posterior_samples_fit2[, "(Intercept)"])
credible_interval_log_odds <- quantile(posterior_samples_fit2[, "(Intercept)"], c(0.025, 0.975))
mean_probability <- exp(mean_posterior_log_odds) / (1 + exp(mean_posterior_log_odds))
lower_probability <- exp(credible_interval_log_odds[1]) / (1 + exp(credible_interval_log_odds[1]))
upper_probability <- exp(credible_interval_log_odds[2]) / (1 + exp(credible_interval_log_odds[2]))
third_method <- round(c(mean_probability, lower_probability, upper_probability), 6)
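# A small side note: back-transforming the posterior mean and the interval endpoints of the
# log-odds is not quite the same as summarising on the probability scale. The more usual way
# is to apply the inverse logit to every draw first and then take the mean/quantiles; the
# interval endpoints agree (the inverse logit is monotone), but the means differ slightly
# (Jensen's inequality), although with this much data the difference is negligible.
p_draws_fit2 <- plogis(posterior_samples_fit2[, "(Intercept)"])
round(c(mean(p_draws_fit2),
        quantile(p_draws_fit2, probs = c(0.025, 0.975))), 6)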
########## Summarize results across methods ##########
combined_matrix <- rbind(first_method, second_method, third_method)
rownames(combined_matrix) <- c("First Method", "Second Method", "Third Method")
colnames(combined_matrix) <- c("Mean", "Lower 95% PI", "Upper 95% PI")
print(combined_matrix)
# (a screenshot of the output is attached below, in case you don't want to run the code)
# As shown in the results, the 95% central posterior intervals are extremely narrow.
# This seems to be because the calculation is effectively based on the weighted "population" of ~25,000,
# the majority of which is NOT actually observed.
I have two follow-up questions: (1) are the three methods valid for estimating the proportion of disease in this case? (2) if the survey weights are not included, which method is the best way to estimate the proportion of disease? It seems to me that, without the weights, all three methods are essentially equivalent (see the quick check below).
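As a rough check of that last point, here is the comparison I have in mind: the closed-form posterior from Method 1 without the weights against an unweighted stan_glm() fit. I would expect them to agree closely but not exactly, because a flat prior on the log-odds (Method 3) is not the same prior as a flat Beta(1, 1) on the proportion (Methods 1 and 2), although with 1,000 observations the difference should be negligible.
# unweighted intercept-only logistic regression (Method 3 without the weights argument)
fit2_unweighted <- stan_glm(disease_status ~ 1,
                            family = binomial(link = "logit"),
                            prior_intercept = NULL,
                            iter = 3000,
                            data = data,
                            seed = 123)
p_draws_unweighted <- plogis(as.matrix(fit2_unweighted)[, "(Intercept)"])
unweighted_reference # closed-form Beta(301, 701) summary from the Method 1 comparison above
round(c(mean(p_draws_unweighted),
        quantile(p_draws_unweighted, probs = c(0.025, 0.975))), 6)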
(I am not a statistician and mainly teach myself statistics through reading and discussion, so I may sometimes ask extremely “stupid” questions :) )
In the discussions linked above, many people suggested using multilevel regression and post-stratification (MRP). I will update this thread with results from that approach later.