Hi Guido, thanks for anwering,
Basic summary
Although I am not an expert, let me try to expand the details for you. Briefly, DHS surveys are based on cross-sectional representative samples of individuals and household level evaluation indicators on health, nutrition, and HIV serostatus over time. These surveys use a two-stage sampling design through a set of defined survey locations (primary sample units or PSU) statistically weighted to control for some strata. They are very common in nhanes (CDC), as well as school surveys (PISA).
The weight is important for me for two reasons. First, I am interested in making inferences about the population, not the sample. Second, because, in national datasets, certain subpopulations (e.g., Hispanic, urban/rural, or Asian students) are oversampled to ensure that the sample size is large enough for subgroup analysis. Without using weights, those groups may be overrepresented (or underrepresented) leading to erratic SE for covariates.
I normally perform this pipeline using the survey package (Lumley et al.) in a frequentist approach as:
library(survey)
set.seed(1999)
x â data.frame( psu_variable = sample( 1:3 , 200 , replace = TRUE ), strata_variable = sample( 1:10 , 200 , replace = TRUE ) , weight_variable = sample( 1:2 , 200 , replace = TRUE ) , some_variable = sample( 1:50 , 200 , replace = TRUE ) , outcome=sample( 0:1 , 200 , replace = TRUE ))
letâs suppose, I have a DHS for South Africa
# x
x[ , 'country_name' ] <- 'ZAF'
- design creation:
x_des <- svydesign( ~ psu_variable , strata = ~ strata_variable , data = x , weights = ~ weight_variable , nest = TRUE )
- Descriptive statistics for any variable
svymean( outcome~some_variable , x_des )
- Suppose a logistic regression based on the individual-level covariate
m0.w.svy<-svyglm(outcome~some_variable, design=x_des , family=quasibinomial())
summary(m0.w.svy)
- Finally, I post-stratify to reflect the country relative population (here)
Since I have several covariates that can complement the analysis but they are at Admin1-level resolution (e.g. Malaria incidence, Friction mapsâŚ). My ultimate goal is to add them properly into a brms stack taking care of standard errors and the survey design.
Something like this:
brm(outcome~covariate1+covariate2+malaria+accessibility
+(malaria+accessibilityâŚ|Admin1),
data = your_dataframe,
weights? design=(psu, strata)
prior = bprior,
family = bernoulli(link = âlogitâ),chains=4,warmup=500,
iter = 2000, cores = 4,seed = 2020,inits = â0â,
control = list(adapt_delta = 0.8,max_treedepth=10))
Thanks in advance,
Esteban