Meta-analysis of adjusted prevalence estimates

I am planning a meta-analysis project, in which I will pool surveys that tested for the presence of a disease.

library(tidyverse)
library(brms)

d <- tribble(~study, ~cases, ~population,
             "A", 23, 4651,
             "B", 451, 76420,
             "C", 101, 12456,
             "D", 4, 391,
             "E", 66, 2318) %>%
  mutate(prev = cases/population)

We could model pooled prevalence like this:

m1 <- brm(cases  ~ 1 + (1 |study) + offset(log(population)),
          data = d,
          family = negbinomial())

But each study might additionally report a prevalence estimate weighted to account for sampling bias, and it would make sense to pool those estimates:

d <- d %>%
  mutate(adj_prev = c(0.0061, 0.0049, 0.086, 0.021, 0.035))

I am not sure how I could go about modelling the pooled estimate of adj_prev here whilst accounting for the size of the population in each study.

Thanks!

You could multiply the adjusted prevalence by the population size to obtain an adjusted number of cases, then use that value in place of the observed cases.

What’s missing from your current data and example is an estimate of the uncertainty in the estimated number of cases (observed or adjusted). You should also take the reported standard error from the adjusted prevalence and multiply by the population to obtain an estimated SE for the adjusted number of cases, then pass that into the meta-analysis model like this:

m2 <- brm(adj_cases | se(se_adj_cases)  ~ 1 + (1 |study) + offset(log(population)),
          data = d,
          family = negbinomial())
1 Like

Thanks so much - this is great!

Most studies I am looking at don’t report the standard error of the adjusted prevalence, but do report 95% confidence intervals. So I guess I could calculate the standard error like this.

library(tidyverse)
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.20.4). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#> 
#> Attaching package: 'brms'
#> The following object is masked from 'package:stats':
#> 
#>     ar

d <- tribble(~study, ~cases, ~population,
             "A", 23, 4651,
             "B", 451, 76420,
             "C", 101, 12456,
             "D", 4, 391,
             "E", 66, 2318) %>%
  mutate(prev = cases/population)

d <- d %>%
  mutate(adj_prev = c(0.0061, 0.0049, 0.0086, 0.021, 0.035),
         adj_prev_ci_lo = c(0.0023, 0.0035, 0.060, 0.003, 0.017),
         adj_prev_ci_hi = c(0.0062, 0.0063, 0.11, 0.067, 0.056))

#calculate the adjusted number of cases
d <- d %>%
  mutate(adj_cases = round(adj_prev*population))

#calcuate standard error for adjusted prevalence from 95%CI
d <- d %>%
  mutate(adj_se = (adj_prev_ci_hi-adj_prev_ci_lo)/3.92)

glimpse(d)
#> Rows: 5
#> Columns: 10
#> $ study          <chr> "A", "B", "C", "D", "E"
#> $ cases          <dbl> 23, 451, 101, 4, 66
#> $ population     <dbl> 4651, 76420, 12456, 391, 2318
#> $ prev           <dbl> 0.004945173, 0.005901596, 0.008108542, 0.010230179, 0.0…
#> $ adj_prev       <dbl> 0.0061, 0.0049, 0.0086, 0.0210, 0.0350
#> $ adj_prev_ci_lo <dbl> 0.0023, 0.0035, 0.0600, 0.0030, 0.0170
#> $ adj_prev_ci_hi <dbl> 0.0062, 0.0063, 0.1100, 0.0670, 0.0560
#> $ adj_cases      <dbl> 28, 374, 107, 8, 81
#> $ adj_se         <dbl> 0.0009948980, 0.0007142857, 0.0127551020, 0.0163265306,…

Created on 2024-04-24 with reprex v2.1.0

However, I am still struggling to run the meta-analysis regression model, as below:

#fit the meta-analysis model
m2 <- brm(adj_cases | se(adj_se)  ~ 1 + (1 |study) + offset(log(population)),
          data = d,
          family = negbinomial())
#> Error: Argument 'se' is not supported for family 'negbinomial(log)'.

Created on 2024-04-24 with reprex v2.1.0

Very grateful if you have suggestions for this - I am guessing will need a different distributional family…

Thanks,
Peter

Wouldn’t it make sense to just pool the adjusted prevalance estimates since the uncertainty should take the population size into account? It is a bit tricky, but I would suggest a beta regression for pooling. So, you need to figure out the Beta distribution parameters for each adjusted prevalence case, in the parameterization that brms uses.

This is the parameterization that brms uses: \mu=\frac{a}{a+b} (between 0 and 1), and \phi=a+b (>0). Or in reverse a=\mu \phi, b=\phi(1-\mu). \phi is interpretable as the population size, in your case an adjusted population size.

So, we need to turn the point estimate and the standard error/confidence interval into a mean (which is the point estimate) and into dispersion (\phi). If the interval is symmetrical and relies on the normal approximation we can use your formula for \sigma (standard deviation) i.e. the standard error.

Some other useful identities:
\sigma = \frac{\sqrt{\frac{\alpha \beta}{\alpha+\beta+1}}}{\alpha+\beta}=\sqrt{1-\mu} \sqrt{\frac{\mu}{\phi+1}}

From this follows:
\phi = \frac{\mu}{\sigma^2}-\left(\frac{\mu}{\sigma}\right)^2-1

Then the model would look something like this:

brm(formula = bf(mean ~ 1 + (1|study), phi ~ 0 + offset(derived_phi)), data = dataset, family = Beta(link_phi = "identity")) -> m1

EDIT:

An additional step that need to be done is to derive the mean for the logit-normal for each draw, since the random intercept is assumed to be normally distributed on the logit scale, and that distribution doesn’t have an analytical solution. We can do that with something like this:

# Function to calculate the mean of the logit-normal distribution
mean_logit_normal <- function(mu, sigma) {
  # Compute mean using numerical integration
  # This is the by the way the law of the unconscious statistician (LOTUS)
  # plogis is expit, i.e. inverse logit
  return(integrate(function(x) plogis(x) * dnorm(x, mu, sigma), lower = -Inf, upper = Inf)$value)
}

# Calculate the mean of the logit-normal distribution for each pair of mu and sigma
tibble(mu = m0 %>% as_draws_rvars() %>% pluck("Intercept") %>% draws_of() %>% as.vector(),
       sigma = m0 %>% as_draws_rvars() %>% pluck("sd_study__Intercept") %>% draws_of() %>% as.vector()) %>%
  mutate(mean_Y = map2_dbl(mu, sigma, mean_logit_normal, .progress = T)) -> 
  parameters

parameters$mean_Y %>% rvar()

parameters$mean_Y then contains the posterior draws for the pooled estimate.

Thanks so much! This looks really promising. Just want to check that I am understanding everything correctly, so have tried running this:

library(tidyverse)
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.21.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#> 
#> Attaching package: 'brms'
#> The following object is masked from 'package:stats':
#> 
#>     ar
library(posterior)
#> This is posterior version 1.5.0
#> 
#> Attaching package: 'posterior'
#> The following objects are masked from 'package:stats':
#> 
#>     mad, sd, var
#> The following objects are masked from 'package:base':
#> 
#>     %in%, match

d <- tribble(~study, ~cases, ~population, ~region,
             "A", 23, 4651, "AFR",
             "B", 451, 76420, "AFR",
             "C", 101, 12456, "WPR",
             "D", 4, 391, "WPR",
             "E", 66, 2318, "AFR") %>%
  mutate(prev = cases/population,
         prev100k = prev*100000)

d <- d %>%
  mutate(adj_prev = c(0.0061, 0.0049, 0.0086, 0.021, 0.035),
         adj_prev_ci_lo = c(0.0023, 0.0035, 0.060, 0.003, 0.017),
         adj_prev_ci_hi = c(0.0062, 0.0063, 0.11, 0.067, 0.056))

#calculate the adjusted number of cases
d <- d %>%
  mutate(adj_cases = round(adj_prev*population))

#calcuate standard error for adjusted prevalence from 95%CI
d <- d %>%
  mutate(adj_se = (adj_prev_ci_hi-adj_prev_ci_lo)/3.92)

glimpse(d)
#> Rows: 5
#> Columns: 11
#> $ study          <chr> "A", "B", "C", "D", "E"
#> $ cases          <dbl> 23, 451, 101, 4, 66
#> $ population     <dbl> 4651, 76420, 12456, 391, 2318
#> $ region         <chr> "AFR", "AFR", "WPR", "WPR", "AFR"
#> $ prev           <dbl> 0.004945173, 0.005901596, 0.008108542, 0.010230179, 0.0…
#> $ prev100k       <dbl> 494.5173, 590.1596, 810.8542, 1023.0179, 2847.2821
#> $ adj_prev       <dbl> 0.0061, 0.0049, 0.0086, 0.0210, 0.0350
#> $ adj_prev_ci_lo <dbl> 0.0023, 0.0035, 0.0600, 0.0030, 0.0170
#> $ adj_prev_ci_hi <dbl> 0.0062, 0.0063, 0.1100, 0.0670, 0.0560
#> $ adj_cases      <dbl> 28, 374, 107, 8, 81
#> $ adj_se         <dbl> 0.0009948980, 0.0007142857, 0.0127551020, 0.0163265306,…

#convert adjusted point estimate and standard error into mean and dispersion
d <- d %>%
  mutate(derived_phi = (adj_prev/adj_se^2) - (adj_prev/adj_se)^2 - 1)


#fit the meta-analysis model
m1<- brm(formula = bf(adj_prev ~ 1 + (1|study), phi ~ 0 + offset(derived_phi)), data = d, family = Beta(link_phi = "identity"))

summary(m1)
#> Warning: There were 9 divergent transitions after warmup. Increasing
#> adapt_delta above 0.8 may help. See
#> http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#>  Family: beta 
#>   Links: mu = logit; phi = identity 
#> Formula: adj_prev ~ 1 + (1 | study) 
#>          phi ~ 0 + offset(derived_phi)
#>    Data: d (Number of observations: 5) 
#>   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#>          total post-warmup draws = 4000
#> 
#> Multilevel Hyperparameters:
#> ~study (Number of levels: 5) 
#>               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sd(Intercept)     1.20      0.57     0.46     2.74 1.01      799      932
#> 
#> Regression Coefficients:
#>           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept    -4.34      0.61    -5.53    -3.02 1.00      917     1125
#> 
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).

# Function to calculate the mean of the logit-normal distribution
mean_logit_normal <- function(mu, sigma) {
  # Density function of X
  dx <- function(x) dnorm(x, mean = mu, sd = sigma)
  
  # Transformation function
  transformation <- function(x) 1 / (1 + exp(-x))
  
  # Compute mean using numerical integration
  mean_Y <- integrate(function(x) transformation(x) * dx(x), lower = -Inf, upper = Inf)$value
  
  return(mean_Y)
}

# Calculate the mean of the logit-normal distribution for each pair of mu and sigma
tibble(mu = m1 %>% as_draws_rvars() %>% pluck("Intercept") %>% draws_of() %>% as.vector(),
       sigma = m1 %>% as_draws_rvars() %>% pluck("sd_study__Intercept") %>% draws_of() %>% as.vector()) %>%
  mutate(mean_Y = map2_dbl(mu, sigma, mean_logit_normal, .progress = T)) -> 
  parameters

parameters$mean_Y %>% rvar()
#> rvar<4000>[1] mean ± sd:
#> [1] 0.034 ± 0.034

Created on 2024-04-24 with reprex v2.1.0

So everything runs successfully, but…

  1. Not 100% convinced I have calculated derived_phi correctly.
  2. posterior mean of 0.034 seems too high to be correct to me.

Thanks again for ongoing help - I am learning a lot!

The reason why the mean of the posterior of the logit-normal mean is so high is due to a long tail to the right. Maybe the median is more reasonable to use. Or the inverse-transformed intercept. I am a bit unsure…

> d <- tribble(~study, ~cases, ~population, ~region,
+              "A", 23, 4651, "AFR",
+              "B", 451, 76420, "AFR",
+              "C", 101, 12456, "WPR",
+              "D", 4, 391, "WPR",
+              "E", 66, 2318, "AFR") %>%
+   mutate(prev = cases/population,
+          prev100k = prev*100000,
+          adj_prev = c(0.0061, 0.0049, 0.0086, 0.021, 0.035),
+          adj_prev_ci_lo = c(0.0023, 0.0035, 0.060, 0.003, 0.017),
+          adj_prev_ci_hi = c(0.0062, 0.0063, 0.11, 0.067, 0.056),
+                             adj_cases = round(adj_prev*population),
+          adj_se = (adj_prev_ci_hi-adj_prev_ci_lo)/3.92,
+          derived_phi = (adj_prev/adj_se^2) - (adj_prev/adj_se)^2 - 1)
> m1<- brm(formula = bf(adj_prev ~ 1 + (1|study), phi ~ 0 + offset(derived_phi)), 
+          data = d, 
+          family = Beta(link_phi = "identity"), 
+          backend = "cmdstanr", 
+          cores = 4, 
+          control=list(adapt_delta=.99)) # To get rid of the divergent transitions
Model executable is up to date!
Start sampling
Running MCMC with 4 parallel chains...

Chain 3 finished in 2.5 seconds.
Chain 1 finished in 3.0 seconds.
Chain 2 finished in 3.5 seconds.
Chain 4 finished in 4.6 seconds.

All 4 chains finished successfully.
Mean chain execution time: 3.4 seconds.
Total execution time: 5.1 seconds.

> summary(m1)
 Family: beta 
  Links: mu = logit; phi = identity 
Formula: adj_prev ~ 1 + (1 | study) 
         phi ~ 0 + offset(derived_phi)
   Data: d (Number of observations: 5) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~study (Number of levels: 5) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.21      0.60     0.46     2.76 1.01      618      942

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -4.30      0.68    -5.49    -2.63 1.00      857      825

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
> # Function to calculate the mean of the logit-normal distribution
> mean_logit_normal <- function(mu, sigma) {
+   # Compute mean using numerical integration
+   # This is the by the way the law of the unconscious statistician (LOTUS)
+   # plogis is expit, i.e. inverse logit
+   return(integrate(function(x) plogis(x) * dnorm(x, mu, sigma), lower = -Inf, upper = Inf)$value)
+ }
> # Calculate the mean of the logit-normal distribution for each pair of mu and sigma
> tibble(mu = m1 %>% as_draws_rvars() %>% pluck("Intercept") %>% draws_of() %>% as.vector(),
+        sigma = m1 %>% as_draws_rvars() %>% pluck("sd_study__Intercept") %>% draws_of() %>% as.vector()) %>%
+   mutate(mean_Y = map2_dbl(mu, sigma, mean_logit_normal, .progress = T)) -> 
+   parameters
> parameters$mean_Y %>% rvar()
rvar<4000>[1] mean ± sd:
[1] 0.037 ± 0.047 
> expit <- function (x) {1/(1 + exp(-x))} # since rvars cannot handle qlogis
> # compare to the back-transformed intercept
> expit(as_draws_rvars(m1)$Intercept)
rvar<1000,4>[1] mean ± sd:
[1] 0.018 ± 0.021 
> # Maybe more reasonable??
> # Fairly close to:
> d %>% with(mean(adj_prev))
[1] 0.01512
> #Distribution of effects over all studies 
> (rvar_rng(rnorm, 1, as_draws_rvars(m1)$Intercept, as_draws_rvars(m1)$sd_study__Intercept) %>% expit() -> estimate_dist)
rvar<1000,4>[1] mean ± sd:
[1] 0.037 ± 0.095 
> # We can visualise it all here:
> library(ggdist)
> d %>% mutate(pooled_adj_prev = expit(as_draws_rvars(m1)$r_study+as_draws_rvars(m1)$Intercept), median = NA) %>% 
+   add_row(study = "Pooled estimate?", 
+           pooled_adj_prev = rvar(parameters$mean_Y), 
+           adj_prev = mean(rvar(parameters$mean_Y)), 
+           median = median(rvar(parameters$mean_Y))) %>% 
+   add_row(study = "Back-transformed direct intercept", 
+           pooled_adj_prev = expit(as_draws_rvars(m1)$Intercept), 
+           adj_prev = mean(expit(as_draws_rvars(m1)$Intercept))) %>% 
+   add_row(study = "Distribution of effects over all studies", 
+           pooled_adj_prev = estimate_dist, 
+           adj_prev = mean(estimate_dist)) %>% 
+   mutate(study = factor(study, levels = study) %>% fct_rev()) %>% 
+   ggplot(aes(xdist = `pooled_adj_prev`, y = study))+stat_slab()+
+   geom_point(aes(x = adj_prev))+
+   geom_hline(yintercept = 2.5,linetype="dashed")+
+   geom_errorbarh(aes(xmin = adj_prev_ci_lo, xmax =  adj_prev_ci_hi), height = 0.1)+
+   scale_x_continuous(limits=c(0, 0.13))+
+   geom_point(aes(x = median), 
+              color = "blue")

1 Like

Yes - that makes sense, and I will experiment with the real data. Very grateful indeed!
Peter