Hierarchical model to aggregate summary statistics and estimate a product of proportions ($P1 \times P2$)

I am trying to estimate the proportion of U.S. adults who use chatbots as a “friend.” (That’s not the exact RQ but it’s a good enough stand-in!)

I am estimating this as a product of two probabilities: $P = P1 \times P2$.

  1. P1: The proportion of U.S. adults who use chatbots.
  2. P2: The proportion of chatbot users who use them as a friend.

I do not have raw binomial count data. Instead, I have summary statistics (Means and Standard Errors) from various sources.

  • For P1: I have 3 estimates from unrelated secondary surveys.
  • For P2: I have 4 estimates from my own survey where I asked the question using 4 different phrasings, yielding different results.

I want to aggregate these estimates somehow to find the underlying parameters for P1 and P2, and then propagate the uncertainty to the product P to get a 95% credible interval. What is the best way to do that?

Let’s say the proportions are:

# P1: Proportion of US adults who use chatbots (3 external estimates)
p1_means <- c(0.56, 0.5, 0.45)
p1_ses   <- c(0.01, 0.09, 0.01)

# P2: Proportion of users who use as friend (4 estimates from different question wordings)
p2_means <- c(0.20, 0.24, 0.18, 0.22)
p2_ses   <- c(0.04, 0.05, 0.03, 0.04)

Hi, @arc_circle_square, and welcome to the Stan forums.

You have what’s known as a meta-analysis problem, like @andrewgelman’s favorite example, eight schools. There’s a chapter on this in the Stan User’s Guide, the second section of which has a light introduction to meta-analysis.

In Stan, one way to code this is as follows, which I put into file friends.stan:

data {
  int<lower=0> N1, N2;                   // number of estimates for P1 and P2
  vector<lower=0, upper=1>[N1] p1_mean;  // estimated proportions for P1
  vector<lower=0>[N1] p1_se;             // standard errors for P1
  vector<lower=0, upper=1>[N2] p2_mean;  // estimated proportions for P2
  vector<lower=0>[N2] p2_se;             // standard errors for P2
}
parameters {
  real<lower=0, upper=1> theta1, theta2;  // underlying proportions
}
model {
  p1_mean ~ normal(theta1, p1_se);  // measurement-error likelihood
  p2_mean ~ normal(theta2, p2_se);
  theta1 ~ beta(2, 2);              // weakly informative priors
  theta2 ~ beta(2, 2);
}
generated quantities {
  real<lower=0, upper=1> p_friend = theta1 * theta2;  // the product P
}

and the data file, data.json, looks like this:

{
    "N1": 3,
    "p1_mean": [0.56, 0.5, 0.45],
    "p1_se": [0.01, 0.09, 0.01],
    "N2": 4,
    "p2_mean": [0.20, 0.24, 0.18, 0.22],
    "p2_se":  [0.04, 0.05, 0.03, 0.04]
}

Here’s the fit:

>>> import cmdstanpy as csp

>>> m = csp.CmdStanModel(stan_file='friends.stan')

>>> f = m.sample(data='data.json')

>>> f.summary(sig_figs=2)

           Mean     MCSE  StdDev     MAD      5%    50%    95%  ESS_bulk  ESS_tail  R_hat
lp__     -38.00  0.02500  1.0000  0.7600 -40.000 -38.00 -37.00    1700.0    2200.0    1.0
theta1     0.51  0.00011  0.0071  0.0071   0.490   0.51   0.52    4000.0    2200.0    1.0
theta2     0.20  0.00031  0.0190  0.0190   0.170   0.20   0.24    3900.0    2900.0    1.0
p_friend   0.10  0.00016  0.0100  0.0100   0.086   0.10   0.12    3800.0    2800.0    1.0

The 90% posterior interval for friendship is (0.086, 0.12). Note that the summary reports 5% and 95% quantiles by default, so that’s a 90% interval; for the 95% credible interval you asked about, you can pass percentiles=(2.5, 50, 97.5) to summary().
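As a rough sanity check on the uncertainty propagation, you can reproduce the interval for the product outside of Stan by simulating from normal approximations to the two marginal posteriors, plugging in the means and standard deviations from the summary table above (the normality assumption is mine, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(1234)
n = 1_000_000

# Normal approximations to the marginal posteriors of theta1 and theta2,
# using the posterior means and standard deviations from the summary table
# (assumption: the marginals are close enough to normal for a rough check).
theta1 = rng.normal(0.51, 0.0071, n)
theta2 = rng.normal(0.20, 0.019, n)
p_friend = theta1 * theta2

lo, hi = np.quantile(p_friend, [0.025, 0.975])
print(f"approx. 95% interval: ({lo:.3f}, {hi:.3f})")
```

This ignores any posterior correlation between theta1 and theta2, which is fine here because the two likelihoods share no data.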

I’m not totally happy with this because the means are constrained to (0, 1), but adding +/- standard errors could take you out of support. Luckily, the standard errors here are on a much smaller scale than the means, so that +/- 4 standard errors still doesn’t take you out of support; it should be OK. An alternative would’ve been to collect log odds and unconstrained standard errors.
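If you do go the log-odds route, converting the summary statistics is a one-line delta-method calculation; here’s a sketch using the P1 numbers from the question (the transformed values would then feed a normal measurement model on the unconstrained scale):

```python
import numpy as np

# P1 summary statistics from the question
p1_means = np.array([0.56, 0.50, 0.45])
p1_ses   = np.array([0.01, 0.09, 0.01])

# Delta method: d/dp logit(p) = 1 / (p * (1 - p)), so the standard error
# on the log-odds scale is approximately se / (p * (1 - p)).
logit_means = np.log(p1_means / (1 - p1_means))
logit_ses   = p1_ses / (p1_means * (1 - p1_means))
```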

You could also play around with the priors on theta1 and theta2; I chose weak, but not totally uniform, priors. You could use beta(1, 1) to make them uniform. Everything will stay well behaved because of the small data sizes and the constraints.
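If you want to see what each choice implies for the product before any data arrive, you can push prior draws through the product; this is just a quick Monte Carlo look, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Compare the implied prior on theta1 * theta2 under the two prior choices.
medians = {}
for a, b in [(2, 2), (1, 1)]:
    theta1 = rng.beta(a, b, n)
    theta2 = rng.beta(a, b, n)
    medians[(a, b)] = np.median(theta1 * theta2)
    print(f"beta({a}, {b}) prior implies median product {medians[(a, b)]:.2f}")
```

With data as informative as yours the prior choice barely matters, but this kind of check is cheap insurance.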