Betabinomial and importance sampling

Hi there,
I need advice on modeling my use case. Lets assume you have a population of X items and you have N trials. Each item i from the overall population is selected/sampled to be evaluated with probability p_{i}. We select Y out of X based on those p_{i} values (similar to importance sampling and so we can assume w_{i} = 1 / p_{i}). Now, for each item that we have in the experiment we get a label that could be zero or one. Given that we have received N positive labels out of Y trials, we want to infer the probability of success (seeing positive label) for the overall population of size X.

If all Y items were selected similarly, i.e with probability 1/X, then we simply had a BetaBinomial model, sth like the following and we could say that p would be the success rate for the overall population.

model {
  p ~ Beta(1, 1); 
  N ~ Binomial(Y, p);

My question is that how we can model this and infer the overall success probability given that each item comes with a different weight. Note that the size of data is big (X is in the order of billion and Y is in the order of 100K, so the model should be scalable and sth like PoissonBinomial is not an option)

Thanks in advance for your help

1 Like

I’m confused by your description of the problem. Why is Y not indexed by i? If N the number of trials, then what does


If you can write you model down in mathematical form, that would be quite helpful.

Thanks for looking at my question. Let me give the following notations:
X: the set of all items in our population, |X|: size of population
Y: the subset of items that we selected from X and send for evaluation, |Y|: size of evaluation set

We also know that

  1. |Y| << |X|
  2. each item i is sampled from X and put into Y with probability p_{i}. In other words, we can say that item i is placed in Y as a representative of w_{i} = 1/ p_{i} other items in X
  3. N out of |Y| is evaluated positive

as an example lets assume we want to find the rate of breast cancer in the whole world, so X would be the whole word.
We select a person from a specific country with probability p_{i} such that w_{i} roughly shows the size of that country.
Y would be the set of persons that we selected for evaluation
N is the number of them diagnosed with cancer after evaluation

That is what I have tried so far: I bucketed set Y based on some features and then tried to find the cancer rate per bucket as follow:
assume that W_a is the sum of w_{i} for all items in bucket a
assume N_a is the num of positive cases in bucket a
assume |Y_a| is the size of all patients in bucket a

g ~ Beta(1, 1) 
t ~ Beta(g * W_a , (1-g) * W_a) 
N_a ~ Binomial( |Y_a|, t)

However, the results are pretty weird/incorrect. My question is how i should incorporate the w_{i} information.

Is the below a fair summary of the problem?

The members of some large population (like people on Earth) can be tagged with exactly one of a moderate number of labels (like the countries they live in), thereby allocating members to one of a moderate number of groups. We know a priori what fraction of the total population belongs to each group. We sample members according to some procedure that randomizes sampling within groups, but not necessarily across groups (there might be some arbitrary stratification procedure across groups, or the sample might be a convenience sample that is nonrandom with respect to the grouping; the details don’t matter to this answer).

Now the question is, given that we don’t have an iid random sample of members from the entire population, how can we estimate a population-level parameter?

To obtain this, you first need to get a posterior distribution for the group-specific means. Perhaps a random intercepts model would be useful. Then, you need to take the weighted mean of these group means. The necessary weights have nothing to do with the procedure that you used to select individuals for measurement. Rather, they are the relative sizes of each group in the population (not the relative sizes in the sample).

The general name for this technique is poststratification, and the Stan users guide has a nice chapter on doing poststratification with Stan: 29 Poststratification | Stan User’s Guide

Thanks for you replay @jsocolar and the pointer. There is one point that I have difficulty getting it. In my use case, when I use binomial, like what is said here, not all the trials are the same. I stratify/bucketize items based on some features that have nothing to do with item weights (i.e w_{i}), then in each bucket I use Binomial. However not all observed successes are the same in one bucket. why? let me explain.

Lets assume you have two items in one strata/bucket. one evaluated as success and has w_{i} = 10K and one as failuare and has w_{i} = 100K. When we apply Binomial here, we should incorporate those weights here I believe since it is very naive to say that 1 out of two trials are success. Also you can not say that 10K out of (10K + 100K) is success because you have not evaluated that much of items.

You may say that I should stratify based on weights, but in my use case it really does not make sense that much and that is why I stratify them based on some other features.

So, I know how to combine results over all buckets, but I do not know how modeling per bucket should be done assuming that a strata/bucket might be a collection of subgroups where each group comes with a different w_{i}

OK, let’s back up a bit. As you say, |X| is the size of the population. Let Y_i be an indicator variable that says whether individual i is sampled or not, i.e. Y_i \in \{0, 1\}. If we select an individual based on a probability p_i, then Y_i \sim \operatorname{Bernoulli}(p_i). Note that \sum_{i=1}^{|X|} Y_i = |Y|.

Now let Z_i \in \{0, 1\} be an indicator variable of whether individual i once selected (or not) will be positive or not. Then

Z_i | Y_i = \begin{cases} 0, \textrm{if}\: Y_i = 0,\\ \operatorname{Bernoulli}(\theta), \textrm{if} \: Y_i = 1. \end{cases}

In turn, this means that

\begin{align} \operatorname{Pr}(Z_i = 1) &= \sum_{k=0}^1 \operatorname{Pr}(Z_i = 1| Y_i = k) \operatorname{Pr}(Y_i = k), \\ &= 0\cdot (1-p_i) + \theta p_i,\\ &= \theta p_i. \end{align}

Now, if you know what the p_i are exactly and you have disaggregated data, you can estimate the ‘positivity’, \theta. This encompasses the case where p_i can assume one of J values, where J is the number of buckets or groups.

1 Like

I think there are a couple of important strands to this conversation, and ultimately the question still isn’t fully clear.

First a minor clarification of @maxbiostat’s previous post. Z_i is the probability that an individual will be selected and positive, not the probability of being positive once selected (i.e. conditional on Y_i=1).

Second, I am confused about where we are assuming that variation in positivity exists. Are we estimating a single positivity \theta that we assume is constant across all regions/buckets/whatever? Or are we assuming that \theta_i is itself variable by stratum? And crucially, what strata are assumed to have variable \theta?

I also feel lost in the language of strata, buckets, and weights. Let’s go back to the breast cancer example as a concrete example to make sure that we are talking about the same stuff.

Suppose we want to know the fraction of people in the world with breast cancer, and we believe breast cancer rates are variable by country. Then we collect data, do inference on breast cancer rates within countries, and then post-stratify using weights that reflect the relative populations of each country.

Now it seems that you need to introduce a second complication, which is that even within countries, we don’t have iid random samples of individuals. Rather, we have individuals sampled according to some set of weights p_i. A crucial question is whether those sampling weights are related to the probability of having breast cancer. If they are not, then we can ignore the weights completely. But if they are, then we need to account for the resulting systematic bias.

If the weights p_i are related to membership in some other grouping that is related to breast cancer incidence (maybe age group, for example), then we need to post-stratify by that grouping as well. And in fact, we need to post-stratify by the interaction of country and age group. To do that, we either need a very large data volume (such that each country-by-age is well informed), or we need a decent statistical model that can allow us to predict to country-by-age categories that are data poor (or even missing from the dataset).

1 Like

Thanks @maxbiostat for your reply. The thing that I should emphasize is that we can not work with individual items sine dataset is super big and that is why I tried to squeeze observations by using Binomial instead of benrnoulli. In your write-up, I assume, we should infer one \theta per item which is not feasible.

Remember that if Y_i = 1 then item i comes as (a_1, …, a_n, w_i) → Z_i = 1 / 0 where a_1…a_n are the features and w_i is the representative weight of that item

I probably wasn’t as clear as I should in my post.

Correct. I’ve already updated my post.

So am I, but the point of my calculation was to show that even if positivity is the same across strata, unless you know the p_i exactly, you’ll have identifiability issues.

This is the whole debate about weights in surveys. @lauren, @bgoodri , @jonah and @Guido_Biele have participated in a number of discussions on this forum on what the appropriate way to incorporate weights is.

1 Like

Now it seems that you need to introduce a second complication, which is that even within countries, we don’t have iid random samples of individuals. Rather, we have individuals sampled according to some set of weights p_i.

that is true. In the breast cancer toy example, it means that individuals in each country are sampled differently and this difference is reflected in w_i. The calculation of w_i is a blackbox model and we only know that the bigger w_i the more people can be represented by that sample.

A crucial question is whether those sampling weights are related to the probability of having breast cancer. If they are not, then we can ignore the weights completely. But if they are, then we need to account for the resulting systematic bias.

we really don’t know whether sampling weight w_i will bias the final estimate or not. By final estimate I am referring to what we output after post stratification over the whole population.

So far, I have assume Binomial per bucket and then did post stratification to output the estimate for the whole population. But now I feel that because of not incorporating w_{i} in inference, the results would be biased.

By definition if the response is the same iid response irrespective of whether or not we condition on the weights, the generative model for the data observed in a weighted sample is identical to the generative model for the data from an unweighted sample. However, increasingly uneven weights can inflate minor deviations from iid into serious downstream biases.

I’ve lost track of what “bucket” means in this context, and whether it pertains to a grouping that is associated with the response, associated with the weights, or both.

I’ve lost track of what “bucket” means in this context, and whether it pertains to a grouping that is associated with the response, associated with the weights, or both.

in my use case, I do group all the items in set Y into buckets/strata based on a combination of features a_1,…, a_n. Then I use binomial, as the likelihood, in each bucket to map the num of success and the number of items in the bucket. so, bucketing is based on some features but it is completely agnostic to w_i.

for post-stratification, I bucketize the whole population similar to what I did for Y and estimate the overall success rate.

So, in my current model, w_i is completely ignored but I think I should find a way to have it as part of training.

How are the w_i chosen/generated?

they are coming from a blackbox model, a very complex pipeline that is not presented to us. As I mentioned earlier, we only know:

  1. the bigger w_{i}, the more items (people in the cancer example) can be represented by that single sample.
  2. we have this piece of info only for items that are in set Y, i.e those selected for actual evaluation

In that case I think it’s plausible to worry that w_i might be correlated to the probability that items test positive. But given that you don’t know the w_i for items not actually sampled, I’m at a loss to recommend how to use this information to yield better inference about the items that were not actually sampled.

@jsocolar I wonder why do we need w_i for items that we have not samples? How does it help?

Note that the population is super big and in the post stratification I can simply find the fraction that fall into each strata (knowing that each item has weight = 1) to calculate the final estimate.

Because we’re worried that w_i is actually related to the response (if it is genuinely completely unrelated to the response, it can safely be ignored). We observe the response for some individuals, and we need a statistical model for the response of the others. But since we don’t know w_i for the others, it becomes very challenging to leverage w_i to help us predict the response.

hmm, not quite sure. I see Y as the squeezed version of the whole population and in the post stratification we give weight = 1 to every single item from population and calculate the fraction of population that fall into each bucket. Note that, we infer rate per bucket and we implicitly assume that all items in a bucket have the same success rate which is used in the Binomial likelihood. right?

If that assumption is correct–that all items in a bucket are iid irrespective of their weights, then the weights don’t matter. If every item in the bucket is Bernoulli distributed with identical probability, then the generative model for a sample biased towards high weights will be strictly identical to the generative model for a sample biased towards low weights or for a random sample that doesn’t depend on the weights. I could choose a deterministic rather than stochastic procedure for selecting individuals (all weights equal to either one or zero), and if the data generating process is genuinely totally independent of my selection procedure I will get proper inference about the (within-bucket) population rate.

I found this solution from Gelman’s paper under “Accounting for Survey weights”, Eq 6. Wanted to collect feedback on the \text{design-effect} parameter defined in the paper. Could you help me understand the rational behind this definition of \text{design-effect} if possible @jsocolar and @maxbiostat ? the rest of the calculation in the paper makes sense to me.

y_{j}: the outcome/label in strata j, so it is zero or one
\bar{y_{j}}: the weighted average of y_{j} in strata j
n_{j}: the number of outcomes in strata j
\text{design-effect} : defined in the linked paper

so, incorporting the weights changes the likelihood to the following:

(\bar{y_{j}} * n_{j}) \sim Binomial(n_{j} / \text{design-effect}, \theta{j})
where \theta{j} is the estimated prevalence for strata j that we may put a prior on that.


1 Like