Modeling hierarchical binomial parameters with censored observations

I’m looking for help wrapping my head around how to model this particular problem. Here’s a statement of it.

Suppose I have J groups, indexed by j. Group j has size n_j. Each group member votes either yes or no on some measure. The number of “Yes” votes, y_j, can be modeled as binomial with unknown success probability \theta_j. We can write:

y_j \vert n_j, \theta_j \sim \mathrm{Binomial}(n_j, \theta_j)

We also assume that n_j and \theta_j are randomly drawn from distributions with hyperparameters. Just for simplicity, we’ll assume:

\theta_j \sim \mathrm{Beta}(\alpha, \beta) \qquad n_j \sim \mathrm{Normal}(\mu, \sigma)

So far so good, but the catch is that we don’t directly observe either y_j or n_j. Instead we observe the following quantities:

s_j = \max(2 y_j - n_j, 0) \qquad \pi_j =\frac{y_j}{n_j}

Here s_j is the margin of Yes votes over No votes, y_j - (n_j - y_j) = 2 y_j - n_j, censored at 0, so any negative margin is recorded as 0. \pi_j is the proportion of votes that are “Yes”.

This means that if s_j is positive, I can solve the system of equations with \pi_j to recover (y_j, n_j), but if it’s 0, then I don’t have enough information to do so. I’d like to estimate the hyperparameters (\alpha, \beta, \mu, \sigma) as well as estimate (y_j, n_j) for the cases where s_j = 0. Without the censoring, I have a good idea where to go, but those censored observations are throwing me for a loop.
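
For reference, the recovery step when s_j > 0 can be written out explicitly. Substituting y_j = \pi_j n_j into s_j = 2 y_j - n_j gives s_j = (2 \pi_j - 1) n_j, so

n_j = \frac{s_j}{2 \pi_j - 1} \qquad y_j = \pi_j n_j = \frac{\pi_j s_j}{2 \pi_j - 1}

This is well defined because s_j > 0 implies \pi_j > 1/2. For example, s_j = 4 and \pi_j = 0.7 give n_j = 4 / 0.4 = 10 and y_j = 7.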

Appreciate any help!

I suspect you’ll need more information or more assumptions.

Like it’s pretty easy to get in a situation where s_j is basically all zero and this doesn’t seem good:

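> n = 10                        # group size
> p = 0.2                       # probability of a Yes vote
> y = rbinom(1000, n, p)        # Yes counts for 1000 simulated groups
> sum(pmax(2 * y - n, 0) > 0)   # number of groups with s > 0 (uncensored)
[1] 5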
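
The simulated count can be checked against the exact tail probability, since s > 0 happens exactly when y > n/2. A minimal sketch, where `p_uncensored` is a hypothetical helper name (not something from the thread):

```r
# Probability that one group's margin is uncensored:
# P(2 * y - n > 0) = P(y > n / 2) with y ~ Binomial(n, p).
# `p_uncensored` is an illustrative name, not an existing function.
p_uncensored <- function(n, p) {
  # With lower.tail = FALSE, pbinom returns P(y > q)
  pbinom(floor(n / 2), size = n, prob = p, lower.tail = FALSE)
}

p_uncensored(10, 0.2)          # about 0.0064
1000 * p_uncensored(10, 0.2)   # expected uncensored groups out of 1000, about 6.4
```

So seeing 5 uncensored groups out of 1000 is in line with what the binomial tail predicts for these values.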

Do you know things about n and p that make you think this is possible? Probably the way to start on this is to simulate fake data and see if you can make plots to try to back y and n out.

When you say n_j is normally distributed, does that mean we are making a continuous approximation to the count parameter there? Could definitely make sense there if we want to impute n_j, but just wanted to make that clear.

Appreciate the response. In the observations that I do have, roughly 10% have s_j = 0, so I’m taking that as reason to believe that I’m not in the regime where most observations are 0.

The fake data simulation suggestion is spot on, that’ll be a good place to dig in.

As for n_j, stating that it was a normal distribution was me rather hastily saying that it’s drawn from some distribution. It is a count, so a Poisson distribution would probably have been a better characterization.

As for the censored data – is there any shortcut way for me to record that sort of information short of working out the log posterior distribution by hand and give that to Stan?

Update: I took @bbbales2’s suggestion to simulate some fake data. I set some basic hyperparameters and generated a small data set.
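
A minimal sketch of what such a fake data set could look like. The hyperparameter values (`J`, `alpha`, `beta`, `lambda`) are made up for illustration, and the group sizes are drawn from a Poisson distribution, which is an assumption rather than part of the original problem statement:

```r
set.seed(1)

# Hyperparameters: illustrative values, not taken from the thread
J      <- 200
alpha  <- 8
beta   <- 3
lambda <- 50

theta <- rbeta(J, alpha, beta)   # per-group probability of a Yes vote
n     <- rpois(J, lambda)        # group sizes (Poisson is an assumption)
keep  <- n > 0                   # drop empty groups so that y / n is defined
theta <- theta[keep]
n     <- n[keep]
y     <- rbinom(length(n), size = n, prob = theta)

# The quantities that are actually observed
s    <- pmax(2 * y - n, 0)
pi_j <- y / n

censored <- s == 0
mean(censored)                   # share of censored groups

# For uncensored groups, (y, n) can be recovered from (s, pi_j)
n_rec <- round(s[!censored] / (2 * pi_j[!censored] - 1))
y_rec <- round(pi_j[!censored] * n_rec)
stopifnot(all(n_rec == n[!censored]), all(y_rec == y[!censored]))
```

Because the true `theta`, `n`, and `y` are kept around, any fitted model can be compared against them afterwards.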

What ended up working as a good strategy was to split my data into “censored” and “uncensored” data. My uncensored data, by definition, had s_j > 0. From here I was able to write down a transformed likelihood reparameterizing from (y_j, n_j) \to (s_j, \pi_j). I input my log-likelihood on the uncensored data directly.
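
One way to read this step: when s_j > 0 the map from (y_j, n_j) to (s_j, \pi_j) is one-to-one on integer counts, so the probability of an observed (s_j, \pi_j) is just the probability of the recovered (y_j, n_j), with no Jacobian term. The sketch below assumes a Poisson distribution for n_j and takes \theta_j and \lambda as given; `loglik_uncensored` is a hypothetical name, and this is not necessarily the exact likelihood used here.

```r
# Log-likelihood contribution of uncensored records (s > 0), given per-group
# theta and a Poisson rate lambda. `loglik_uncensored` is an illustrative name.
loglik_uncensored <- function(s, pi_j, theta, lambda) {
  stopifnot(all(s > 0), all(pi_j > 0.5))
  n <- round(s / (2 * pi_j - 1))   # invert s = (2 * pi - 1) * n
  y <- round(pi_j * n)             # invert pi = y / n
  sum(dbinom(y, size = n, prob = theta, log = TRUE) +
      dpois(n, lambda, log = TRUE))
}

# Example: y = 7 Yes votes out of n = 10 gives s = 4 and pi = 0.7
loglik_uncensored(s = 4, pi_j = 0.7, theta = 0.6, lambda = 10)
```

The rounding only guards against floating-point error in the inversion. It assumes \pi_j is recorded at full precision rather than as a rounded percentage.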

As a note, I used n_j \sim \mathrm{Poisson}(\lambda) instead of the normal distribution, but I imagine it could have worked similarly.

For the censored data, I only had one data point \pi_j for each record. What I did here was define \tilde{y}_j as a parameter to be fit, and defined a log-likelihood where:

\tilde{y}_j \sim \mathrm{Binomial}(\tilde{y}_j / \pi_j, \theta_j)
\tilde{y}_j / \pi_j \sim \mathrm{Poisson}(\lambda)
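
Since y_j and n_j are integers, an alternative to treating \tilde{y}_j as a continuous parameter is to enumerate the integer pairs consistent with the observed \pi_j and weight them by the same Binomial and Poisson terms. This is a swapped-in technique, not what was done above. In the sketch, `censored_candidates`, `n_max`, and `tol` are hypothetical names, and \theta_j and \lambda are treated as known purely for illustration.

```r
# Enumerate integer (y, n) pairs consistent with an observed proportion pi_j
# for a censored record, and weight them by Binomial(y | n, theta) * Poisson(n | lambda).
# All names here are illustrative.
censored_candidates <- function(pi_j, theta, lambda, n_max = 500, tol = 1e-8) {
  n  <- seq_len(n_max)
  y  <- round(pi_j * n)
  ok <- abs(y / n - pi_j) < tol & 2 * y <= n   # reproduces pi_j and is censored
  stopifnot(any(ok))                           # need at least one candidate
  n  <- n[ok]
  y  <- y[ok]
  lw <- dbinom(y, size = n, prob = theta, log = TRUE) +
        dpois(n, lambda, log = TRUE)
  w  <- exp(lw - max(lw))                      # subtract max for numerical stability
  data.frame(n = n, y = y, weight = w / sum(w))
}

cand <- censored_candidates(pi_j = 0.4, theta = 0.4, lambda = 10, n_max = 40)
cand[which.max(cand$weight), ]   # most probable (n, y) pair under these values
```

With \pi_j = 0.4 the only candidates are multiples of (y, n) = (2, 5). If \pi_j is only available as a rounded percentage, `tol` would need to be loosened and more pairs would qualify.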

This setup worked well for my fake data set: the posteriors were centered near the true values of the hyperparameters, and quantities like the censored vote counts were estimated rather well.

Conceptually, I was essentially letting the uncensored data estimate the hyperparameters fairly well, and then the censored data was able to get reasonable estimates off of that and the \pi_j data that was available.

I’m sure I’ll run into more challenges when running on real data/tackling some model expansions that I want to try.


Glad it worked!

One nice feature of simulated data is that you know the truth, so it’s easier to tell whether you got the answer right.

The “Truncated or Censored Data” chapter of the Stan User’s Guide has some stuff in it that might be relevant.