# Binomial measurement error model

I would like to fit a binomial regression model with measurement error in both the numerator and the denominator. The back-story is that the data come from a survey that has been adjusted (using sampling weights) to produce the counts and denominators. I do not have access to the original data, just the (adjusted) counts and denominators, and the margin of error for each one. The MOE can be converted to a standard error.

My first question is a Stan model specification question. The Stan User's Guide gives an example of fitting a Bayesian measurement error model, but the key difference here is that, since this is a binomial model, the outcome and denominator counts have to be integers. I tried specifying that model as follows:

```stan
data {
  ...
  vector[N] Y_sd;
  vector[N] trials_sd;
}
parameters {
  ...
  int<lower=0> Y_true[N];
  int<lower=0> trials_true[N];
}
model {
  ...
  Y_obs ~ normal(Y_true, Y_sd);
  trials_obs ~ normal(trials_true, trials_sd);
}
```

There are two (potential) problems with this specification:

1. I got an error message: "(Transformed) Parameters cannot be integers." How do I declare the "true" values of Y and trials and force them to be integers if I can't declare them as integers?
2. Even if I'm able to declare them as integers, am I allowed to specify the measurement error model as normal when they are integers? I'm guessing not. What would be a more appropriate distribution here? Keep in mind that I really have no idea about the details of how the adjusted counts and denominators were generated; I was just provided with them and the MOE (which I was told was the width of a 90% CI).

My other question is whether there is any way to express this model in brms, but I think that might be a little ambitious.

This does not seem particularly difficult to model, but can you clarify what your data looks like? Your description suggests that you only have an (adjusted) aggregate ratio of a (binary?) response variable, whereas the `[N]` suffix on your data seems to denote individual survey responses. In the latter case, the best way to go about this would be to model response probabilities; in the former scenario I do not understand the modelling task, as that would more or less be a single data point.

Sorry for not being clear, and thanks for looking at this problem.

I do not have access to the original survey data. What I have is a data frame with one row for every observation (a demographic cell), and the following variables:

• E_num (the estimated numerator)
• SE_num (the standard error for that estimate)
• E_den (the estimated denominator)
• SE_den (the standard error for that denominator)
• Other variables defining the demographic cell (like age group, gender, county, etc.)

The estimates and SEs were generated by another organization, and are based on applying some sort of sampling weights to the raw survey data (which, again, we don't have access to). All of the estimates (both numerator and denominator) are integers (presumably rounded).

Thanks!

You could try to retain the integer constraint by nesting binomial or Poisson distributions, but I doubt that would yield a tractable model. Here is how I would go about it:

```stan
parameters {
  vector<lower=0>[N] trials_true;
  vector[K] beta;
}
transformed parameters {
  vector[N] theta = inv_logit(X * beta);
}
model {
  trials_obs ~ normal(trials_true, trials_sd);
  Y_obs ~ normal(trials_obs .* theta, Y_sd);
}
generated quantities {
  vector[N] Y_true = trials_obs .* theta;
}
```

The key thought here is that instead of modelling

```stan
Y_obs ~ normal(binomial(trials_obs, theta), Y_sd);
```

I've used the convergence of the binomial to the normal distribution, which holds given a reasonably large number of individuals in each demographic cell. This will of course yield real-valued estimates of Y_true and trials_true. If you intend to embed these values into a larger model structure that requires them to be natural numbers, it would be very relevant to know more about that framework in order to make appropriate choices in the above.

As a side note, it is important to rescale Y_sd and trials_sd so that the reported 90% MOEs are converted to the standard deviations of the respective normal distributions; for a normal-approximation interval, that means dividing the interval's half-width by the corresponding normal quantile (about 1.645).
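That conversion can be sketched in a few lines. A hedged example (the "width of a 90% CI" mentioned above is ambiguous between half-width and full width, so the function takes both; `moe_to_se` is a made-up name, and the normal-interval assumption should be checked with whoever produced the estimates):

```python
from statistics import NormalDist

def moe_to_se(moe, level=0.90, full_width=False):
    """Convert a reported margin of error to a standard error,
    assuming the MOE comes from a normal-approximation CI.
    If full_width is True, moe is the total width of the interval
    and is halved first; otherwise it is treated as the half-width."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # ~1.645 for 90%
    half = moe / 2 if full_width else moe
    return half / z

# e.g. a reported 90% MOE of 250 (assumed to be the half-width):
se = moe_to_se(250)  # ~152.0
```

If the provider's "width" turns out to be the full interval width, the same call with `full_width=True` halves it first, so the two conventions differ by exactly a factor of two.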

However, one concern is that we actually do have some pretty small cells, and most of the probabilities are expected to be quite low (a number of the `Y_obs` are zero). Wouldn't that be exactly the type of situation where the normal approximation doesn't hold?

Do you have a solution that doesn't rely on the normal approximation?

Hm, I don't know a good solution to that. For small Y, the binomial is better approximated by a Poisson, which you could try to impose on trials_obs and Y_obs. The issue is that, just like the binomial, the Poisson has a single parameter, so it does not lend itself to parametrizing the mean and variance separately. Hacking around that with something like Y_obs ~ (Poisson(lambda) - lambda) * Y_sd + lambda doesn't work either, because the Poisson only has support on the non-negative integers.
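The small-cell concern is easy to verify numerically: for a small cell with a rare outcome, the binomial piles most of its mass at zero while the moment-matched normal puts substantial probability on negative counts. A quick stdlib-only check (the n and p values here are invented for illustration):

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Exact binomial pmf via the stdlib."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.05                         # small cell, rare outcome
mu = n * p                              # moment-matched normal mean
sd = math.sqrt(n * p * (1 - p))         # moment-matched normal sd

mass_at_zero = binom_pmf(0, n, p)       # ~0.599: over half the binomial mass
neg_mass = NormalDist(mu, sd).cdf(0.0)  # ~0.23: normal mass on negative counts
```

With over half the true distribution sitting on a single point and nearly a quarter of the approximating normal's mass below zero, the approximation clearly breaks down in this regime.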

Please do post an update if you find a satisfactory solution!

Since you want a binomial likelihood and using standard errors, would modeling the observations as binomial, and the latent parameters within that as theta work for you?

```stan
data {
  int<lower=0> N;                   // number of observations
  array[N] int<lower=0> Y_obs;      // observed success counts
  array[N] int<lower=0> trials_obs; // observed trial counts
  array[N] real<lower=0> Y_sd;      // standard errors for Y
  array[N] real<lower=0> trials_sd; // standard errors for trials
}
parameters {
  vector<lower=0>[N] trials_latent;                // latent continuous values for trials
  vector<lower=0,upper=trials_latent>[N] Y_latent; // latent continuous values for Y
}
model {
  Y_obs ~ normal(Y_latent, Y_sd);
  trials_obs ~ normal(trials_latent, trials_sd);
  vector[N] theta = Y_latent ./ trials_latent;
  Y_obs ~ binomial(trials_obs, theta);
}
```
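Whichever specification you settle on, it is worth checking it against simulated data before fitting the real thing. A hedged sketch of the data-generating process this thread assumes (all numbers and the `simulate_cell` name are invented for illustration; the rounding step mirrors the integer estimates described above):

```python
import random

random.seed(1)  # reproducibility

def simulate_cell(trials_true, theta, trials_sd, y_sd):
    """Simulate one demographic cell under the assumed DGP:
    a binomial outcome on the true denominator, then rounded
    additive normal measurement noise on both reported counts."""
    y_true = sum(random.random() < theta for _ in range(trials_true))
    trials_obs = max(round(random.gauss(trials_true, trials_sd)), 0)
    y_obs = max(round(random.gauss(y_true, y_sd)), 0)
    return y_obs, trials_obs

# e.g. a cell of 500 people with a 10% rate and modest measurement noise
y_obs, trials_obs = simulate_cell(500, 0.10, 25.0, 5.0)
```

Fitting a candidate Stan model to many such simulated cells and checking whether it recovers theta (and the true counts, if modeled) is a cheap way to compare the specifications proposed in this thread.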

@ssp3nc3r Interesting idea, but we also need to incorporate covariates. i.e.,

```stan
theta = inv_logit(X * beta);
```

I don't think your specification allows for this?

`theta ~ beta_proportion(inv_logit(X * beta), kappa);`

Is this possible when the model already has theta fixed as:

```stan
vector[N] theta = Y_latent ./ trials_latent;
```

?

Also, in general it just seems odd to model the observed data as binomial, when it's really the "true" counts that are part of the DGP. Maybe this is the best we're going to do, though.

In my example, theta isn't fixed; its elements are parameters.

You have several options for modeling, but first you need to know what parameters you are interested in.

This isn't possible in Stan. You can't literally translate the continuous model to the discrete setting, because Stan doesn't support discrete parameters.

You could

1. do what @mippeqf suggests and use continuous approximations.
2. Alternatively, if the max values of the binomial are small, you can marginalize out the latent true values.
3. Use an alternative software package that allows discrete parameters, like NumPyro or PyMC in Python or NIMBLE in R. (At least I think these would all support the direct translation of the missing data model.)
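Option 2 above can be illustrated outside Stan: for one cell, sum a binomial likelihood over every feasible pair of integer "true" counts, weighting each by the normal measurement densities. A hedged sketch (the function names and the flat, improper prior over `trials_true` are simplifications; in Stan the double sum would become a nested `log_sum_exp`, and the O(n_max^2) cost is why this only works when the counts are small):

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Exact binomial pmf via the stdlib."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def marginal_lik(y_obs, trials_obs, theta, y_sd, trials_sd, n_max):
    """Likelihood of the noisy observed counts for one cell, with the
    integer 'true' counts marginalized out over a finite grid 0..n_max.
    Uses a flat (improper) prior on trials_true for simplicity."""
    total = 0.0
    for n_true in range(n_max + 1):
        p_n = NormalDist(n_true, trials_sd).pdf(trials_obs)
        for y_true in range(n_true + 1):
            p_y = NormalDist(y_true, y_sd).pdf(y_obs)
            total += p_n * binom_pmf(y_true, n_true, theta) * p_y
    return total

# a plausible theta should score higher than an implausible one
lik_low = marginal_lik(3, 20, theta=0.15, y_sd=1.0, trials_sd=2.0, n_max=40)
lik_high = marginal_lik(3, 20, theta=0.90, y_sd=1.0, trials_sd=2.0, n_max=40)
```

Evaluating this marginal likelihood over a grid of theta values (or embedding the equivalent `log_sum_exp` in a Stan model block) keeps the counts genuinely integer-valued without requiring discrete parameters.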