# Binomial measurement error model

I would like to fit a binomial regression model with measurement error in both the numerator and the denominator. The back-story is that the data come from a survey that has been adjusted (using sampling weights) to produce the counts and denominators. I do not have access to the original data, just the (adjusted) counts and denominators, and the margin of error for each one. The MOE can be converted to a standard error.

My first question is a Stan model specification question. The Stan User's Guide gives an example of fitting a Bayesian measurement error model, but the key difference here is that, since this is a binomial model, the outcome and denominator counts have to be integers. I tried specifying that model as follows:

```stan
data {
  ...
  vector[N] Y_sd;
  vector[N] trials_sd;
}
parameters {
  ...
  int<lower=0> Y_true[N];
  int<lower=0> trials_true[N];
}
model {
  ...
  Y_obs ~ normal(Y_true, Y_sd);
  trials_obs ~ normal(trials_true, trials_sd);
}
```

There are two (potential) problems with this specification:

1. I got an error message: "(Transformed) Parameters cannot be integers." How do I declare the "true" values of Y and trials and force them to be integers if I can't declare them as integers?
2. Even if I'm able to declare them as integers, am I allowed to specify the measurement error model as normal when they are integers? I'm guessing not. What would be a more appropriate distribution here? Keep in mind that I really have no idea about the details of how the adjusted counts and denominators were generated; I was just provided with them and the MOE (which I was told was the width of a 90% CI).

My other question is whether there is any way to express this model in brms, but I think that might be a little ambitious.

This does not seem particularly difficult to model, but can you clarify what your data looks like? Your description suggests that you only have an (adjusted) aggregate ratio of a (binary?) response variable, whereas the `[N]` suffix on your data seems to denote individual survey responses. In the latter case, the best way to go about this would be to model response probabilities; in the former scenario I do not understand the modelling task, as that would more or less be a single data point.

Sorry for not being clear, and thanks for looking at this problem.

I do not have access to the original survey data. What I have is a data frame with one row for every observation (a demographic cell), and the following variables:

• E_num (the estimated numerator)
• SE_num (the standard error for that estimate)
• E_den (the estimated denominator)
• SE_den (the standard error for that denominator)
• Other variables defining the demographic cell (like age group, gender, county, etc.)

The estimates and SEs were generated by another organization, and are based on applying some sort of sampling weights to the raw survey data (which, again, we don't have access to). All of the estimates (both numerator and denominator) are integers (presumably rounded).

Thanks!

You could try to retain the integer constraint by nesting binomial or Poisson distributions, but I doubt that would yield a tractable model. Here is how I would go about it:

```stan
parameters {
  vector<lower=0>[N] trials_true;
  vector[K] beta;
}
transformed parameters {
  vector[N] theta = inv_logit(X * beta);
}
model {
  trials_obs ~ normal(trials_true, trials_sd);
  Y_obs ~ normal(trials_obs .* theta, Y_sd);
}
generated quantities {
  vector[N] Y_true = trials_obs .* theta;
}
```

The key thought here is that instead of modelling

```stan
Y_obs ~ normal(binomial(trials_obs, theta), Y_sd);
```

I've used the convergence of the binomial to the normal distribution, which holds given a reasonably large number of individuals in each demographic cell. This will of course yield real-valued estimates of Y_true and trials_true. If you intend to embed these values into a larger model structure that requires them to be natural numbers, it would be very relevant to know more about that framework in order to make appropriate choices in the above.

As a side note, it is important to rescale Y_sd and trials_sd so that the reported 90% MOEs are converted to the standard deviations of the respective normal distributions; for a normal-approximation interval, that means dividing the interval's half-width by the corresponding normal quantile (about 1.645).
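That conversion can be sketched in a few lines. A hedged example (the "width of a 90% CI" mentioned above is ambiguous between half-width and full width, so the function takes both; `moe_to_se` is a made-up name, and the normal-interval assumption should be checked with whoever produced the estimates):

```python
from statistics import NormalDist

def moe_to_se(moe, level=0.90, full_width=False):
    """Convert a reported margin of error to a standard error,
    assuming the MOE comes from a normal-approximation CI.
    If full_width is True, moe is the total width of the interval
    and is halved first; otherwise it is treated as the half-width."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # ~1.645 for 90%
    half = moe / 2 if full_width else moe
    return half / z

# e.g. a reported 90% MOE of 250 (assumed to be the half-width):
se = moe_to_se(250)  # ~152.0
```

If the provider's "width" turns out to be the full interval width, the same call with `full_width=True` halves it first, so the two conventions differ by exactly a factor of two.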

However, one concern is that we actually do have some pretty small cells, and most of the probabilities are expected to be quite low (a number of the `Y_obs` are zero). Wouldn't that be exactly the type of situation where the normal approximation doesn't hold?

Do you have a solution that doesn't rely on the normal approximation?

Hm, I don't know a good solution to that. For small Y, the binomial is better approximated by a Poisson, which you could try to impose on trials_obs and Y_obs. The issue is that, just like the binomial, the Poisson has a single parameter, so it does not lend itself to parametrizing the mean and variance separately. Hacking around that with something like Y_obs ~ (Poisson(lambda) - lambda) * Y_sd + lambda doesn't work either, because the Poisson only has support on the non-negative integers.
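The small-cell concern is easy to verify numerically: for a small cell with a rare outcome, the binomial piles most of its mass at zero while the moment-matched normal puts substantial probability on negative counts. A quick stdlib-only check (the n and p values here are invented for illustration):

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Exact binomial pmf via the stdlib."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.05                         # small cell, rare outcome
mu = n * p                              # moment-matched normal mean
sd = math.sqrt(n * p * (1 - p))         # moment-matched normal sd

mass_at_zero = binom_pmf(0, n, p)       # ~0.599: over half the binomial mass
neg_mass = NormalDist(mu, sd).cdf(0.0)  # ~0.23: normal mass on negative counts
```

With over half the true distribution sitting on a single point and nearly a quarter of the approximating normal's mass below zero, the approximation clearly breaks down in this regime.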

Please do post an update if you find a satisfactory solution!

Since you want a binomial likelihood and using standard errors, would modeling the observations as binomial, and the latent parameters within that as theta work for you?

```stan
data {
  int<lower=0> N;                   // number of observations
  array[N] int<lower=0> Y_obs;      // observed success counts
  array[N] int<lower=0> trials_obs; // observed trial counts
  array[N] real<lower=0> Y_sd;      // standard errors for Y
  array[N] real<lower=0> trials_sd; // standard errors for trials
}
parameters {
  vector<lower=0>[N] trials_latent;                // latent continuous values for trials
  vector<lower=0,upper=trials_latent>[N] Y_latent; // latent continuous values for Y
}
model {
  Y_obs ~ normal(Y_latent, Y_sd);
  trials_obs ~ normal(trials_latent, trials_sd);
  vector[N] theta = Y_latent ./ trials_latent;
  Y_obs ~ binomial(trials_obs, theta);
}
```
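Whichever specification you settle on, it is worth checking it against simulated data before fitting the real thing. A hedged sketch of the data-generating process this thread assumes (all numbers and the `simulate_cell` name are invented for illustration; the rounding step mirrors the integer estimates described above):

```python
import random

random.seed(1)  # reproducibility

def simulate_cell(trials_true, theta, trials_sd, y_sd):
    """Simulate one demographic cell under the assumed DGP:
    a binomial outcome on the true denominator, then rounded
    additive normal measurement noise on both reported counts."""
    y_true = sum(random.random() < theta for _ in range(trials_true))
    trials_obs = max(round(random.gauss(trials_true, trials_sd)), 0)
    y_obs = max(round(random.gauss(y_true, y_sd)), 0)
    return y_obs, trials_obs

# e.g. a cell of 500 people with a 10% rate and modest measurement noise
y_obs, trials_obs = simulate_cell(500, 0.10, 25.0, 5.0)
```

Fitting a candidate Stan model to many such simulated cells and checking whether it recovers theta (and the true counts, if modeled) is a cheap way to compare the specifications proposed in this thread.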

@ssp3nc3r Interesting idea, but we also need to incorporate covariates. i.e.,

```stan
theta = inv_logit(X * beta);
```

I don't think your specification allows for this?

`theta ~ beta_proportion(inv_logit(X * beta), kappa);`

Is this possible when the model already has theta fixed as:

```stan
vector[N] theta = Y_latent ./ trials_latent;
```

?

Also, in general it just seems odd to model the observed data as binomial, when it's really the "true" counts that are part of the DGP. Maybe this is the best we're going to do, though.

In my example, theta isn't fixed; its elements are parameters.

You have several options for modeling, but first you need to know what parameters you are interested in.

This isn't possible in Stan. You can't literally translate the continuous model to the discrete setting, because Stan doesn't support discrete parameters.

You could

1. do what @mippeqf suggests and use continuous approximations.
2. Alternatively, if the max values of the binomial are small, you can marginalize out the latent true values.
3. Use an alternative software package that allows discrete parameters, like NumPyro or PyMC in Python or NIMBLE in R. (At least I think these would all support the direct translation of the missing data model.)
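Option 2 above can be illustrated outside Stan: for one cell, sum a binomial likelihood over every feasible pair of integer "true" counts, weighting each by the normal measurement densities. A hedged sketch (the function names and the flat, improper prior over `trials_true` are simplifications; in Stan the double sum would become a nested `log_sum_exp`, and the O(n_max^2) cost is why this only works when the counts are small):

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Exact binomial pmf via the stdlib."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def marginal_lik(y_obs, trials_obs, theta, y_sd, trials_sd, n_max):
    """Likelihood of the noisy observed counts for one cell, with the
    integer 'true' counts marginalized out over a finite grid 0..n_max.
    Uses a flat (improper) prior on trials_true for simplicity."""
    total = 0.0
    for n_true in range(n_max + 1):
        p_n = NormalDist(n_true, trials_sd).pdf(trials_obs)
        for y_true in range(n_true + 1):
            p_y = NormalDist(y_true, y_sd).pdf(y_obs)
            total += p_n * binom_pmf(y_true, n_true, theta) * p_y
    return total

# a plausible theta should score higher than an implausible one
lik_low = marginal_lik(3, 20, theta=0.15, y_sd=1.0, trials_sd=2.0, n_max=40)
lik_high = marginal_lik(3, 20, theta=0.90, y_sd=1.0, trials_sd=2.0, n_max=40)
```

Evaluating this marginal likelihood over a grid of theta values (or embedding the equivalent `log_sum_exp` in a Stan model block) keeps the counts genuinely integer-valued without requiring discrete parameters.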