Why can't I recover measurement error for logistic regression?

Let’s imagine a scenario where a binary variable is drawn from a vector of probabilities that we can convert to the log-odds scale:

library(boot)
library(tidyverse)
library(rstan)
set.seed(0)

N <- 10000

logodds_true <- rnorm(N)
p_true <- inv.logit(logodds_true)
Y <- rbinom(N, 1, p_true)

We come up with a measured log-odds for each outcome that contains some error:

sigma_error <- 0.5
logodds_meas <- logodds_true + rnorm(N, sd = sigma_error)

I’ve tried to write a model that mimics this data generating process:

data {
  int<lower = 0> N;
  vector[N] logodds_meas;
  int<lower = 0, upper = 1> Y[N];
}
parameters {
  vector[N] logodds_true;
  real<lower = 0> sigma_error;
}
model {
  // prior not super important -- going with half normal
  sigma_error ~ normal(0, 1);
  
  // noisy measurements of true log-odds
  (logodds_meas - logodds_true) ~ normal(0, sigma_error);
  Y ~ bernoulli_logit(logodds_true);
}

But for some reason, I can’t recover sigma_error – the posterior always gives back far too large of an estimate, even with excessively high N. For example, the posterior mean in the above example is around 1.2, whereas the actual standard deviation of the error was 0.5.

Anyone know what I’m doing wrong here?

To estimate sigma_error from the data, the data need to be able to constrain logodds_true. But each element of logodds_true is informed only by a single Bernoulli observation. This isn’t enough to constrain the values of logodds_true (consider that for every element, the maximum likelihood estimate for logodds_true based only on the data and not the hyperparameters is either infinity or negative infinity). So what’s likely happening here is that the data give almost no information about logodds_true, and most of the information about logodds_true comes from the hyperparameters and ultimately the prior (which is extremely important in this model, contrary to the comment in your code).

I suspect that if you had even a small handful of Bernoulli trials per observation (i.e. you had a Binomial distribution with trials > 1), you would quickly see convergence to the true value of sigma_error as the number of trials increases.

3 Likes

I think this is correct. A common trick is to put a hyperprior on logodds_true that is estimated from the data. E.g. try adding something like logodds_true ~ normal(mu_logodds_measured, sigma_logodds_measured) and log_odds_measured ~ normal(mu_logodds_measured, sigma_logodds_measured)

1 Like

Ah, that makes sense. If in practice I only have a single Bernoulli trial per observation, then is the next best thing to start bucketing observations by their (rounded) log-odds?

this worked like a charm. Only change I made is to the logodds_true distribution, since I think you have to subtract out the measurement variance:

logodds_true ~ normal(mu_logodds_meas, sqrt(sigma_logodds_meas^2 - sigma_error^2));