Why can't I recover measurement error for logistic regression?

walkerharrison · March 8, 2021, 5:56am

Let’s imagine a scenario where a binary variable is drawn from a vector of probabilities that we can convert to the log-odds scale:

library(boot)
library(tidyverse)
library(rstan)
set.seed(0)

N <- 10000

logodds_true <- rnorm(N)
p_true <- inv.logit(logodds_true)
Y <- rbinom(N, 1, p_true)

We come up with a measured log-odds for each outcome that contains some error:

sigma_error <- 0.5
logodds_meas <- logodds_true + rnorm(N, sd = sigma_error)

I’ve tried to write a model that mimics this data generating process:

data {
  int<lower = 0> N;
  vector[N] logodds_meas;
  int<lower = 0, upper = 1> Y[N];
}
parameters {
  vector[N] logodds_true;
  real<lower = 0> sigma_error;
}
model {
  // prior not super important -- going with half normal
  sigma_error ~ normal(0, 1);
  
  // noisy measurements of true log-odds
  (logodds_meas - logodds_true) ~ normal(0, sigma_error);
  Y ~ bernoulli_logit(logodds_true);
}

But for some reason, I can’t recover sigma_error – the posterior always gives back far too large of an estimate, even with excessively high N. For example, the posterior mean in the above example is around 1.2, whereas the actual standard deviation of the error was 0.5.

Anyone know what I’m doing wrong here?

jsocolar · March 8, 2021, 11:39pm

To estimate sigma_error from the data, the data need to be able to constrain logodds_true. But each element of logodds_true is informed only by a single Bernoulli observation. This isn’t enough to constrain the values of logodds_true (consider that for every element, the maximum likelihood estimate for logodds_true based only on the data and not the hyperparameters is either infinity or negative infinity). So what’s likely happening here is that the data give almost no information about logodds_true, and most of the information about logodds_true comes from the hyperparameters and ultimately the prior (which is extremely important in this model, contrary to the comment in your code).

I suspect that if you had even a small handful of Bernoulli trials per observation (i.e. you had a Binomial distribution with trials > 1), you would quickly see convergence to the true value of sigma_error as the number of trials increases.

js592 · March 9, 2021, 12:19am

I think this is correct. A common trick is to put a hyperprior on logodds_true that is estimated from the data. E.g. try adding something like logodds_true ~ normal(mu_logodds_measured, sigma_logodds_measured) and log_odds_measured ~ normal(mu_logodds_measured, sigma_logodds_measured)

walkerharrison · March 9, 2021, 12:57am

Ah, that makes sense. If in practice I only have a single Bernoulli trial per observation, then is the next best thing to start bucketing observations by their (rounded) log-odds?

walkerharrison · March 9, 2021, 1:39am

this worked like a charm. Only change I made is to the logodds_true distribution, since I think you have to subtract out the measurement variance:

logodds_true ~ normal(mu_logodds_meas, sqrt(sigma_logodds_meas^2 - sigma_error^2));

Topic		Replies	Views
Correlated posterior sample in logistic model with measurement error in probability scale Modeling techniques , logistic-regression , reparametrization	1	735	January 30, 2022
Binomial Regression w/ Log N Error in Predictor brms	0	120	May 9, 2024
Measurement error models on lognormal distributed data brms specification	8	265	November 7, 2024
How to model this measurement error problem best? Modeling techniques , specification	9	783	September 20, 2021
Measurement error - manual example on pp. 203-4 Modeling	9	1891	January 3, 2019

Why can't I recover measurement error for logistic regression?

Related topics