# Why can't I recover measurement error for logistic regression?

Let’s imagine a scenario where a binary variable is drawn from a vector of probabilities that we can convert to the log-odds scale:

``````library(boot)
library(tidyverse)
library(rstan)
set.seed(0)

N <- 10000

logodds_true <- rnorm(N)
p_true <- inv.logit(logodds_true)
Y <- rbinom(N, 1, p_true)
``````

We come up with a measured log-odds for each outcome that contains some error:

``````sigma_error <- 0.5
logodds_meas <- logodds_true + rnorm(N, sd = sigma_error)
``````

I’ve tried to write a model that mimics this data generating process:

``````data {
int<lower = 0> N;
vector[N] logodds_meas;
int<lower = 0, upper = 1> Y[N];
}
parameters {
vector[N] logodds_true;
real<lower = 0> sigma_error;
}
model {
// prior not super important -- going with half normal
sigma_error ~ normal(0, 1);

// noisy measurements of true log-odds
(logodds_meas - logodds_true) ~ normal(0, sigma_error);
Y ~ bernoulli_logit(logodds_true);
}
``````

But for some reason, I can’t recover `sigma_error` – the posterior always gives back far too large of an estimate, even with excessively high `N`. For example, the posterior mean in the above example is around 1.2, whereas the actual standard deviation of the error was 0.5.

Anyone know what I’m doing wrong here?

To estimate `sigma_error` from the data, the data need to be able to constrain `logodds_true`. But each element of `logodds_true` is informed only by a single Bernoulli observation. This isn’t enough to constrain the values of `logodds_true` (consider that for every element, the maximum likelihood estimate for `logodds_true` based only on the data and not the hyperparameters is either infinity or negative infinity). So what’s likely happening here is that the data give almost no information about `logodds_true`, and most of the information about `logodds_true` comes from the hyperparameters and ultimately the prior (which is extremely important in this model, contrary to the comment in your code).

I suspect that if you had even a small handful of Bernoulli trials per observation (i.e. you had a Binomial distribution with trials > 1), you would quickly see convergence to the true value of `sigma_error` as the number of trials increases.

3 Likes

I think this is correct. A common trick is to put a hyperprior on `logodds_true` that is estimated from the data. E.g. try adding something like `logodds_true ~ normal(mu_logodds_measured, sigma_logodds_measured)` and `log_odds_measured ~ normal(mu_logodds_measured, sigma_logodds_measured)`

1 Like

Ah, that makes sense. If in practice I only have a single Bernoulli trial per observation, then is the next best thing to start bucketing observations by their (rounded) log-odds?

this worked like a charm. Only change I made is to the `logodds_true` distribution, since I think you have to subtract out the measurement variance:

``````logodds_true ~ normal(mu_logodds_meas, sqrt(sigma_logodds_meas^2 - sigma_error^2));
``````