Logistic regression with misclassification

Hi all, I’ve got a modeling situation where I’m not passing simulation-based calibration (SBC), and I can’t tell if my model is wrong, if my simulation is wrong, or both.

To reduce my setting to a toy example, suppose we have an image stream that is classified by AI as either containing or not containing some object, for example a chair. I want to fit a logistic regression to predict which images contain clear examples of chairs. Each observation is associated with a number (score) related to the classifier’s confidence that it contains a chair. Extensive manual within-sample validation confirms that the probability that the image contains a chair (conditional on the classifier score but not conditional on any covariates from the logistic regression) is a predictable, well-constrained function of the score. To human eyes, it is unambiguous whether each image does or does not contain a chair.

It turns out that below some middling score, the images literally never contain chairs. So the response y that I’m working with is effectively a vector of probabilities, including many zeros, representing the unconditional probabilities that there’s a chair in the image. It is acceptable to assume that false detections of chairs (i.e. y \neq 0 when the correct classification is zero) occur completely at random among the images that do not contain chairs (i.e. irrespective of the covariates that predict whether the image does contain a chair).

The challenge in modeling these data is that the AI itself is an opaque box, so I have no generative model for the scores/probabilities.

First steps towards modeling

To simplify the problem and get traction initially, I first assume that the data stream y contains only two unique values: 0 and \rho, the single nonzero probability, and I also assume that the logistic regression is intercept-only. In this case, the expected number of clearly identifiable chairs is \sum y, and the expected number of misclassified images is m = \sum(y \neq 0) - \sum y, i.e. the number of nonzero observations minus the expected number of true ones.
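To make that bookkeeping concrete, here’s a tiny made-up example in R (the values in y are purely illustrative):

# toy observed data: 0 means the classifier score rules out a chair entirely;
# 0.8 plays the role of the single nonzero probability rho
y <- c(0, 0, 0.8, 0, 0.8, 0.8, 0, 0)

expected_true_ones  <- sum(y)        # expected number of images that really contain chairs
observed_nonzeros   <- sum(y != 0)   # number of images the classifier flagged
expected_false_ones <- observed_nonzeros - expected_true_ones  # expected misclassifications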

This suggests the following Stan model:

data {
  int N;
  array[N] real y;
}

transformed data {
  real expected_true_ones = sum(y);  // "true ones" here means true state is one, not true state is one AND classifier output is one
  real expected_true_zeros = N - expected_true_ones;
  int observed_zeros = 0;
  for(i in 1:N){
    if(y[i] == 0){
      observed_zeros += 1;
    }
  }
  int observed_nonzeros = N - observed_zeros;
  real expected_false_ones = observed_nonzeros - expected_true_ones;
}

parameters {
  // probability that an image contains a chair. We put an upper-bound constraint
  // on the prior because if theta is too large there won't be enough true zeros
  // to accommodate all the misclassifications.
  real<lower = 0.05, upper = 0.7> theta;
}

model {
  for(i in 1:N){
    if(y[i] == 0){
      target += log1m(theta) + // generative process yields zero
        log(observed_zeros / expected_true_zeros); // we don't see a false one.
    } else {
      real ll_given_false_one = log1m(theta) + // generative process yields zero
        log(expected_false_ones / expected_true_zeros); // but the zero is misclassified as a false one
      real ll_given_true_one = log(theta); // generative process yields a one 
      target += log_sum_exp(ll_given_false_one, ll_given_true_one);
    }
  }
}
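For clarity, the per-observation likelihood this model intends is (in the notation of the transformed data block):

p(y_i = 0 \mid \theta) = (1 - \theta) \cdot observed_zeros / expected_true_zeros

p(y_i \neq 0 \mid \theta) = (1 - \theta) \cdot expected_false_ones / expected_true_zeros + \theta

That is, a zero must be a true zero that escaped misclassification, and a nonzero is either a true zero that got misclassified or a true one.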

Now I’m not claiming that the above model is the correct one, but as we’ll see it seems to be at least pretty close.

Now let’s consider how to simulate data under the model. First we pick an arbitrary value for \rho, since we don’t have a generative model for \rho. Next we simulate the data under a standard logistic regression (here assumed to be intercept-only, so we just need to specify a single probability theta). Then we replace all of the ones with \rho. And finally we change, completely at random, some of the zeros to \rho as well, representing the misclassifications. How many zeros do we change? I think we want something like the helper below, which, together with the generator function that follows it, is compatible with the SBC function:

extended_binomial_rng <- function(p, s) {
  # given an observed binomial success count s and a success probability p,
  # draw the number of trials, under a flat prior on the number of trials.
  assertthat::assert_that(is.numeric(p) & length(p) == 1)
  assertthat::assert_that(is.numeric(s) & length(s) == 1 & s == round(s) & s > 0)
  assertthat::assert_that(p > 0 & p <= 1)
  
  r <- stats::runif(1)
  c <- 0
  y <- s - 1
  while(c < r){
    y <- y + 1
    c <- c + p * stats::dbinom(s, y, p) # the normalizing constant is 1/p
  }
  y
}
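As a quick, purely illustrative sanity check of that normalizing constant, the empirical distribution of draws should match the weights p * dbinom(s, n, p) over n >= s (the values below are arbitrary):

set.seed(1)
p <- 0.4
s <- 5
draws <- replicate(1e4, extended_binomial_rng(p, s))
ns <- s:max(draws)
cbind(
  empirical = sapply(ns, function(n) mean(draws == n)),
  target    = p * dbinom(s, ns, p)
)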

bmc_generator_single <- function(N, rho = .9, theta_min = 0.05, theta_max = .7){  
  # N is the number of data points we are generating;
  #   rho is the probability that any given observed one is a true one.
  
  # sample the bernoulli probability theta from its uniform prior
  # the prior must be chosen so that it doesn't approach one too closely,
  # which would risk needing to sample in more false ones than there are 
  # zeros to falsify.
  theta <- runif(1, theta_min, theta_max)
  
  # generate the true data, and initialize a container for the observed data
  true_data <- obs_data <- rbinom(N, 1, theta)
  
  # number of underlying ones, number of underlying zeros, 
  #   number of false ones
  n_one_true <- sum(true_data)
  n_zero_true <- N - n_one_true
  n_one_false <- extended_binomial_rng(rho, n_one_true) - n_one_true
  
  # update some of the true zeros in obs data
  f_inds <- sample(seq_len(n_zero_true), n_one_false)
  obs_data[which(obs_data == 0)[f_inds]] <- 1
  
  # encode obs_data as the probability that the observation reflects
  # a true one.
  obs_data[obs_data == 1] <- rho
  
  # format for return
  list(
    variables = list(
      theta = theta
    ),
    generated = list(
      N = N,
      y = obs_data # name must match the Stan data block
    )
  )
}
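For reference, the way I’m hooking the generator and the Stan model together is roughly the following (a sketch using the SBC package’s generator/backend interface; the file path and the numbers of observations/simulations are placeholders):

library(SBC)
library(cmdstanr)

# compile the Stan model above (path is a placeholder)
model <- cmdstan_model("misclassified_logistic.stan")

# wrap the generator and simulate datasets
generator <- SBC_generator_function(bmc_generator_single, N = 1000, rho = 0.9)
datasets  <- generate_datasets(generator, n_sims = 100)

# fit every simulated dataset and compute the SBC ranks
backend <- SBC_backend_cmdstan_sample(model)
results <- compute_SBC(datasets, backend)
plot_rank_hist(results)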

When I hook this generator and stan model together in simulation-based calibration, the results look pretty good when \rho is high. Here’s \rho = 0.9, for example:

However, moving down to lower \rho reveals a problem. Here’s \rho = 0.4 with the prior on theta adjusted to be uniform on (0.05, 0.3).


This shows that the inference is too uncertain (or that the simulation is too certain) in this setting.

I would like to avoid changing my Stan model to explicitly match the simulation, because doing so would require marginalizing over all possible numbers of misclassified images. Also, note that I don’t have confidence that my simulation definitely captures the characteristics of the real dataset. In fact, I think it’s possible that the Stan model captures what I want and my simulation is bad.
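To spell out what I mean by that marginalization, I think matching the simulation exactly would require something like

p(y \mid \theta, \rho) = \sum_{m} p(m \mid \theta, \rho, N) \, p(y \mid m, \theta, N)

where m is the latent number of misclassified (false-one) images, with the sum running over every m that is feasible given the observed number of nonzeros.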

Can anybody help me to appreciate why the simulation and the model differ, suggest a tweak to one or the other to bring it better in line with my modeling setting, and/or suggest an alternative model that might be appropriate for data like these? Maybe @Bob_Carpenter?

A final note: I’m aware that in the toy example above, I could just use the manual within-sample validation to constrain theta and be done with it. However, my hope is to extend this approach to the setting where the logistic regression contains covariates to be estimated from the data.
