Binomial model with prior information on sensitivity and specificity

Hi all,

I am modeling binary classifications from a machine-learning classifier dependent on a few discrete covariates (for context, see below). I also have data from a classifier validation in the form of a confusion table.

The basic idea came from posts of @mitzimorris and @Bob_Carpenter in a thread about an unrelated topic. As in their example, I want to implement the model using brms for reasons of convenience, but the question is not about brms. I am open to implementing the model directly in Stan if this makes it easier.

However, instead of providing sensitivity and specificity as fixed data values, I want to provide beta priors with counts from the confusion table as shape parameters. Changing @mitzimorris’ code for this purpose is straightforward. The important part is

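library(brms)

# define a *vectorized* custom family (no loop over observations)
binomial_sens_spec_vec <- custom_family(
  "binomial_sens_spec", dpars = c("mu", "sens", "spec"),
  links = c("logit", "identity", "identity"),
  lb = c(NA, 0, 0), ub = c(NA, 1, 1),
  type = "int", vars = c("trials"), loop = FALSE
)

# define the corresponding Stan density function
stan_density_binomial_sens_spec_vec <- "
real binomial_sens_spec_lpmf(array[] int y, vector mu, real sens, real spec, array[] int N) {
  // probability of a positive classification: true positives plus false positives
  return binomial_lpmf(y | N, mu * sens + (1 - mu) * (1 - spec));
}
"

# make the density available to brms (passed to brm() via the stanvars argument)
stanvars_binomial_sens_spec_vec <- stanvar(
  scode = stan_density_binomial_sens_spec_vec, block = "functions"
)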

prior_sens = paste0("beta(",cm[2,2], ",", cm[1,2], ")")
prior_spec = paste0("beta(",cm[1,1], ",", cm[2,1], ")")
priors = c(
  ...
  set_prior(prior_sens, class = "sens"),
  set_prior(prior_spec, class = "spec")
)


m = brm(
      value1 | trials(n) ~ Org + ...,
      data = dd,
      family = binomial_sens_spec_vec,
      prior = priors,
      stanvars = stanvars_binomial_sens_spec_vec,
      ...
)

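where cm is the confusion matrix from the validation study, with predictions in rows and the truth in columns (see the table at the end of this post).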
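As a minimal sketch of how the indices map onto the two priors, assuming the counts from the "tennis" table at the end of this post and assuming that rows are predictions and columns are the truth (the dimnames are only illustrative labels):

```r
# Confusion matrix from the validation study
# (assumed layout: rows = prediction, columns = truth)
cm <- matrix(
  c(454,  67,
    109, 307),
  nrow = 2, byrow = TRUE,
  dimnames = list(prediction = c("no_tennis", "tennis"),
                  truth      = c("no_tennis", "tennis"))
)

# sensitivity: true positives and false negatives (truth = tennis column)
prior_sens <- paste0("beta(", cm[2, 2], ", ", cm[1, 2], ")")
# specificity: true negatives and false positives (truth = no tennis column)
prior_spec <- paste0("beta(", cm[1, 1], ", ", cm[2, 1], ")")

prior_sens  # "beta(307, 67)"
prior_spec  # "beta(454, 109)"
```

If cm were stored with the truth in rows instead, the off-diagonal indices would have to be swapped.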

My reasoning behind the idea of using priors instead of data for sensitivity and specificity is that I want to propagate the uncertainty from the validation study of the classifier to the analysis of the main results. Validation studies in this area are usually not super big (maybe 1,000 test cases), so there is often plenty of uncertainty in the estimates of classifier performance.

The approach seems to work well as long as the analysis data does not become too big. The model is able to recover known parameters in simulations. However, in my actual application, in which I analyze about 400,000 cases, the information in the data completely overwhelms the information in the prior, leading to estimates for sensitivity and specificity that are far away from the observed values in the validation study. I understand why that happens (these values for sensitivity and specificity fit the analysis data better), but it does not make sense substantively, because there is no new information about the quality of the classifications in the analysis data.
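One way to see how the analysis data can carry any information about sensitivity and specificity: under the likelihood above, the probability of a positive classification is mu * sens + (1 - mu) * (1 - spec), which can only take values between 1 - spec and sens (when sens + spec > 1). A minimal sketch in base R, assuming the validation counts reported at the end of this post as point estimates; p_obs is a hypothetical helper name, not part of the model code:

```r
# Point estimates from the validation confusion matrix in this post
sens <- 307 / (307 + 67)    # about 0.82
spec <- 454 / (454 + 109)   # about 0.81

# Probability of a positive classification given the true probability mu
# (same expression as in the Stan density above)
p_obs <- function(mu, sens, spec) mu * sens + (1 - mu) * (1 - spec)

mu <- seq(0, 1, by = 0.25)
round(p_obs(mu, sens, spec), 3)

# The reachable range is bounded by 1 - spec (at mu = 0) and sens (at mu = 1)
range(p_obs(mu, sens, spec))
c(lower = 1 - spec, upper = sens)
```

Any subgroup whose observed rate of positive classifications sits near or outside this range can only be fit by moving sens or spec, which is one route by which a large data set could pull them away from the validation values. Whether this is what drives the result in the actual application is not something the sketch can show.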

Am I missing anything obvious here? Is it a misconception on my part that one could bring information about sensitivity and specificity into a binomial model in this way? Is there any other way to set up the model so that the prior informs sensitivity and specificity, which are used in estimating the model, but model estimation does not inform sensitivity and specificity? Any recommendations are highly appreciated.

For context:
We are analyzing gender bias in questions asked in post-match press conferences in professional tennis. One of the outcomes is, for example, whether the question is actually about tennis, with the assumption being that female players are asked more about other topics and, consequently, less about tennis. The analysis data comprises about 400,000 questions. The predictor of interest is player gender. In addition, we consider tournament, press conference, player, and year in a hierarchical model.
We use a BERT-NLI zero-shot model for classification. We evaluated the quality of the classifications in a separate validation study with about 1,000 questions. For reference, the confusion matrix for the “tennis” category looked like this:

|                       | Truth: No tennis | Truth: Tennis |
|-----------------------|------------------|---------------|
| Prediction: No tennis | 454              | 67            |
| Prediction: Tennis    | 109              | 307           |

I think your use case is the same as the one for which this model was originally developed: using prior information on test sensitivity and specificity (in the form of the data supplied by the test manufacturer, which is small) to assess the sensitivity and specificity of test results from a testing facility whose test calibration information is not available.

This is the paper: "Bayesian Analysis of Tests with Unknown Specificity and Sensitivity", Journal of the Royal Statistical Society Series C: Applied Statistics.

Section 4 of this paper shows how the hyperpriors on the sensitivity and specificity affect the analysis.

Take a look at this repo, which has a case study and talk slides: bob-carpenter/diagnostic-testing on GitHub (statistical models to analyze diagnostic tests).

This often happens when you have two models which are not compatible with each other and you trust one of them more. What BUGS did was let you “cut” parameters. So you could add the uncertainty for sensitivity and specificity and have that propagate through fits without having information flow back to sensitivity and specificity. This is very common in PK/PD models in pharmacology where the PK model (e.g., how the metabolism breaks down a drug) is well understood and tight and the PD model (e.g., how the drug concentration affects a disease) is ad hoc and not very well specified. But people want to propagate the uncertainty from a careful PK study into the PD model without letting the PD model distort the posterior of the PK model.

Alas, we have no way to do that in Stan other than multiple imputation. When you think about it, cut in BUGS amounts to a form of multiple imputation. By that, I mean simulate several values for sensitivity and specificity, then fix them for inference in the bigger model. Then combine all the inferences in the bigger model for different values of sensitivity and specificity together. It’s not fully Bayesian, but it stops the bigger model from distorting the base model.
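A minimal sketch of that multiple-imputation idea in base R, under strong simplifying assumptions: a single proportion with no covariates, a flat prior on the true proportion, a grid approximation standing in for the "bigger model", and made-up analysis counts y and n. The validation counts are the ones from the original post; K, S, grid, and fit_fixed are hypothetical names chosen for illustration. In the real application, each call to fit_fixed would be replaced by a full model fit with sens and spec passed in as fixed data.

```r
set.seed(1)

# Validation counts from the confusion matrix in the original post
tp <- 307; fn <- 67    # truth: tennis
tn <- 454; fp <- 109   # truth: no tennis

# Hypothetical analysis data (assumption): y positive classifications out of n
y <- 900; n <- 2000

K <- 20                        # number of imputed (sens, spec) pairs
S <- 500                       # posterior draws kept per imputation
grid <- seq(0, 1, by = 0.001)  # grid over the true proportion

# Step 1: simulate sens and spec from the validation study alone,
# using the same beta shapes as the priors in the original post
sens <- rbeta(K, tp, fn)
spec <- rbeta(K, tn, fp)

# Step 2: fit the main model with sens and spec held fixed
fit_fixed <- function(se, sp) {
  p <- grid * se + (1 - grid) * (1 - sp)    # prob. of a positive classification
  logpost <- dbinom(y, n, p, log = TRUE)    # flat prior on the true proportion
  w <- exp(logpost - max(logpost))          # unnormalised posterior weights
  sample(grid, S, replace = TRUE, prob = w) # approximate posterior draws
}

# Step 3: pool the draws from all K fits
pooled <- unlist(Map(fit_fixed, sens, spec))
quantile(pooled, c(0.025, 0.5, 0.975))
```

Because sens and spec are drawn before the analysis data are touched, nothing in y or n can move them, while their uncertainty still shows up in the spread of the pooled draws.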

Thank you, @mitzimorris, for clarifying the original purpose of the code and the literature. It is super interesting how the infection testing and content analysis models converge. Two quick follow-ups on the project if you have the time.

  • Did you publish the project using the custom brms code somewhere? I would be very interested to see this work in context.
  • Did you also consider the case in which not only the response but also a predictor was measured with misclassification error? That would be the next step for me.

Thank you, @Bob_Carpenter, for the additional context on the conceptual issue. It is unfortunate that there is no easy solution within Stan (MI is, of course, fine, but it adds considerably to computation time). Still, I am reassured by the explanation that this is, indeed, a feature of the fully Bayesian approach and not something I have overlooked.

And in case anyone ever stumbles across the thread: I observed that using (very) tight priors for sensitivity and specificity instead of completely fixed data values, e.g., beta(30700, 6700) for sensitivity in the example from the original post, leads to more efficient estimation in terms of computing time with substantially similar results, at least in my case.
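To give a sense of how tight such a prior is, here is a small base R comparison of beta(307, 67), the shapes implied by the validation counts for sensitivity, with the scaled-up beta(30700, 6700). The helper beta_summary is a hypothetical name for illustration.

```r
# Mean, standard deviation, and central 95% interval of a beta(a, b) distribution
beta_summary <- function(a, b) {
  c(mean  = a / (a + b),
    sd    = sqrt(a * b / ((a + b)^2 * (a + b + 1))),
    lower = qbeta(0.025, a, b),
    upper = qbeta(0.975, a, b))
}

res <- rbind(validation = beta_summary(307, 67),
             tight      = beta_summary(30700, 6700))
round(res, 4)
```

Both priors share the same mean; multiplying the counts by 100 shrinks the standard deviation by roughly a factor of 10, so the tight version behaves almost like a fixed value while still being a proper prior.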