Training a classifier with a probabilistically weighted corpus

I’ve gone back to working on crowdsourcing problems now that they’re relevant to transformers. One thing we know about training classifiers is that it works better if you have not just a categorical observation but also the probabilities of the different output categories. That’s not surprising, because the probabilities carry far more information than a single draw; it’s just common sense (and it follows from the Rao-Blackwell theorem!). This motivates training on a soft corpus, like the one you get out of a crowdsourcing model, rather than on a gold standard.
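To spell out the Rao-Blackwell intuition in notation that’s mine rather than the case study’s: if an item’s true probability of label 1 is $p$ and the classifier assigns it probability $\pi$, the soft-training objective is the conditional expectation of the single-draw log likelihood,

$$
\mathbb{E}_{y \sim \textrm{bernoulli}(p)}\big[\log \textrm{bernoulli}(y \mid \pi)\big]
  = p \log \pi + (1 - p) \log(1 - \pi),
$$

so it has the same expectation as training on a sampled hard label, but with strictly lower variance.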

As an appendix to a paper we’re about to submit to ICML on adding difficulty and discrimination to Dawid-Skene models, we wanted to motivate probabilistic training. So I wrote a quick case study with simulated data that I can share here:

  • Bob Carpenter. 2023. Training a classifier with a probabilistic data set: Discrete and weighted training with Bayes and maximum likelihood. DRAFT!
    carpenter-prob-training-classifier.html (1.9 MB)

It shows how to do soft, weighted training with Stan, and evaluates both MLE and Bayesian inference.
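
For readers who want the flavor without opening the case study, here’s a minimal sketch of the weighted training idea in Stan. The data layout and variable names are my own, not the case study’s: each item n comes with a probability p[n] of having label 1 (e.g., a posterior from a Dawid-Skene-style annotation model) and contributes its expected log likelihood.

```stan
data {
  int<lower=0> N;                   // number of items
  int<lower=0> K;                   // number of covariates
  matrix[N, K] x;                   // covariates
  vector<lower=0, upper=1>[N] p;    // Pr[label = 1] from the annotation model
}
parameters {
  real alpha;                       // intercept
  vector[K] beta;                   // regression coefficients
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  // weighted training: each item contributes its expected log likelihood,
  // p[n] * log Pr[y = 1 | x[n]] + (1 - p[n]) * log Pr[y = 0 | x[n]]
  for (n in 1:N) {
    real logit_pi = alpha + x[n] * beta;
    target += p[n] * bernoulli_logit_lpmf(1 | logit_pi)
              + (1 - p[n]) * bernoulli_logit_lpmf(0 | logit_pi);
  }
}
```

The same program can be run through Stan’s optimizer (for a penalized MLE) or its sampler (for full Bayes).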

I’m keen to hear any feedback on how to make this point more clearly. I found this surprisingly hard to write up and am not sure I’ve succeeded.

Hi Bob, what confuses me in this case study is that the data are assumed (I think) to consist of the true outcome probability conditional on the covariates (section 3.3). But to obtain those data, we must already have some way to estimate the outcome probability conditional on the covariates. And if we already have that, then why do we need a logistic regression to estimate… the outcome probability conditional on the covariates?

I might be off the mark here, but if so hopefully it points the way towards some useful clarifications to be made in the case study.

One unrelated point that I think could also be confusing is that you assume different generative processes in the various options you present. In particular, the third option contains a random row effect in the logistic regression, which makes it hard to understand the differences in performance between the various techniques. I recognize the inherent challenge here: with just one Bernoulli outcome per row, estimating the random row effect leads to a degenerate model, but without the random row effect the linear regression on the true log odds is singular. Maybe you could obtain a fairer apples-to-apples comparison by assuming the row-effect standard deviation is known and fixed in all cases, so that you can straightforwardly treat the sampled case as a single sample from the log odds used in the linear-regression-on-log-odds case? Something like the sketch below.
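
Here’s roughly what I have in mind (just a sketch; the variable names and priors are made up, and the point is only that sigma enters as data rather than as a parameter):

```stan
data {
  int<lower=0> N;                      // number of rows
  int<lower=0> K;                      // number of covariates
  matrix[N, K] x;                      // covariates
  array[N] int<lower=0, upper=1> y;    // one Bernoulli outcome per row
  real<lower=0> sigma;                 // row-effect sd, fixed and known
}
parameters {
  real alpha;
  vector[K] beta;
  vector[N] eta;                       // random row effects
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  eta ~ normal(0, sigma);              // fixed scale sidesteps the degeneracy
  y ~ bernoulli_logit(alpha + x * beta + eta);
}
```

The linear-regression-on-log-odds variant would then use the same fixed sigma for its residual scale, so all the options share one generative process.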

Edit: for what it’s worth, one line of thought that makes the connection between the linear regression on log odds and the single-outcome model clear is that the former is the large-trials limit of binomial regression with a logit link and a random row effect. The latter can then be understood as starting from the former and throwing out all of the binomial trials except one at random (and also throwing out the row effect). The “most-probable-category” model can be understood as starting from the former and throwing out all of the trials except one, NOT at random.
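
To make that limit concrete (again, my notation, not the case study’s): with $T$ binomial trials per row,

$$
y_n \sim \textrm{binomial}\!\big(T,\ \textrm{logit}^{-1}(\alpha + x_n^\top \beta + \epsilon_n)\big),
\qquad \epsilon_n \sim \textrm{normal}(0, \sigma),
$$

the law of large numbers gives $y_n / T \to \textrm{logit}^{-1}(\alpha + x_n^\top \beta + \epsilon_n)$ as $T \to \infty$, so $\textrm{logit}(y_n / T) \to \alpha + x_n^\top \beta + \epsilon_n$; that’s exactly a linear regression of the empirical log odds on the covariates with residual $\epsilon_n$. Keeping a single trial is the $T = 1$ Bernoulli case.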