Training a classifier with a probabilistic data set:
Discrete and weighted training with Bayes and maximum likelihood
Author
Bob Carpenter
Published
October 16, 2023
1 Introduction
This short technical note evaluates several approaches to training logistic regression models when the probabilities of categorical outcomes are provided as data. The motivating example is the probability estimated by a model of data rating (also known as coding, annotation, labeling, and crowdsourcing), such as that of Dawid and Skene (1979). Three out of five dentists acting as raters might say an X-ray shows a cavity, and a data rating model taking into account those dentists' biases and accuracies might infer an 85% chance that the X-ray shows a cavity.
The first result is that to create a corpus with a discrete outcome per item, it is more effective to sample the category for each item given its probability distribution than to assign it the most probable category. Majority-vote approaches to crowdsourcing are instances of the suboptimal most-probable strategy.
The second result is that it is even better to use the category probabilities directly, training the classifier with all categories weighted by their probabilities. This result follows from the Rao-Blackwell theorem in the Bayesian setting.
The third result is that the best approach is to transform the probabilities to their log odds and then fit a linear regression. This result follows because logistic regression implies a log odds link and binomial sampling.
Evaluation is in terms of two proper scoring metrics: squared error in parameter estimation and held-out data log likelihood. Two widely used estimators are compared, a Bayesian posterior mean estimator and a frequentist penalized maximum likelihood estimator. Bayesian posterior predictive inference is also compared to inference based on plugging in a frequentist point estimate.
With simulated data, all three results hold for parameter estimation and predictive inference with both Bayesian and frequentist approaches. In situations where the data follows the model’s data generating process, the Bayesian approach modestly outperforms the frequentist approach. This last result is not surprising given that Bayesian estimates are designed to minimize expected square error.
2 Logistic regression
Logistic regression applies in the situation where there are \(N\) binary observations \(y_n \in \{ 0, 1 \}\) and each observation comes with a (row) vector \(x_n \in \mathbb{R}^D\) of covariates (also called features or predictors). Logistic regression is a generalized linear model, with a parameter vector \(\beta \in \mathbb{R}^D\) of regression coefficients (also called weights), the link function is log odds (also known as logit), and the family is binomial, so the sampling distribution is \[
Y_n \sim \textrm{binomial}(\textrm{logit}^{-1}(x_n \cdot \beta)),
\] where \(\textrm{logit}:(0, 1) \rightarrow \mathbb{R}\) and its inverse \(\textrm{logit}^{-1}:\mathbb{R} \rightarrow (0, 1)\) are defined by \[
\textrm{logit}(u) = \log \frac{u}{1 - u}
\qquad
\textrm{logit}^{-1}(v) = \frac{1}{1 + \exp(-v)}.
\] The definition of the binomial distribution implies \[
\Pr[Y_n = 1 \mid X_n = x_n] = \textrm{logit}^{-1}(x_n \cdot \beta).
\]
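In code, the link and its inverse are one-liners; here is a NumPy sketch (the function names are ours, not from the note's code):

```python
import numpy as np

def logit(u):
    """Log odds: maps probabilities in (0, 1) to the real line."""
    return np.log(u / (1 - u))

def inv_logit(v):
    """Inverse log odds (logistic sigmoid): maps reals to (0, 1)."""
    return 1 / (1 + np.exp(-v))
```

The two functions invert each other, so `inv_logit(logit(p))` recovers `p` up to floating-point error.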
A digression on “discriminative” models
In machine learning, logistic regression models are often called “discriminative” because they do not typically provide a data-generating process for the covariates. In a regression, the process through which the covariates are generated does not affect the coefficient estimates (see section 14.1, page 354, of Gelman et al. (2013)). With a model \(p(x \mid \phi)\) for the covariates, the joint model factors \[
p(x, y, \beta, \phi) = p(x \mid \phi) \cdot p(y \mid x, \beta),
\] and thus so does the posterior for the regression coefficients, \[
p(\beta \mid x, y) \propto p(\beta) \cdot p(y \mid x, \beta).
\] A model \(p(x \mid \phi)\) of the covariates, if available, can be used to handle missing data.
3 Data simulation
For simplicity, consider simulating the parameter vector \(\beta \in \mathbb{R}^D\) as standard normal, \[
\beta \sim \textrm{normal}(0, 1).
\]
Given a data size \(N\), generate a covariate matrix \(x \in \mathbb{R}^{N \times D}\) by taking \[
x_n \sim \textrm{multi-normal}(0, \Sigma),
\] where \(\Sigma \in \mathbb{R}^{D \times D}\) is a full rank covariance matrix (i.e., it is symmetric and positive definite). The first column will then be set to 1 to act as an intercept.
To introduce correlation among the predictors, let \(\Sigma\) be the unit scale random-walk covariance matrix defined by a correlation value \(\rho \in (-1, 1)\) by \[
\Sigma_{i, j} = \rho^{| i - j |}.
\] For example, with \(D = 20\) and \(\rho = 0.9\), the first row of the covariance matrix is \[
\Sigma_{1, 1:20} =
\begin{bmatrix}
1.00 & 0.90 & 0.81 & 0.73 & 0.66 & \cdots & 0.19 & 0.17 & 0.15 & 0.14
& 0.12
\end{bmatrix}.
\]
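The covariance construction and covariate simulation can be sketched in NumPy as follows (`rw_covariance` is our own helper name, not from the note's code):

```python
import numpy as np

def rw_covariance(D, rho):
    """Random-walk covariance matrix with entries Sigma[i, j] = rho ** |i - j|."""
    idx = np.arange(D)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(12345)
D, N, rho = 20, 1024, 0.9
Sigma = rw_covariance(D, rho)                             # correlated predictors
x = rng.multivariate_normal(np.zeros(D), Sigma, size=N)   # N x D covariate matrix
x[:, 0] = 1.0                                             # first column acts as intercept
```

The matrix is symmetric with unit diagonal by construction, and adjacent predictors have correlation \(\rho\).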
3.1 Follow the data generating process
Following the true data generating process, outcomes are sampled from a binomial distribution after applying the inverse log odds function, \[
Y_n \sim \textrm{binomial}\!\left(\textrm{logit}^{-1}(x_n \cdot \beta)\right).
\]
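Sampling outcomes from the data generating process is a single Bernoulli draw per item; a minimal NumPy sketch with synthetic covariates and coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1024, 4
x = rng.normal(size=(N, D))             # stand-in covariates
beta = rng.normal(size=D)               # stand-in coefficients
prob = 1 / (1 + np.exp(-(x @ beta)))    # inverse log odds of linear predictor
y = rng.binomial(1, prob)               # one Bernoulli trial per item
```

The sampled outcomes are 0/1, and their empirical rate matches the average success probability up to sampling error.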
3.2 Choose the most probable outcome
A common approach in machine learning to deal with crowdsourcing with multiple data coders is to take a majority vote or to choose the most probable outcome under the probabilistic model. This corresponds to setting \[
Y_n =
\begin{cases}
1 & \textrm{if } \Pr[Y_n = 1 \mid X_n = x_n] > \frac{1}{2}, \textrm{
and}
\\[4pt]
0 & \textnormal{otherwise}.
\end{cases}
\] The name \(Y_n\) overloads the notation of the previous section; it is a different random variable used as an alternative to the previous definition. This \(Y_n\) does not follow the logistic regression data generating process, which leads to poor performance.
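In code, most-probable assignment is a thresholding one-liner; a sketch with hypothetical probabilities (a tie at 1/2 goes to 0, per the strict inequality above):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.85, 0.6])   # hypothetical Pr[Y_n = 1 | x_n]
y_map = (p > 0.5).astype(int)         # most probable category per item
```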
3.3 Outcome probabilities
A probabilistic corpus assigns each outcome a probability \[
p_n = \Pr[Y_n = 1 \mid X_n = x_n].
\] Direct probability estimates support weighted logistic regression training and also linear regression on the log odds (both of which are defined below).
4 Priors, penalties, and objectives
4.1 Bayesian prior
To complete the Bayesian model, which is a joint probability function \(p(y, \beta),\) take independent standard normal priors on the coefficients, for \(d \in 1{:}D,\) \[
\beta_d \sim \textrm{normal}(0, 1).
\] This prior matches the data generating process so that the full joint model is well specified for the data.
The full joint Bayesian probability function is defined by combining the prior and sampling distributions, \[\begin{align}
p(y, \beta)
&= p(y \mid \beta) \cdot p(\beta)
\\[4pt]
&= \prod_{n=1}^N \textrm{bernoulli}(y_n \mid \textrm{logit}^{-1}(x_n \cdot
\beta))
\cdot \prod_{d=1}^D \textrm{normal}(\beta_d \mid 0, 1).
\end{align}\]
Given the prior and sampling distribution, the posterior distribution is \[
p(\beta \mid y) \propto p(y \mid \beta) \cdot p(\beta).
\]
4.2 Frequentist penalty function
The Bayesian sampling distribution is used to define the frequentist log likelihood function, \[\begin{align}
\mathcal{L}(\beta)
&= \log \left( \prod_{n=1}^N \textrm{bernoulli}(y_n \mid \textrm{logit}^{-1}(x_n
\cdot \beta)) \right)
\\[4pt]
&= \sum_{n=1}^N \log \textrm{bernoulli}\!\left(y_n \mid \textrm{logit}^{-1}(x_n \cdot \beta)\right).
\end{align}\]
To mirror the Bayesian analysis, assume a quadratic penalty function, \[
\mathcal{P}(\beta) = \frac{1}{2} \beta^\top \beta.
\] This penalty is called an \(L^2\) penalty because it's based on half of the squared \(L^2\)-norm of \(\beta\) (i.e., half its squared Euclidean length). Using an \(L^2\) penalty function for penalized maximum likelihood estimation is known as ridge regression. The \(L^2\) penalty shrinks estimates toward zero by penalizing the components of \(\beta\) quadratically.
The objective function for frequentist estimation is known as the penalized maximum likelihood function and is defined by \[
\mathcal{F}(\beta) = \mathcal{L}(\beta) - \mathcal{P}(\beta).
\] Frequentist estimation proceeds by optimization over the objective.
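The penalized objective can be optimized with an off-the-shelf optimizer; here is a sketch using SciPy (which the note's evaluation code already imports), with synthetic data and our own function names:

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(beta, x, y):
    """Negative penalized log likelihood, -F(beta), for minimization."""
    z = x @ beta
    # log bernoulli(y | inv_logit(z)) = y * z - log(1 + exp(z)), stable form
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    penalty = 0.5 * beta @ beta                  # L2 (ridge) penalty
    return -(log_lik - penalty)

rng = np.random.default_rng(1)
N, D = 500, 4
x = rng.normal(size=(N, D))
beta_true = rng.normal(size=D)
y = rng.binomial(1, 1 / (1 + np.exp(-(x @ beta_true))))
beta_star = minimize(neg_objective, np.zeros(D), args=(x, y), method="BFGS").x
```

With a moderate amount of data, the penalized MLE lands close to the coefficients used to simulate the outcomes.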
4.3 Frequentist and Bayesian objectives are the same
The frequentist objective function is just the log posterior plus a constant, \[
\mathcal{F}(\beta) = \log p(\beta \mid y) + \textrm{const}.
\] The constant, which is not necessary for Bayesian inference using Markov chain Monte Carlo sampling, is \(\log p(y).\)
This relation does not imply that the Bayesian and frequentist estimates will be the same. Full Bayesian inference will average over uncertainty rather than optimize.
5 Estimation
For each observation \(n \in 1{:}N,\) assume a covariate (row) vector \(x_n \in \mathbb{R}^D\) (so that \(x \in \mathbb{R}^{N \times D}\) is an \(N \times D\) matrix).
5.1 Estimation from training data
The first two cases of simulation, sampling from the data generating process and assigning the most probable category, involve data \(y_n \in \{ 0, 1 \}\) for \(n \in 1{:}N.\)
Bayesian estimate from training data
Given a data set \((x, y),\) the Bayesian parameter estimate that minimizes expected square error is the posterior mean, \[\begin{align}
\widehat{\beta}
&= \mathbb{E}[\beta \mid x, y]
\\[4pt]
&= \int_{\mathbb{R}^D} \beta \cdot p(\beta \mid x, y) \, \textrm{d}\beta.
\end{align}\] Markov chain Monte Carlo methods produce an (approximately) identically distributed (but not independent) sample \[
\beta^{(1)}, \ldots, \beta^{(M)} \sim p(\beta \mid x, y),
\] and estimate the expectation by averaging the draws, setting \[
\widehat{\beta}
\approx
\frac{1}{M} \sum_{m=1}^M \beta^{(m)}.
\] As \(M \rightarrow \infty,\) the Monte Carlo approximation becomes exact. With \(M = 1{,}000\) and not too much correlation, there are usually around two digits of accuracy.
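A sketch of the averaging step, using synthetic stand-in draws (real draws would come from an MCMC fit):

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 1000, 3
# Stand-in for M posterior draws of beta; actual MCMC draws would be correlated.
draws = rng.normal(loc=[0.5, -1.0, 0.25], scale=0.1, size=(M, D))
beta_hat = draws.mean(axis=0)   # Monte Carlo estimate of the posterior mean
```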
Frequentist estimate from data
The frequentist estimate is the penalized maximum likelihood estimate, \[
\beta^* = \textrm{arg max}_\beta \ \mathcal{F}(\beta).
\] Because of the relation between the Bayesian and frequentist objectives for this problem, this estimate is also called the maximum a posteriori (MAP) estimate, because \[
\beta^* = \textrm{arg max}_\beta \ p(\beta \mid x, y).
\]
5.2 Probability-weighted estimation
In probability weighted training, the objective function needs to be modified to accept probabilities \(p_n \in (0, 1)\) for outcomes \(n \in 1{:}N.\)
Bayesian objective
The Bayesian objective is defined by weighting the outcomes probabilistically, \[
p(\beta \mid x, p)
\propto p(\beta)
\cdot \prod_{n=1}^N
\strut p(Y_n = 1 \mid x_n, \beta)^{p_n}
\cdot p(Y_n = 0 \mid x_n, \beta)^{1 - p_n},
\] which on the log scale is \[
\log p(\beta \mid x, p)
= \textrm{const}
+ \log p(\beta)
+ \sum_{n=1}^N \left( p_n \log p(Y_n = 1 \mid x_n, \beta)
+ (1 - p_n) \log p(Y_n = 0 \mid x_n, \beta) \right).
\]
This objective is known as the Rao-Blackwellized form of our original objective. The Rao-Blackwell theorem entails that working in expectation, as we are doing here, performs no worse than sampling in terms of estimation error; in practice, Rao-Blackwellization usually brings substantial gains, especially in the context of discrete sampling, as here.
For estimation, we use posterior means as before, swapping in this objective for the previous one.
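The probability-weighted log likelihood term shared by both objectives can be sketched in NumPy using stable log-sigmoid forms (the function name is ours); when each \(p_n\) is 0 or 1, it reduces to the ordinary Bernoulli log likelihood:

```python
import numpy as np

def weighted_log_lik(beta, x, p):
    """Sum over items of p_n * log Pr[Y_n = 1] + (1 - p_n) * log Pr[Y_n = 0]."""
    z = x @ beta
    log_p1 = -np.logaddexp(0.0, -z)   # log inv_logit(z), numerically stable
    log_p0 = -np.logaddexp(0.0, z)    # log (1 - inv_logit(z)), numerically stable
    return np.sum(p * log_p1 + (1 - p) * log_p0)
```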
Frequentist objective
The frequentist objective here mirrors the Bayesian objective, \[
\mathcal{F}(\beta)
= - \frac{1}{2}\beta^\top \beta
+ \sum_{n=1}^N \left( p_n \log p(Y_n = 1 \mid x_n, \beta)
+ (1 - p_n) \log p(Y_n = 0 \mid x_n, \beta) \right).
\]
This objective is optimized for estimation.
5.3 Regression on the log odds
The final estimation technique involves recognizing that when working in expectation, the log odds link function can be used to transform the probabilistic data to the log odds scale, at which point estimation can proceed via linear regression. Because the location and scale parameters in a linear regression are conditionally independent given the data, the scale parameter is fixed to unity.
Bayesian regression on log odds
The Bayesian objective function is defined by the data generating process \[
\textrm{logit}(p_n) \sim \textrm{normal}(x_n \cdot \beta, 1),
\] which leads to the objective function \[
p(\beta \mid x, p)
\propto \left( \prod_{d=1}^D \textrm{normal}(\beta_d \mid 0, 1) \right)
\cdot \prod_{n=1}^N \textrm{normal}(\textrm{logit}(p_n) \mid x_n \cdot \beta, 1).
\] On the log scale, that’s \[
\log p(\beta \mid x, p)
= \left( \sum_{d=1}^D \log \textrm{normal}(\beta_d \mid 0, 1) \right)
+ \sum_{n = 1}^N \log \textrm{normal}(\textrm{logit}(p_n) \mid x_n
\cdot \beta, 1)
+ \textrm{const}.
\]
For estimation, we use posterior means as before, swapping in this objective for the previous one.
Frequentist regression on log odds
The frequentist objective function is the same up to a constant, \[
\mathcal{F}(\beta)
= - \frac{1}{2} \beta^\top \beta
+ \sum_{n=1}^N \log \textrm{normal}(\textrm{logit}(p_n) \mid x_n \cdot \beta, 1).
\] This objective is optimized for estimation.
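Because this objective is quadratic in \(\beta\), its maximizer has a closed form, the ridge solution \(\beta^* = (x^\top x + \textrm{I})^{-1} x^\top \textrm{logit}(p)\); and since the posterior here is Gaussian, the mode coincides with the posterior mean. A NumPy sketch (function name ours, with synthetic data):

```python
import numpy as np

def ridge_log_odds(x, p):
    """Closed-form maximizer of the log-odds regression objective:
    beta* = (x'x + I)^{-1} x' logit(p)."""
    z = np.log(p / (1 - p))           # transform probabilities to log odds
    D = x.shape[1]
    return np.linalg.solve(x.T @ x + np.eye(D), x.T @ z)

rng = np.random.default_rng(2)
N, D = 200, 3
x = rng.normal(size=(N, D))
p = 1 / (1 + np.exp(-(x @ rng.normal(size=D))))
beta_star = ridge_log_odds(x, p)
```

At the solution, the gradient of the objective, \(-\beta + x^\top(\textrm{logit}(p) - x \beta)\), vanishes.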
6 Predictive inference
After fitting a logistic regression model, it can be used for prediction. Specifically, it can be used to transform a vector of covariates for a new item to a probabilistic prediction of its category.
6.1 Bayesian posterior predictive inference
Suppose the training data is a sequence of pairs \((x_n, y_n)\) where \(x_n \in \mathbb{R}^D\) and \(y_n \in \{ 0, 1 \}\) for \(n \in 1{:}N,\) and \(\beta \in \mathbb{R}^D.\) Now suppose there are new items with covariates \(\widetilde{x}_n\) for \(n \in 1{:}\widetilde{N}.\) For the new predictor matrix \(\widetilde{x} \in \mathbb{R}^{\widetilde{N} \times D}\) and array of outcomes \(\widetilde{y} \in \{ 0, 1 \}^{\widetilde{N}},\) the posterior predictive probability mass function for the outcomes is \[\begin{align*}
p(\widetilde{y} \mid \widetilde{x}, x, y)
&= \mathbb{E}\!\left[p(\widetilde{y} \mid \widetilde{x}, \beta) \,\big|\, x, y\right]
\\[4pt]
&= \int_{\mathbb{R}^D} p(\widetilde{y} \mid \widetilde{x}, \beta) \cdot p(\beta \mid
x, y) \ \textrm{d}\beta.
\end{align*}\]
With a sample of MCMC draws \[
\beta^{(1)}, \ldots, \beta^{(M)} \sim p(\beta \mid x, y)
\] for \(m \in 1{:}M,\) this quantity can be estimated as \[
p(\widetilde{y} \mid \widetilde{x}, x, y)
\approx
\frac{1}{M} \sum_{m = 1}^M p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right).
\]
Because of issues with floating-point numbers on computers, these calculations must be done on the log scale to guard against underflow in density functions, \[\begin{align*}
\log p(\widetilde{y} \mid \widetilde{x}, x, y)
&\approx
\log \frac{1}{M} \sum_{m = 1}^M p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right)
\\[4pt]
&= -\log M + \log \sum_{m = 1}^M p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right)
\\[4pt]
&= -\log M + \log \sum_{m = 1}^M \exp\!\left(\log \ p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right)\right)
\\[4pt]
&= -\log M + \textrm{logSumExp}_{m=1}^M \log \ p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right).
\end{align*}\]
The log-sum-of-exponentials function can be implemented in an arithmetically stable way for a vector \(v \in \mathbb{R}^M\) as \[\begin{align*}
\textrm{logSumExp}_{m=1}^M v_m
&= \log \sum_{m = 1}^M \exp(v_m)
\\
&= \max(v) + \log \sum_{m=1}^M \exp(v_m - \max(v)).
\end{align*}\] The stability follows from never applying \(\exp()\) to a positive number combined with pulling the leading digits out with \(\max(v).\)
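This stable form is a few lines of NumPy (the function name is ours; SciPy also ships `scipy.special.logsumexp`):

```python
import numpy as np

def log_sum_exp(v):
    """Stable log of the sum of exponentials: max(v) + log(sum(exp(v - max(v))))."""
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))
```

The naive formula overflows for inputs like 1000, whereas the stable form returns the exact answer \(1000 + \log 2\) for a pair of equal values.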
Typically, errors are reported in terms of the expected log predictive density (ELPD) per item, which for \(\widetilde{y} \in \{0, 1\}^{\widetilde{N}}\) and \(\widetilde{x} \in \mathbb{R}^{\widetilde{N} \times D},\) is \[
\textrm{ELPD} = \frac{\log p(\widetilde{y} \mid \widetilde{x}, x, y)}
{\widetilde{N}}.
\]
6.2 Frequentist plug-in inference
Standard practice in machine learning fits a penalized maximum likelihood estimate \[
\beta^*
= \textrm{arg max}_\beta \
\log p(y \mid x, \beta) - \mathcal{P}(\beta),
\] which is plugged in for prediction, \[
p(\widetilde{y} \mid \widetilde{x}, x, y)
\approx
p(\widetilde{y} \mid \widetilde{x}, \beta^*).
\]
7 Simulation-based evaluation
7.1 Stan models
There are three objective functions in play, based on whether the data under consideration is given as discrete outcomes \(y_n\) or probabilities \(p_n\), and in the case of probabilities, whether they are transformed to log odds for a linear regression.
Stan program for logistic regression
The Stan program for logistic regression implements the following log posterior up to an additive constant, \[
\log p(\beta \mid x, y)
= \log p(\beta)
+ \sum_{n=1}^N \log \textrm{bernoulli}(y_n \mid \textrm{logit}^{-1}(x_n \cdot \beta)).
\]
The Stan program for linear regression on the log odds implements the following log posterior up to an additive constant, \[
\log p(\beta \mid x, p)
= \log p(\beta)
+ \sum_{n = 1}^N
\log \textrm{normal}(\textrm{logit}(p_n) \mid x_n \cdot \beta, 1).
\]
log-odds-linear-regression.stan
data {
  int<lower=1> D;                  // num predictors
  int<lower=0> N;                  // num observations
  matrix[N, D] x;                  // predictors
  vector<lower=0, upper=1>[N] p;   // Pr[Y_n = 1 | x_n]
}
parameters {
  vector[D] beta;                  // regression coefficients
}
model {
  logit(p) ~ normal(x * beta, 1);  // likelihood: linear regression on log odds
  beta ~ normal(0, 1);             // prior: ridge
}
7.2 Evaluation of estimation
The evaluation is coded in Python using the CmdStanPy interface to Stan, along with NumPy, pandas, and plotnine. The first step is to import these libraries, configure the logger to report only errors, and set a random seed for NumPy.
import logging
import scipy as sp
import numpy as np
import pandas as pd
import plotnine as pn
import cmdstanpy as csp

pd.set_option('display.max_rows', None)         # show whole pandas data frames
csp.utils.get_logger().setLevel(logging.ERROR)  # only log errors
np.random.seed(12345)                           # change seed for fresh simulation
Next, set all the constants determining sizes for the simulation.
# D = 11, N = 500, rho = 0.9 good
D = 32          # number of predictors including intercept
N = 1024        # number of data points used to train
rho = 0.9       # correlation of predictor RW covariance
N_test = 1024   # number of test items
M = 32          # number of simulation runs
Then, allocate a pandas data frame to store the results and collect results over M iterations. Within each iteration, generate predictors and the various outcomes randomly. For each of these, calculate the penalized MLE and the Bayesian posterior mean and add them to the data frame.
Finally, print the results as a table with a custom reporter for the mean; the 0.1, 0.5 (median), and 0.9 quantiles; and the minimum and maximum, rounded to a single decimal place.
The simulations clearly show that training with logit-transformed probabilities is the most effective, followed by noisy logit-transformed probabilities and weighted training, followed by random sampling, and finally, far behind, taking the most probable category.
The penalized maximum likelihood estimator is better at handling the most probable category sampling, but is otherwise similar or slightly trails the Bayesian estimator.
7.3 Evaluation of predictive inference
For evaluation of both fully Bayesian and plug-in inference, the following Stan model suffices.
logistic-regression-predict.stan
data {
  int<lower=1> D;                    // num covariates per item
  int<lower=0> N;                    // num observations
  matrix[N, D] x;                    // test covariates
  array[N] int<lower=0, upper=1> y;  // outcomes
}
parameters {
  vector[D] beta;                    // parameters
}
generated quantities {
  real log_p = bernoulli_logit_lpmf(y | x * beta);  // likelihood
}
The data includes both the test covariates and the test outcomes. The parameters are the same as in the models for training. For predictive inference, the model block is replaced with a generated quantities block that assigns the log density of the data given the parameters to the variable log_p.
The evaluation follows the earlier evaluation, only we now measure predictive accuracy rather than estimation accuracy.
After building up the results, they can be summarized with a pandas one-liner. In what follows, higher ELPD is better: it means assigning higher probability to the observed outcomes.
8 Further reading
Logistic regression: Gelman, Hill, and Vehtari (2020) provide a detailed, practical introduction to regression modeling. For a more Bayesian perspective, see Gelman et al. (2013).
Probabilistic training: Smyth et al. (1994) is the earliest example of which I'm aware that jointly trains a classifier and a crowdsourcing model. Raykar et al. (2010) perform joint training in a logistic regression setting using Dawid and Skene's model.
Calibration and scoring metrics: Gneiting and Raftery (2007) and Gneiting, Balabdaoui, and Raftery (2007) provide excellent overviews of proper scoring metrics and calibrated prediction, respectively.
Rating models: Dawid and Skene (1979) first modeled crowdsourcing as noisy raters providing measurements conditioned on the true value. In a binary task, each rater's sensitivity and specificity are estimated, as is the prevalence of successful (1) outcomes, which allows a probabilistic assignment of a category to each item. It is straightforward to extend Dawid and Skene's model to a Bayesian hierarchical model, as shown by Paun et al. (2018) and in the “Latent discrete parameters” chapter of the Stan User's Guide. Passonneau and Carpenter (2014) provide a high-level motivation and analysis of rating versus weighted voting or inter-annotator agreement measures. There is a rich literature on similar models in epidemiology (diagnostic testing without a gold standard), educational testing (exam grading without an answer key), and anthropology (cultural consensus theory).
Stan: The Stan Reference Manual, by the Stan Development Team (2023a), describes the Stan programming language. The User’s Guide, by the Stan Development Team (2023b), describes how to use the Stan language to code statistical models, including both hierarchical logistic regression and the Dawid and Skene crowdsourcing model.
9 Conclusions
If you have a data set with probabilistic labels, the best thing to do is to train with those probabilities. In the case of logistic regression, you can go one step better by log odds transforming the probabilities and directly training a linear regression. If you are restricted to estimation (training) software that can only handle deterministic categories, then the best thing to do is to sample the categories randomly given the probabilities. Whatever you do, do not take the most probable category or use majority voting for crowdsourcing.
References
Dawid, Alexander Philip, and Allan M. Skene. 1979. “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1): 20–28.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. CRC Press.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.
Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E. Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society Series B: Statistical Methodology 69 (2): 243–68.
Gneiting, Tilmann, and Adrian E. Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477): 359–78.
Passonneau, Rebecca J., and Bob Carpenter. 2014. “The Benefits of a Model of Annotation.” Transactions of the Association for Computational Linguistics 2: 311–26.
Paun, Silviu, Bob Carpenter, Jon Chamberlain, Dirk Hovy, Udo Kruschwitz, and Massimo Poesio. 2018. “Comparing Bayesian Models of Annotation.” Transactions of the Association for Computational Linguistics 6: 571–85.
Raykar, Vikas C., Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. “Learning from Crowds.” Journal of Machine Learning Research 11 (4).
Smyth, Padhraic, Usama Fayyad, Michael Burl, Pietro Perona, and Pierre Baldi. 1994. “Inferring Ground Truth from Subjective Labelling of Venus Images.” Advances in Neural Information Processing Systems 7.