Training a classifier with a probabilistic data set:
Discrete and weighted training with Bayes and maximum likelihood
Author
Bob Carpenter
Published
October 16, 2023
1 Introduction
This short technical note evaluates several approaches to training logistic regression models when the probabilities of categorical outcomes are provided as data. The motivating example is the probability estimated by a model of data rating (also known as coding, annotation, labeling, and crowdsourcing), such as that of Dawid and Skene (1979). Three out of five dentists acting as raters might say an X-ray shows a cavity, and a data rating model taking into account those dentists' biases and accuracies might infer an 85% chance that the X-ray shows a cavity.
The first result is that to create a corpus with a discrete outcome per item, it is more effective to sample the category for each item given its probability distribution than to assign it the most probable category. Majority-vote approaches to crowdsourcing are instances of the suboptimal most-probable strategy.
The second result is that it is even better to use the category probabilities directly, training the classifier with all categories weighted by their probabilities. This result follows from the Rao-Blackwell theorem in the Bayesian setting.
The third result is that the best approach is to transform the probabilities to their log odds and then fit a linear regression. This result follows because logistic regression implies a log odds link and binomial sampling.
Evaluation is in terms of two proper scoring metrics: squared error in parameter estimation and held-out data log likelihood. Two widely used estimators are compared, a Bayesian posterior mean estimator and a frequentist penalized maximum likelihood estimator. Bayesian posterior predictive inference is also compared to inference based on plugging in a frequentist point estimate.
With simulated data, all three results hold for parameter estimation and predictive inference with both Bayesian and frequentist approaches. In situations where the data follows the model’s data generating process, the Bayesian approach modestly outperforms the frequentist approach. This last result is not surprising given that Bayesian estimates are designed to minimize expected square error.
2 Logistic regression
Logistic regression applies in the situation where there are \(N\) binary observations \(y_n \in \{ 0, 1 \}\) and each observation comes with a (row) vector \(x_n \in \mathbb{R}^D\) of covariates (also called features or predictors). Logistic regression is a generalized linear model, with a parameter vector \(\beta \in \mathbb{R}^D\) of regression coefficients (also called weights), the link function is log odds (also known as logit), and the family is binomial, so the sampling distribution is \[
Y_n \sim \textrm{binomial}(\textrm{logit}^{-1}(x_n \cdot \beta)),
\] where \(\textrm{logit}:(0, 1) \rightarrow \mathbb{R}\) and its inverse \(\textrm{logit}^{-1}:\mathbb{R} \rightarrow (0, 1)\) are defined by \[
\textrm{logit}(u) = \log \frac{u}{1 - u}
\qquad
\textrm{logit}^{-1}(v) = \frac{1}{1 + \exp(-v)}.
\] The definition of the binomial distribution implies \[
\Pr[Y_n = 1 \mid X_n = x_n] = \textrm{logit}^{-1}(x_n \cdot \beta).
\]
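In code, the link and its inverse are one-liners; here is a NumPy sketch (the function names are ours, not from the note's code):

```python
import numpy as np

def logit(u):
    """Log odds: maps probabilities in (0, 1) to the real line."""
    return np.log(u / (1 - u))

def inv_logit(v):
    """Inverse log odds (logistic sigmoid): maps reals to (0, 1)."""
    return 1 / (1 + np.exp(-v))
```

The two functions invert each other, so `inv_logit(logit(p))` recovers `p` up to floating-point error.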
A digression on “discriminative” models
In machine learning, logistic regression models are often called “discriminative” because they do not typically provide a data-generating process for the covariates. In a regression, the process through which the covariates are generated does not affect the coefficient estimates (see section 14.1, page 354, of Gelman et al. (2013)). With a model \(p(x \mid \phi)\) for the covariates, the joint model factors \[
p(x, y, \beta, \phi) = p(x \mid \phi) \cdot p(y \mid x, \beta),
\] and thus so does the posterior for the regression coefficients, \[
p(\beta \mid x, y) \propto p(\beta) \cdot p(y \mid x, \beta).
\] A model \(p(x \mid \phi)\) of the covariates, if available, can be used to handle missing data.
3 Data simulation
For simplicity, consider simulating the parameter vector \(\beta \in \mathbb{R}^D\) as standard normal, \[
\beta \sim \textrm{normal}(0, 1).
\]
Given a data size \(N\), generate a covariate matrix \(x \in \mathbb{R}^{N \times D}\) by taking \[
x_n \sim \textrm{multi-normal}(0, \Sigma),
\] where \(\Sigma \in \mathbb{R}^{D \times D}\) is a full rank covariance matrix (i.e., it is symmetric and positive definite). The first column will then be set to 1 to act as an intercept.
To introduce correlation among the predictors, let \(\Sigma\) be the unit scale random-walk covariance matrix defined by a correlation value \(\rho \in (-1, 1)\) by \[
\Sigma_{i, j} = \rho^{| i - j |}.
\] For example, with \(D = 20\) and \(\rho = 0.9\), the first row of the covariance matrix is \[
\Sigma_{1, 1:20} =
\begin{bmatrix}
1.00 & 0.90 & 0.81 & 0.73 & 0.66 & \cdots & 0.19 & 0.17 & 0.15 & 0.14
& 0.12
\end{bmatrix}.
\]
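The covariance construction and covariate simulation can be sketched in NumPy as follows (`rw_covariance` is our own helper name, not from the note's code):

```python
import numpy as np

def rw_covariance(D, rho):
    """Random-walk covariance matrix with entries Sigma[i, j] = rho ** |i - j|."""
    idx = np.arange(D)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(12345)
D, N, rho = 20, 1024, 0.9
Sigma = rw_covariance(D, rho)                             # correlated predictors
x = rng.multivariate_normal(np.zeros(D), Sigma, size=N)   # N x D covariate matrix
x[:, 0] = 1.0                                             # first column acts as intercept
```

The matrix is symmetric with unit diagonal by construction, and adjacent predictors have correlation \(\rho\).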
3.1 Follow the data generating process
Following the true data generating process, outcomes are sampled from a binomial distribution after applying the inverse log odds function, \[
Y_n \sim \textrm{binomial}\!\left(\textrm{logit}^{-1}(x_n \cdot \beta)\right).
\]
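Sampling outcomes from the data generating process is a single Bernoulli draw per item; a minimal NumPy sketch with synthetic covariates and coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1024, 4
x = rng.normal(size=(N, D))             # stand-in covariates
beta = rng.normal(size=D)               # stand-in coefficients
prob = 1 / (1 + np.exp(-(x @ beta)))    # inverse log odds of linear predictor
y = rng.binomial(1, prob)               # one Bernoulli trial per item
```

The sampled outcomes are 0/1, and their empirical rate matches the average success probability up to sampling error.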
3.2 Choose the most probable outcome
A common approach in machine learning to deal with crowdsourcing with multiple data coders is to take a majority vote or to choose the most probable outcome under the probabilistic model. This corresponds to setting \[
Y_n =
\begin{cases}
1 & \textrm{if } \Pr[Y_n = 1 \mid X_n = x_n] > \frac{1}{2}, \textrm{
and}
\\[4pt]
0 & \textnormal{otherwise}.
\end{cases}
\] The name \(Y_n\) overloads the notation of the previous section; it is a different random variable used as an alternative to the previous definition. This \(Y_n\) does not follow the logistic regression data generating process, which leads to poor performance.
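In code, most-probable assignment is a thresholding one-liner; a sketch with hypothetical probabilities (a tie at 1/2 goes to 0, per the strict inequality above):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.85, 0.6])   # hypothetical Pr[Y_n = 1 | x_n]
y_map = (p > 0.5).astype(int)         # most probable category per item
```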
3.3 Outcome probabilities
A probabilistic corpus assigns each outcome a probability \[
p_n = \Pr[Y_n = 1 \mid X_n = x_n].
\] Direct probability estimates support weighted logistic regression training and also linear regression on the log odds (both of which are defined below).
4 Priors, penalties, and objectives
4.1 Bayesian prior
To complete the Bayesian model, which is a joint probability function \(p(y, \beta),\) take independent standard normal priors on the coefficients, for \(d \in 1{:}D,\) \[
\beta_d \sim \textrm{normal}(0, 1).
\] This prior matches the data generating process so that the full joint model is well specified for the data.
The full joint Bayesian probability function is defined by combining the prior and sampling distributions, \[\begin{align}
p(y, \beta)
&= p(y \mid \beta) \cdot p(\beta)
\\[4pt]
&= \prod_{n=1}^N \textrm{bernoulli}(y_n \mid \textrm{logit}^{-1}(x_n \cdot
\beta))
\cdot \prod_{d=1}^D \textrm{normal}(\beta_d \mid 0, 1).
\end{align}\]
Given the prior and sampling distribution, the posterior distribution is \[
p(\beta \mid y) \propto p(y \mid \beta) \cdot p(\beta).
\]
4.2 Frequentist penalty function
The Bayesian sampling distribution is used to define the frequentist log likelihood function, \[\begin{align}
\mathcal{L}(\beta)
&= \log \left( \prod_{n=1}^N \textrm{bernoulli}(y_n \mid \textrm{logit}^{-1}(x_n
\cdot \beta)) \right)
\\[4pt]
&= \sum_{n=1}^N \log \textrm{bernoulli}\!\left(y_n \mid \textrm{logit}^{-1}(x_n \cdot \beta)\right).
\end{align}\]
To mirror the Bayesian analysis, assume a quadratic penalty function, \[
\mathcal{P}(\beta) = \frac{1}{2} \beta^\top \beta.
\] This penalty is called an \(L^2\) penalty because it's based on half of the squared \(L^2\)-norm of \(\beta\) (i.e., half its squared Euclidean length). Using an \(L^2\) penalty function for penalized maximum likelihood estimation is known as ridge regression. The \(L^2\) penalty shrinks estimates toward zero by penalizing the components of \(\beta\) quadratically.
The objective function for frequentist estimation is known as the penalized maximum likelihood function and is defined by \[
\mathcal{F}(\beta) = \mathcal{L}(\beta) - \mathcal{P}(\beta).
\] Frequentist estimation proceeds by optimization over the objective.
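The penalized objective can be optimized with an off-the-shelf optimizer; here is a sketch using SciPy (which the note's evaluation code already imports), with synthetic data and our own function names:

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(beta, x, y):
    """Negative penalized log likelihood, -F(beta), for minimization."""
    z = x @ beta
    # log bernoulli(y | inv_logit(z)) = y * z - log(1 + exp(z)), stable form
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    penalty = 0.5 * beta @ beta                  # L2 (ridge) penalty
    return -(log_lik - penalty)

rng = np.random.default_rng(1)
N, D = 500, 4
x = rng.normal(size=(N, D))
beta_true = rng.normal(size=D)
y = rng.binomial(1, 1 / (1 + np.exp(-(x @ beta_true))))
beta_star = minimize(neg_objective, np.zeros(D), args=(x, y), method="BFGS").x
```

With a moderate amount of data, the penalized MLE lands close to the coefficients used to simulate the outcomes.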
4.3 Frequentist and Bayesian objectives are the same
The frequentist objective function is just the log posterior plus a constant, \[
\mathcal{F}(\beta) = \log p(\beta \mid y) + \textrm{const}.
\] The constant, which is not necessary for Bayesian inference using Markov chain Monte Carlo sampling, is \(\log p(y).\)
This relation does not imply that the Bayesian and frequentist estimates will be the same. Full Bayesian inference will average over uncertainty rather than optimize.
5 Estimation
For each observation \(n \in 1{:}N,\) assume a covariate (row) vector \(x_n \in \mathbb{R}^D\) (so that \(x \in \mathbb{R}^{N \times D}\) is an \(N \times D\) matrix).
5.1 Estimation from training data
The first two cases of simulation, sampling from the data generating process and assigning the most probable category, involve data \(y_n \in \{ 0, 1 \}\) for \(n \in 1{:}N.\)
Bayesian estimate from training data
Given a data set \((x, y),\) the Bayesian parameter estimate that minimizes expected square error is the posterior mean, \[\begin{align}
\widehat{\beta}
&= \mathbb{E}[\beta \mid x, y]
\\[4pt]
&= \int_{\mathbb{R}^D} \beta \cdot p(\beta \mid x, y) \, \textrm{d}\beta.
\end{align}\] Markov chain Monte Carlo methods produce an (approximately) identically distributed (but not independent) sample \[
\beta^{(1)}, \ldots, \beta^{(M)} \sim p(\beta \mid x, y),
\] and estimate the expectation by averaging the draws, setting \[
\widehat{\beta}
\approx
\frac{1}{M} \sum_{m=1}^M \beta^{(m)}.
\] As \(M \rightarrow \infty,\) the Monte Carlo approximation becomes exact. With \(M = 1{,}000\) and not too much correlation, there are usually around two digits of accuracy.
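A sketch of the averaging step, using synthetic stand-in draws (real draws would come from an MCMC fit):

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 1000, 3
# Stand-in for M posterior draws of beta; actual MCMC draws would be correlated.
draws = rng.normal(loc=[0.5, -1.0, 0.25], scale=0.1, size=(M, D))
beta_hat = draws.mean(axis=0)   # Monte Carlo estimate of the posterior mean
```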
Frequentist estimate from data
The frequentist estimate is the penalized maximum likelihood estimate, \[
\beta^* = \textrm{arg max}_\beta \ \mathcal{F}(\beta).
\] Because of the relation between the Bayesian and frequentist objectives for this problem, this estimate is also called the maximum a posteriori (MAP) estimate, because \[
\beta^* = \textrm{arg max}_\beta \ p(\beta \mid x, y).
\]
5.2 Probability-weighted estimation
In probability weighted training, the objective function needs to be modified to accept probabilities \(p_n \in (0, 1)\) for outcomes \(n \in 1{:}N.\)
Bayesian objective
The Bayesian objective is defined by weighting the outcomes probabilistically, \[
p(\beta \mid x, p)
\propto p(\beta)
\cdot \prod_{n=1}^N
\strut p(Y_n = 1 \mid x_n, \beta)^{p_n}
\cdot p(Y_n = 0 \mid x_n, \beta)^{1 - p_n},
\] which on the log scale is \[
\log p(\beta \mid x, p)
= \textrm{const}
+ \log p(\beta)
+ \sum_{n=1}^N \left( p_n \log p(Y_n = 1 \mid x_n, \beta)
+ (1 - p_n) \log p(Y_n = 0 \mid x_n, \beta) \right).
\]
This objective is known as the Rao-Blackwellized form of our original objective. The Rao-Blackwell theorem entails that working in expectation, as we are doing here, performs no worse than sampling in terms of estimation error; in practice, Rao-Blackwellization usually brings substantial gains, especially in the context of discrete sampling, as here.
For estimation, we use posterior means as before, swapping in this objective for the previous one.
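The probability-weighted log likelihood term shared by both objectives can be sketched in NumPy using stable log-sigmoid forms (the function name is ours); when each \(p_n\) is 0 or 1, it reduces to the ordinary Bernoulli log likelihood:

```python
import numpy as np

def weighted_log_lik(beta, x, p):
    """Sum over items of p_n * log Pr[Y_n = 1] + (1 - p_n) * log Pr[Y_n = 0]."""
    z = x @ beta
    log_p1 = -np.logaddexp(0.0, -z)   # log inv_logit(z), numerically stable
    log_p0 = -np.logaddexp(0.0, z)    # log (1 - inv_logit(z)), numerically stable
    return np.sum(p * log_p1 + (1 - p) * log_p0)
```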
Frequentist objective
The frequentist objective here mirrors the Bayesian objective, \[
\mathcal{F}(\beta)
= - \frac{1}{2}\beta^\top \beta
+ \sum_{n=1}^N \left( p_n \log p(Y_n = 1 \mid x_n, \beta)
+ (1 - p_n) \log p(Y_n = 0 \mid x_n, \beta) \right).
\]
This objective is optimized for estimation.
5.3 Regression on the log odds
The final estimation technique involves recognizing that when working in expectation, the log odds link function can be used to transform the probabilistic data to the log odds scale, at which point estimation can proceed via linear regression. Because the location and scale parameters in a linear regression are conditionally independent given the data, the scale parameter is fixed to unity.
Bayesian regression on log odds
The Bayesian objective function is defined by the data generating process \[
\textrm{logit}(p_n) \sim \textrm{normal}(x_n \cdot \beta, 1),
\] which leads to the objective function \[
p(\beta \mid x, p)
\propto \left( \prod_{d=1}^D \textrm{normal}(\beta_d \mid 0, 1) \right)
\cdot \prod_{n=1}^N \textrm{normal}(\textrm{logit}(p_n) \mid x_n \cdot \beta, 1).
\] On the log scale, that’s \[
\log p(\beta \mid x, p)
= \left( \sum_{d=1}^D \log \textrm{normal}(\beta_d \mid 0, 1) \right)
+ \sum_{n = 1}^N \log \textrm{normal}(\textrm{logit}(p_n) \mid x_n
\cdot \beta, 1)
+ \textrm{const}.
\]
For estimation, we use posterior means as before, swapping in this objective for the previous one.
Frequentist regression on log odds
The frequentist objective function is the same up to a constant, \[
\mathcal{F}(\beta)
= - \frac{1}{2} \beta^\top \beta
+ \sum_{n=1}^N \log \textrm{normal}(\textrm{logit}(p_n) \mid x_n \cdot \beta, 1).
\] This objective is optimized for estimation.
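Because this objective is quadratic in \(\beta\), its maximizer has a closed form, the ridge solution \(\beta^* = (x^\top x + \textrm{I})^{-1} x^\top \textrm{logit}(p)\); and since the posterior here is Gaussian, the mode coincides with the posterior mean. A NumPy sketch (function name ours, with synthetic data):

```python
import numpy as np

def ridge_log_odds(x, p):
    """Closed-form maximizer of the log-odds regression objective:
    beta* = (x'x + I)^{-1} x' logit(p)."""
    z = np.log(p / (1 - p))           # transform probabilities to log odds
    D = x.shape[1]
    return np.linalg.solve(x.T @ x + np.eye(D), x.T @ z)

rng = np.random.default_rng(2)
N, D = 200, 3
x = rng.normal(size=(N, D))
p = 1 / (1 + np.exp(-(x @ rng.normal(size=D))))
beta_star = ridge_log_odds(x, p)
```

At the solution, the gradient of the objective, \(-\beta + x^\top(\textrm{logit}(p) - x \beta)\), vanishes.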
6 Predictive inference
After fitting a logistic regression model, it can be used for prediction. Specifically, it can be used to transform a vector of covariates for a new item to a probabilistic prediction of its category.
6.1 Bayesian posterior predictive inference
Suppose the training data is a sequence of pairs \((x_n, y_n)\) where \(x_n \in \mathbb{R}^D\) and \(y_n \in \{ 0, 1 \}\) for \(n \in 1{:}N,\) and \(\beta \in \mathbb{R}^D.\) Now suppose there are new items with covariates \(\widetilde{x}_n\) for \(n \in 1{:}\widetilde{N}.\) For the new predictor matrix \(\widetilde{x} \in \mathbb{R}^{\widetilde{N} \times D}\) and array of outcomes \(\widetilde{y} \in \{ 0, 1 \}^{\widetilde{N}},\) the posterior predictive probability mass function for the outcomes is \[\begin{align*}
p(\widetilde{y} \mid \widetilde{x}, x, y)
&= \mathbb{E}\!\left[p(\widetilde{y} \mid \widetilde{x}, \beta) \,\big|\, x, y\right]
\\[4pt]
&= \int_{\mathbb{R}^D} p(\widetilde{y} \mid \widetilde{x}, \beta) \cdot p(\beta \mid
x, y) \ \textrm{d}\beta.
\end{align*}\]
With a sample of MCMC draws \[
\beta^{(1)}, \ldots, \beta^{(M)} \sim p(\beta \mid x, y)
\] for \(m \in 1{:}M,\) this quantity can be estimated as \[
p(\widetilde{y} \mid \widetilde{x}, x, y)
\approx
\frac{1}{M} \sum_{m = 1}^M p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right).
\]
Because of issues with floating-point numbers on computers, these calculations must be done on the log scale to guard against underflow in density functions, \[\begin{align*}
\log p(\widetilde{y} \mid \widetilde{x}, x, y)
&\approx
\log \frac{1}{M} \sum_{m = 1}^M p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right)
\\[4pt]
&= -\log M + \log \sum_{m = 1}^M p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right)
\\[4pt]
&= -\log M + \log \sum_{m = 1}^M \exp\!\left(\log \ p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right)\right)
\\[4pt]
&= -\log M + \textrm{logSumExp}_{m=1}^M \log \ p\!\left(\widetilde{y} \,\big|\, \widetilde{x},
\beta^{(m)}\right).
\end{align*}\]
The log-sum-of-exponentials function can be implemented in an arithmetically stable way for a vector \(v \in \mathbb{R}^M\) as \[\begin{align*}
\textrm{logSumExp}_{m=1}^M v_m
&= \log \sum_{m = 1}^M \exp(v_m)
\\
&= \max(v) + \log \sum_{m=1}^M \exp(v_m - \max(v)).
\end{align*}\] The stability follows from never applying \(\exp()\) to a positive number combined with pulling the leading digits out with \(\max(v).\)
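This stable form is a few lines of NumPy (the function name is ours; SciPy also ships `scipy.special.logsumexp`):

```python
import numpy as np

def log_sum_exp(v):
    """Stable log of the sum of exponentials: max(v) + log(sum(exp(v - max(v))))."""
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))
```

The naive formula overflows for inputs like 1000, whereas the stable form returns the exact answer \(1000 + \log 2\) for a pair of equal values.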
Typically, errors are reported in terms of the expected log predictive density (ELPD) per item, which for \(\widetilde{y} \in \{0, 1\}^{\widetilde{N}}\) and \(\widetilde{x} \in \mathbb{R}^{\widetilde{N} \times D},\) is \[
\textrm{ELPD} = \frac{\log p(\widetilde{y} \mid \widetilde{x}, x, y)}
{\widetilde{N}}.
\]
6.2 Frequentist plug-in inference
Standard practice in machine learning fits a penalized maximum likelihood estimate \[
\beta^*
= \textrm{arg max}_\beta \
\log p(y \mid x, \beta) - \mathcal{P}(\beta),
\] which is plugged in for prediction, \[
p(\widetilde{y} \mid \widetilde{x}, x, y)
\approx
p(\widetilde{y} \mid \widetilde{x}, \beta^*).
\]
7 Simulation-based evaluation
7.1 Stan models
There are three objective functions in play, based on whether the data under consideration is given as discrete outcomes \(y_n\) or probabilities \(p_n\), and in the case of probabilities, whether they are transformed to log odds for a linear regression.
Stan program for logistic regression
The Stan program for logistic regression implements the following log posterior up to an additive constant, \[
\log p(\beta \mid x, y)
= \log p(\beta)
+ \sum_{n=1}^N \log \textrm{bernoulli}(y_n \mid \textrm{logit}^{-1}(x_n \cdot \beta)).
\]
The Stan program for linear regression on the log odds implements the following log posterior up to an additive constant, \[
\log p(\beta \mid x, p)
= \log p(\beta)
+ \sum_{n = 1}^N
\log \textrm{normal}(\textrm{logit}(p_n) \mid x_n \cdot \beta, 1).
\]
log-odds-linear-regression.stan
data {
  int<lower=1> D;                  // num predictors
  int<lower=0> N;                  // num observations
  matrix[N, D] x;                  // predictors
  vector<lower=0, upper=1>[N] p;   // Pr[Y_n = 1 | x_n]
}
parameters {
  vector[D] beta;                  // regression coefficients
}
model {
  logit(p) ~ normal(x * beta, 1);  // likelihood: linear regression on log odds
  beta ~ normal(0, 1);             // prior: ridge
}
7.2 Evaluation of estimation
The evaluation is coded in Python using the CmdStanPy interface to Stan, along with NumPy, pandas, and plotnine. The first step is to import these libraries, configure the logger to report only errors, and set a random seed for NumPy.
import logging
import scipy as sp
import numpy as np
import pandas as pd
import plotnine as pn
import cmdstanpy as csp

pd.set_option('display.max_rows', None)         # show whole pandas data frames
csp.utils.get_logger().setLevel(logging.ERROR)  # only log errors
np.random.seed(12345)                           # change seed for fresh simulation
Next, set all the constants determining sizes for the simulation.
# D = 11, N = 500, rho = 0.9 good
D = 32          # number of predictors including intercept
N = 1024        # number of data points used to train
rho = 0.9       # correlation of predictor RW covariance
N_test = 1024   # number of test items
M = 32          # number of simulation runs
Then, allocate a pandas data frame to store the results and collect results over M iterations. Within each iteration, generate predictors and the various outcomes randomly. For each of these, calculate the penalized MLE and the Bayesian posterior mean and add them to the data frame.
Finally, print the results as a table with a custom reporter for the mean; the 0.1, 0.5 (median), and 0.9 quantiles; and the minimum and maximum, rounded to a single decimal place.
The simulations clearly show that training with logit-transformed probabilities is the most effective, followed by noisy logit-transformed probabilities and weighted training, followed by random sampling, and finally, far behind, taking the most probable category.
The penalized maximum likelihood estimator is better at handling the most probable category sampling, but is otherwise similar or slightly trails the Bayesian estimator.
7.3 Evaluation of predictive inference
For evaluation of both fully Bayesian and plug-in inference, the following Stan model suffices.
logistic-regression-predict.stan
data {
  int<lower=1> D;                    // num covariates per item
  int<lower=0> N;                    // num observations
  matrix[N, D] x;                    // test covariates
  array[N] int<lower=0, upper=1> y;  // outcomes
}
parameters {
  vector[D] beta;                    // parameters
}
generated quantities {
  real log_p = bernoulli_logit_lpmf(y | x * beta);  // likelihood
}
The data includes both the test covariates and the test outcomes. The parameters are the same as in the models for training. For predictive inference, the model block is replaced with a generated quantities block that assigns the log density of the data given the parameters to the variable log_p.
The evaluation follows the earlier evaluation, only we now measure predictive accuracy rather than estimation accuracy.
After building up the results, they can be summarized with a pandas one-liner. In what follows, higher ELPD is better: it means assigning higher probability to the observed outcomes.
8 Further reading
Logistic regression: Gelman, Hill, and Vehtari (2020) provide a detailed, practical introduction to regression modeling. For a more Bayesian perspective, see Gelman et al. (2013).
Probabilistic training: Smyth et al. (1994) is the earliest example of which I'm aware that jointly trains a classifier and a crowdsourcing model. Raykar et al. (2010) perform joint training in a logistic regression setting using Dawid and Skene's model.
Calibration and scoring metrics: Gneiting and Raftery (2007) and Gneiting, Balabdaoui, and Raftery (2007) provide excellent overviews of proper scoring metrics and calibrated prediction, respectively.
Rating models: Dawid and Skene (1979) first modeled crowdsourcing as noisy raters providing measurements conditioned on the true value. In a binary task, each rater's sensitivity and specificity are estimated, as is the prevalence of successful (1) outcomes, which allows a probabilistic assignment of a category to each item. It is straightforward to extend Dawid and Skene's model to a Bayesian hierarchical model, as shown by Paun et al. (2018) and in the “Latent discrete parameters” chapter of the Stan User's Guide. Passonneau and Carpenter (2014) provide a high-level motivation and analysis of rating versus weighted voting or inter-annotator agreement measures. There is a rich literature on similar models in epidemiology (diagnostic testing without a gold standard), educational testing (exam grading without an answer key), and anthropology (cultural consensus theory).
Stan: The Stan Reference Manual, by the Stan Development Team (2023a), describes the Stan programming language. The User’s Guide, by the Stan Development Team (2023b), describes how to use the Stan language to code statistical models, including both hierarchical logistic regression and the Dawid and Skene crowdsourcing model.
9 Conclusions
If you have a data set with probabilistic labels, the best thing to do is to train with those probabilities. In the case of logistic regression, you can go one step better by log odds transforming the probabilities and directly training a linear regression. If you are restricted to estimation (training) software that can only handle deterministic categories, then the best thing to do is to sample the categories randomly given the probabilities. Whatever you do, do not take the most probable category or use majority voting for crowdsourcing.
References
Dawid, Alexander Philip, and Allan M. Skene. 1979. “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1): 20–28.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. CRC Press.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.
Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E. Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society Series B: Statistical Methodology 69 (2): 243–68.
Gneiting, Tilmann, and Adrian E. Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477): 359–78.
Passonneau, Rebecca J., and Bob Carpenter. 2014. “The Benefits of a Model of Annotation.” Transactions of the Association for Computational Linguistics 2: 311–26.
Paun, Silviu, Bob Carpenter, Jon Chamberlain, Dirk Hovy, Udo Kruschwitz, and Massimo Poesio. 2018. “Comparing Bayesian Models of Annotation.” Transactions of the Association for Computational Linguistics 6: 571–85.
Raykar, Vikas C., Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. “Learning from Crowds.” Journal of Machine Learning Research 11 (4).
Smyth, Padhraic, Usama Fayyad, Michael Burl, Pietro Perona, and Pierre Baldi. 1994. “Inferring Ground Truth from Subjective Labelling of Venus Images.” Advances in Neural Information Processing Systems 7.