Raykar et al. Learning from Crowds model

I’m a huge fan of Raykar et al.'s paper on learning from crowds. Although it’s aimed at crowdsourced data-labeling platforms for machine learning, like Amazon’s Mechanical Turk, the technique is useful in any kind of noisy rating (coding, annotation, diagnostic testing) situation. This general family of models is well studied in the epidemiology literature, where instead of raters rating items, it’s diagnostic tests being applied to subjects.

The citation is:

Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., & Moy, L. (2010). Learning from crowds. Journal of Machine Learning Research, 11(Apr), 1297–1322.

Raykar et al. basically just replace the simple multinomial prevalence model of Dawid and Skene with a logistic regression (they only consider the binary case, but it’s easy to extend to K categories or even ordinal ratings).

The paper spends a lot of time deriving an algorithm for estimation. It’s much less effort to just plug the model into Stan (though maybe it doesn’t look so impressive, because there’s otherwise not much algebra required).
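In case the marginalization is unfamiliar: the latent true label z is summed out of the likelihood with a log-sum-exp, one term per possible value of z. Here's my own sketch of that computation in plain numpy (function and variable names are mine, not from the paper), which is what the Stan program below computes:

```python
import numpy as np

def marginal_log_lik(w0, w, alpha, beta, x, y):
    """Marginal log likelihood with the latent true labels z summed out.

    For item n,
      Pr[y[n]] = Pr[z=1] * prod_r Pr[y[n,r] | z=1]
               + Pr[z=0] * prod_r Pr[y[n,r] | z=0],
    with Pr[y=1 | z=1] = alpha[r] (sensitivity) and
    Pr[y=1 | z=0] = 1 - beta[r] (one minus specificity),
    computed on the log scale for numerical stability.
    """
    logit_z = w0 + x @ w
    log_z = -np.log1p(np.exp(-logit_z))    # log inv_logit(logit_z)
    log1m_z = -np.log1p(np.exp(logit_z))   # log (1 - inv_logit(logit_z))
    pos = log_z + np.where(y == 1, np.log(alpha), np.log1p(-alpha)).sum(axis=1)
    neg = log1m_z + np.where(y == 1, np.log1p(-beta), np.log(beta)).sum(axis=1)
    return np.logaddexp(pos, neg).sum()    # stable log_sum_exp over z
```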

Here’s the Stan model code:

/**
 * Implementation of joint estimation of logistic regression and
 * annotator sensitivity and specificity.  This is just like the Dawid-Skene
 * model with the basic prevalence replaced with a logistic regression
 * based on predictors.
 *
 * Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C.,
 * Bogoni, L., & Moy, L.  (2010). Learning from crowds. Journal of
 * Machine Learning Research, 11(Apr), 1297-1322.
 *
 * TODO:  hierarchical priors for alpha, beta
 * TODO:  precompute counts for sum(y[ , r] == 1) and multiply
 *        in the N/R loops and the eltwise multiply and sum
 */
data {
  int<lower=0> N;                 // # items (reviews)
  int<lower=0> R;                 // # raters (heuristics)
  int<lower=0> D;                 // # of predictors for each item
  matrix[N, D] x;                 // [n, d] predictors d for item n
  int<lower=0, upper=1> y[N, R];  // outcome: 1 success, 0 failure
}
parameters {
  vector[D] w;                         // logistic regression coeffs
  real w0;                             // intercept
  vector<lower=0, upper=1>[R] alpha;   // sensitivity
  vector<lower=0, upper=1>[R] beta;    // specificity
}
model {
  vector[N] logit_z_hat = w0 + x * w;
  vector[N] log_z_hat = log_inv_logit(logit_z_hat);
  vector[N] log1m_z_hat = log1m_inv_logit(logit_z_hat);

  vector[R] log_alpha = log(alpha);
  vector[R] log1m_alpha = log1m(alpha);
  vector[R] log_beta = log(beta);
  vector[R] log1m_beta = log1m(beta);

  // prior
  w ~ normal(0, 2);
  w0 ~ normal(0, 5);
  alpha ~ beta(10, 1);
  beta ~ beta(10, 1);

  // likelihood
  for (n in 1:N) {
    real pos_sum = log_z_hat[n];
    real neg_sum = log1m_z_hat[n];
    for (r in 1:R) {
      if (y[n, r] == 1) {
        pos_sum += log_alpha[r];
        neg_sum += log1m_beta[r];
      } else {
        pos_sum += log1m_alpha[r];
        neg_sum += log_beta[r];
      }
    }
    target += log_sum_exp(pos_sum, neg_sum);
  }
}
generated quantities {
  vector[N] Pr_z_eq_1;
  {
    vector[N] logit_z_hat = w0 + x * w;
    vector[N] log_z_hat = log_inv_logit(logit_z_hat);
    vector[N] log1m_z_hat = log1m_inv_logit(logit_z_hat);

    vector[R] log_alpha = log(alpha);
    vector[R] log1m_alpha = log1m(alpha);
    vector[R] log_beta = log(beta);
    vector[R] log1m_beta = log1m(beta);

    for (n in 1:N) {
      real pos_sum = log_z_hat[n];
      real neg_sum = log1m_z_hat[n];
      for (r in 1:R) {
        if (y[n, r] == 1) {
          pos_sum += log_alpha[r];
          neg_sum += log1m_beta[r];
        } else {
          pos_sum += log1m_alpha[r];
          neg_sum += log_beta[r];
        }
      }
      Pr_z_eq_1[n] = softmax([pos_sum, neg_sum]')[1];
    }
  }
}

I noted some performance improvements that could be made in the TODOs in the code comments. The model could also be easily extended to incomplete panel data (when not every rater rates every item) and to hierarchical priors. @Marko’s example Dawid-Skene annotation model with partial pooling of annotators covers both hierarchical priors and incomplete data.
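One standard way to handle the incomplete case, sketched here in numpy rather than Stan (the long-format layout and names are my own, not from the post): store the ratings in long form as parallel arrays of (item, rater, rating) triples, and accumulate each item's two running log sums over only the ratings it actually received.

```python
import numpy as np

def marginal_log_lik_long(w0, w, alpha, beta, x, item, rater, rating):
    """Marginal log likelihood for long-format (incomplete) ratings.

    item, rater, and rating are parallel arrays: rating[k] is rater[k]'s
    label for item[k]. Raters may rate any subset of items; an item with
    no ratings still contributes its regression (prevalence) term.
    """
    logit_z = w0 + x @ w
    pos = -np.log1p(np.exp(-logit_z))   # log Pr[z[n] = 1], per item
    neg = -np.log1p(np.exp(logit_z))    # log Pr[z[n] = 0], per item
    for k in range(len(rating)):
        n, r = item[k], rater[k]
        if rating[k] == 1:
            pos[n] += np.log(alpha[r])
            neg[n] += np.log1p(-beta[r])
        else:
            pos[n] += np.log1p(-alpha[r])
            neg[n] += np.log(beta[r])
    return np.logaddexp(pos, neg).sum()
```

The same long format translates directly to Stan: pass the three parallel arrays in the data block and loop over the ratings, indexing into per-item accumulators before the final log-sum-exp.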

And here’s some R code to simulate data:

inv_logit <- function(u) 1 / (1 + exp(-u))

N <- 500
R <- 10
D <- 20

x <- matrix(rnorm(N * D), N, D)
w <- rnorm(D)
w0 <- -2

alpha <- runif(R, 0.65, 0.95)
beta <- runif(R, 0.65, 0.95)

z <- rep(0, N);
for (n in 1:N)
  z[n] <- rbinom(1, 1, inv_logit(w0 + (x[n, ] %*% w)))

y <- matrix(0, N, R);
for (n in 1:N)
  for (r in 1:R)
    y[n, r] <- rbinom(1, 1, ifelse(z[n], alpha[r], 1 - beta[r]))


init_fun <- function() {
  return(list(alpha=rep(0.75,R), beta=rep(0.75,R), w=rep(0,D), w0=0));
}

and some R code to fit:

library('rstan')
fit <- stan("raykar-marginal.stan",
            data = c("N", "R", "D", "x", "y"),
            init = init_fun)

I checked that it recovered parameters accurately. Like the Dawid and Skene example, it starts with an initialization that assumes the annotators are not antagonistic, and it zeroes the slopes and intercept for the regression in the init. This should help avoid, if not totally prevent, the multimodality in which one mode treats the annotators as consistently returning the wrong answers (with the regression coefficients flipped to match).
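The mirror mode is an exact symmetry of the likelihood: negating the regression coefficients while replacing (sensitivity, specificity) with (1 − specificity, 1 − sensitivity) leaves the marginal likelihood unchanged, so only the priors and the initialization break the tie. A quick numerical check (my own numpy reimplementation of the marginal log likelihood, not code from the post):

```python
import numpy as np

rng = np.random.default_rng(42)
N, R, D = 50, 4, 3
x = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=(N, R))       # arbitrary 0/1 ratings
w, w0 = rng.normal(size=D), -0.5
alpha = rng.uniform(0.65, 0.95, R)        # sensitivities
beta = rng.uniform(0.65, 0.95, R)         # specificities

def mll(w0, w, alpha, beta):
    # marginal log likelihood with the latent z summed out per item
    logit_z = w0 + x @ w
    pos = -np.log1p(np.exp(-logit_z)) \
        + np.where(y == 1, np.log(alpha), np.log1p(-alpha)).sum(axis=1)
    neg = -np.log1p(np.exp(logit_z)) \
        + np.where(y == 1, np.log1p(-beta), np.log(beta)).sum(axis=1)
    return np.logaddexp(pos, neg).sum()

# the label-switched mode has identical marginal likelihood
same = np.isclose(mll(w0, w, alpha, beta),
                  mll(-w0, -w, 1 - beta, 1 - alpha))
```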

@Bob_Carpenter I tried this model with simulated data and varying performance measures (sensitivity and specificity) and it works really well. Thank you.
At the moment the model assumes that each rater scores every item, since the outcomes are declared as y[N, R].
Has anyone implemented this in a way that allows a different number of raters per item?
Thanks in advance