Raykar et al. Learning from Crowds model

I’m a huge fan of Raykar et al.'s paper on learning from crowds. Although it’s aimed at crowdsourced data-labeling platforms for machine learning, like Amazon’s Mechanical Turk, the technique is useful in any kind of noisy rating (coding, annotation, diagnostic testing) situation. This general family of models is well studied in the epidemiology literature, where instead of raters rating items, it’s diagnostic tests being applied to subjects.

The citation is:

Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., & Moy, L. (2010). Learning from crowds. Journal of Machine Learning Research, 11(Apr), 1297–1322.

Raykar et al. basically just replace the simple multinomial prevalence model of Dawid and Skene with a logistic regression (they only consider the binary case, but it’s easy to extend to K categories or even ordinal ratings).

The paper spends a lot of time deriving an algorithm for estimation. It’s much less effort to just plug the model into Stan (though maybe it doesn’t look so impressive, because there’s otherwise not much algebra required).
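In case the marginalization is unfamiliar: the latent true label z is summed out of the likelihood with a log-sum-exp, one term per possible value of z. Here's my own sketch of that computation in plain numpy (function and variable names are mine, not from the paper), which is what the Stan program below computes:

```python
import numpy as np

def marginal_log_lik(w0, w, alpha, beta, x, y):
    """Marginal log likelihood with the latent true labels z summed out.

    For item n,
      Pr[y[n]] = Pr[z=1] * prod_r Pr[y[n,r] | z=1]
               + Pr[z=0] * prod_r Pr[y[n,r] | z=0],
    with Pr[y=1 | z=1] = alpha[r] (sensitivity) and
    Pr[y=1 | z=0] = 1 - beta[r] (one minus specificity),
    computed on the log scale for numerical stability.
    """
    logit_z = w0 + x @ w
    log_z = -np.log1p(np.exp(-logit_z))    # log inv_logit(logit_z)
    log1m_z = -np.log1p(np.exp(logit_z))   # log (1 - inv_logit(logit_z))
    pos = log_z + np.where(y == 1, np.log(alpha), np.log1p(-alpha)).sum(axis=1)
    neg = log1m_z + np.where(y == 1, np.log1p(-beta), np.log(beta)).sum(axis=1)
    return np.logaddexp(pos, neg).sum()    # stable log_sum_exp over z
```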

Here’s the Stan model code:

/**
 * Implementation of joint estimation of logistic regression and
 * annotator sensitivity and specificity.  This is just like the Dawid-Skene
 * model with the basic prevalence replaced with a logistic regression
 * based on predictors.
 *
 * Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C.,
 * Bogoni, L., & Moy, L.  (2010). Learning from crowds. Journal of
 * Machine Learning Research, 11(Apr), 1297-1322.
 *
 * TODO:  hierarchical priors for alpha, beta
 * TODO:  precompute counts for sum(y[ , r] == 1) and multiply
 *        in the N/R loops and the eltwise multiply and sum
 */
data {
  int<lower=0> N;                 // # items (reviews)
  int<lower=0> R;                 // # raters (heuristics)
  int<lower=0> D;                 // # of predictors for each item
  matrix[N, D] x;                 // [n, d] predictors d for item n
  int<lower=0, upper=1> y[N, R];  // outcome: 1 success, 0 failure
}
parameters {
  vector[D] w;                         // logistic regression coeffs
  real w0;                             // intercept
  vector<lower=0, upper=1>[R] alpha;   // sensitivity
  vector<lower=0, upper=1>[R] beta;    // specificity
}
model {
  vector[N] logit_z_hat = w0 + x * w;
  vector[N] log_z_hat = log_inv_logit(logit_z_hat);
  vector[N] log1m_z_hat = log1m_inv_logit(logit_z_hat);

  vector[R] log_alpha = log(alpha);
  vector[R] log1m_alpha = log1m(alpha);
  vector[R] log_beta = log(beta);
  vector[R] log1m_beta = log1m(beta);

  // prior
  w ~ normal(0, 2);
  w0 ~ normal(0, 5);
  alpha ~ beta(10, 1);
  beta ~ beta(10, 1);

  // likelihood
  for (n in 1:N) {
    real pos_sum = log_z_hat[n];
    real neg_sum = log1m_z_hat[n];
    for (r in 1:R) {
      if (y[n, r] == 1) {
        pos_sum += log_alpha[r];
        neg_sum += log1m_beta[r];
      } else {
        pos_sum += log1m_alpha[r];
        neg_sum += log_beta[r];
      }
    }
    target += log_sum_exp(pos_sum, neg_sum);
  }
}
generated quantities {
  vector[N] Pr_z_eq_1;
  {
    vector[N] logit_z_hat = w0 + x * w;
    vector[N] log_z_hat = log_inv_logit(logit_z_hat);
    vector[N] log1m_z_hat = log1m_inv_logit(logit_z_hat);

    vector[R] log_alpha = log(alpha);
    vector[R] log1m_alpha = log1m(alpha);
    vector[R] log_beta = log(beta);
    vector[R] log1m_beta = log1m(beta);

    for (n in 1:N) {
      real pos_sum = log_z_hat[n];
      real neg_sum = log1m_z_hat[n];
      for (r in 1:R) {
        if (y[n, r] == 1) {
          pos_sum += log_alpha[r];
          neg_sum += log1m_beta[r];
        } else {
          pos_sum += log1m_alpha[r];
          neg_sum += log_beta[r];
        }
      }
      Pr_z_eq_1[n] = softmax([pos_sum, neg_sum]')[1];
    }
  }
}

I noted some performance improvements that could be made in the TODOs in the code comments. The model could also be easily extended to incomplete panel data (when not every rater rates every item) and to hierarchical priors. @Marko’s example Dawid-Skene annotation model with partial pooling of annotators covers both hierarchical priors and incomplete data.
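One standard way to handle the incomplete case, sketched here in numpy rather than Stan (the long-format layout and names are my own, not from the post): store the ratings in long form as parallel arrays of (item, rater, rating) triples, and accumulate each item's two running log sums over only the ratings it actually received.

```python
import numpy as np

def marginal_log_lik_long(w0, w, alpha, beta, x, item, rater, rating):
    """Marginal log likelihood for long-format (incomplete) ratings.

    item, rater, and rating are parallel arrays: rating[k] is rater[k]'s
    label for item[k]. Raters may rate any subset of items; an item with
    no ratings still contributes its regression (prevalence) term.
    """
    logit_z = w0 + x @ w
    pos = -np.log1p(np.exp(-logit_z))   # log Pr[z[n] = 1], per item
    neg = -np.log1p(np.exp(logit_z))    # log Pr[z[n] = 0], per item
    for k in range(len(rating)):
        n, r = item[k], rater[k]
        if rating[k] == 1:
            pos[n] += np.log(alpha[r])
            neg[n] += np.log1p(-beta[r])
        else:
            pos[n] += np.log1p(-alpha[r])
            neg[n] += np.log(beta[r])
    return np.logaddexp(pos, neg).sum()
```

The same long format translates directly to Stan: pass the three parallel arrays in the data block and loop over the ratings, indexing into per-item accumulators before the final log-sum-exp.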

And here’s some R code to simulate data:

inv_logit <- function(u) 1 / (1 + exp(-u))

N <- 500
R <- 10
D <- 20

x <- matrix(rnorm(N * D), N, D)
w <- rnorm(D)
w0 <- -2

alpha <- runif(R, 0.65, 0.95)
beta <- runif(R, 0.65, 0.95)

z <- rep(0, N);
for (n in 1:N)
  z[n] <- rbinom(1, 1, inv_logit(w0 + (x[n, ] %*% w)))

y <- matrix(0, N, R);
for (n in 1:N)
  for (r in 1:R)
    y[n, r] <- rbinom(1, 1, ifelse(z[n], alpha[r], 1 - beta[r]))


init_fun <- function() {
  return(list(alpha=rep(0.75,R), beta=rep(0.75,R), w=rep(0,D), w0=0));
}

and some R code to fit:

library('rstan')
fit <- stan("raykar-marginal.stan",
            data = c("N", "R", "D", "x", "y"),
            init = init_fun)

I checked that it recovered parameters accurately. Like the Dawid and Skene example, it starts with an initialization that assumes the annotators are not antagonistic, and it zeroes the slopes and intercept for the regression in the init. This should help avoid, if not totally prevent, the multimodality in which one mode treats the annotators as consistently returning the wrong answers (with the regression coefficients flipped to match).
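The mirror mode is an exact symmetry of the likelihood: negating the regression coefficients while replacing (sensitivity, specificity) with (1 − specificity, 1 − sensitivity) leaves the marginal likelihood unchanged, so only the priors and the initialization break the tie. A quick numerical check (my own numpy reimplementation of the marginal log likelihood, not code from the post):

```python
import numpy as np

rng = np.random.default_rng(42)
N, R, D = 50, 4, 3
x = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=(N, R))       # arbitrary 0/1 ratings
w, w0 = rng.normal(size=D), -0.5
alpha = rng.uniform(0.65, 0.95, R)        # sensitivities
beta = rng.uniform(0.65, 0.95, R)         # specificities

def mll(w0, w, alpha, beta):
    # marginal log likelihood with the latent z summed out per item
    logit_z = w0 + x @ w
    pos = -np.log1p(np.exp(-logit_z)) \
        + np.where(y == 1, np.log(alpha), np.log1p(-alpha)).sum(axis=1)
    neg = -np.log1p(np.exp(logit_z)) \
        + np.where(y == 1, np.log1p(-beta), np.log(beta)).sum(axis=1)
    return np.logaddexp(pos, neg).sum()

# the label-switched mode has identical marginal likelihood
same = np.isclose(mll(w0, w, alpha, beta),
                  mll(-w0, -w, 1 - beta, 1 - alpha))
```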

@Bob_Carpenter I tried this model with simulated data and varying performance measures (sensitivity and specificity) and it works really well. Thank you.
At the moment the model assumes that each rater scores every item, since the outcomes are declared as y[N, R].
Has anyone implemented this in a way that allows a different number of raters per item?
Thanks in advance