Working Dawid-Skene example with simulation

This is another example of a latent discrete parameter model in which the discrete parameters are marginalized out and the marginalized model is fit with Stan. In the generated quantities, the posterior for the latent discrete parameter (here, the true category of each item) is calculated to much higher accuracy than MCMC sampling with a discrete parameter would provide.
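
Concretely, writing $z_i$ for the latent true category of item $i$, the model block below sums $z_i$ out of the likelihood, and the generated quantities renormalize the same terms to recover the posterior over $z_i$:

$$
p(y_i \mid \pi, \theta) = \sum_{k=1}^K \pi_k \prod_{j=1}^J \theta_{j,k,y_{ij}},
\qquad
\Pr[z_i = k \mid y_i, \pi, \theta]
  = \frac{\pi_k \prod_{j=1}^J \theta_{j,k,y_{ij}}}{\sum_{k'=1}^K \pi_{k'} \prod_{j=1}^J \theta_{j,k',y_{ij}}}.
$$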

Be careful, though: this model is based on what is usually a false independence assumption, namely that each rater's label for an item is chosen independently. In reality, some items are harder to categorize than others, and raters tend to make correlated mistakes. As a result, posteriors will be sharper than they should be, which will show up in real-world calibration tests.

Stan program

This should go in file dawid-skene.stan:

/*
 * Implementation of Dawid and Skene's model for measurement error
 * among categorical raters.  The model is extended with fixed
 * Bayesian priors.  The discrete parameters for the category of each
 * item are marginalized out, with their posterior probability
 * returned on the log scale in the generated quantity log_Pr_z.
 *
 * Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation
 * of observer error-rates using the EM algorithm. Applied Statistics,
 * 28(1), 20--28.
 */
data {
  int<lower=2> K;               // number of categories
  int<lower=1> I;               // number of items
  int<lower=1> J;               // number of coders
  int<lower=1,upper=K> y[I,J];  // label from coder j for item i
  vector<lower=0>[K] alpha;     // prior for prevalence (positive)
  vector<lower=0>[K] beta[K];   // prior for coder responses (positive)
}
parameters {
  simplex[K] pi;                // prevalence of categories
  simplex[K] theta[J,K];        // response of annotator j to category k
}
model {
  // log prior:  log p(theta, pi)
  pi ~ dirichlet(alpha);
  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(beta[k]);

  // log marginal likelihood: log p(y | theta, pi)
  for (i in 1:I) {
    vector[K] log_q = log(pi);
    for (j in 1:J)
      log_q += to_vector(log(theta[j, :, y[i, j]]));
    target += log_sum_exp(log_q);
  }
}
generated quantities {
  // normalized log posterior: log Pr[z | y]
  vector[K] log_Pr_z[I];
  for (i in 1:I) {
    vector[K] log_q = log(pi);
    for (j in 1:J)
      log_q += to_vector(log(theta[j, :, y[i, j]]));
    log_Pr_z[i] = log_q - log_sum_exp(log_q);
  }
}

R simulation and fit code

Note that a partial initialization for the prevalence and annotator response parameters is provided. Without it, the model will label switch among solutions in which the annotators are highly accurate and ones in which they are highly inaccurate. There's a discussion in the problematic posteriors chapter of the user's guide.

After the fit, the script prints the posterior mean of the annotator response parameter theta against the true value. I'm too lazy to figure out a way to plot the results. You should probably run multiple chains in parallel, as the message printed on loading rstan suggests, but I just set it up with a vanilla call.

library(rstan);
library(MCMCpack);

rcat <- function(n, theta)
          sample(length(theta), n, replace = TRUE, prob = theta);
K <- 3;
I <- 500;
J <- 5;
y <- array(0, c(I, J));
alpha <- rep(3, K);       # modest regularization
beta <- matrix(1, K, K);
for (k in 1:K)
  beta[k, k] <- 2.5 * K;  # 75%-ish accuracy
pi <- rdirichlet(1, alpha)
theta <- array(0, c(J, K, K))
for (j in 1:J)
  for (k in 1:K)
    theta[j, k, ] <- rdirichlet(1, beta[k, ]);
z <- rcat(I, pi);
for (i in 1:I)
  for (j in 1:J)
    y[i, j] <- rcat(1, theta[j, z[i], ]);

theta_init <- array(.2 / (K - 1), c(J, K, K))
for (j in 1:J)
  for (k in 1:K)
      theta_init[j, k, k] <- 0.8
pi_init <- rep(1 / K, K)

fit <- stan("dawid-skene.stan",
            data = c("K", "I", "J", "y", "alpha", "beta"),
            init = function(n) list(theta = theta_init, pi = pi_init),
            chains = 4, iter = 2000);

fit_ss <- extract(fit);
for (j in 1:J)
  for (k_ref in 1:K)
    for (k_resp in 1:K)
      print(sprintf("(%d, %d, %d) theta=%5.3f hat-theta=%5.3f",
                  j, k_ref, k_resp, theta[j, k_ref, k_resp],
                  mean(fit_ss$theta[, j, k_ref, k_resp])), quote=FALSE);
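
Not in the script above, but as a quick sanity check you can also pull out log_Pr_z, average the implied probabilities over draws, and compare the modal category for each item to the simulated z (all variable names are the ones defined above):

Pr_z_hat <- apply(exp(fit_ss$log_Pr_z), c(2, 3), mean);  # I x K posterior Pr[z]
z_hat <- apply(Pr_z_hat, 1, which.max);                  # modal category per item
print(sprintf("modal-category accuracy vs. simulated z: %5.3f",
              mean(z_hat == z)), quote=FALSE);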

People asked me about licensing, so I’m releasing this code like the rest of Stan under the BSD 3-clause license with copyright holder being Columbia University in the City of New York.

Hi Bob,
Thanks a lot for this new implementation. Does it have specific advantages over the one in the manual?

If anyone is interested in another implementation, here is one I built with the help of @Bob_Carpenter in another Discourse thread, based on the example in the manual. Some features:

  • Gold labels for some or all units possible
  • Unbalanced data (“long” format): not every unit has to be coded by each coder
  • Independent samples with different category prevalences
  • Imputed labels in generated quantities

The model requires initialization similar to Bob’s model above; a hypothetical init sketch follows the model code below.

Any feedback is of course welcome.
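
For anyone trying it out, here is a hypothetical minimal data list in the long format the data block below expects (toy values I made up: two coders, three units, one sample, a gold label only for unit 1):

J <- 2; K <- 3; I <- 3; S <- 1;
beta <- array(1, c(J, K, K));   # J x K x K Dirichlet priors on coder responses
for (j in 1:J)
  for (k in 1:K)
    beta[j, k, k] <- 2.5 * K;
alpha <- matrix(3, S, K);       # S x K Dirichlet priors on category prevalence
stan_data <- list(
  K = K, J = J, I = I, N = 5, S = S,
  y = c(1, 1, 2, 3, 2),         # one entry per coding ("long" format)
  ii = c(1, 1, 2, 2, 3),        # unit of each coding; unit 3 is coded only once
  jj = c(1, 2, 1, 2, 1),        # coder of each coding
  ss = c(1, 1, 1, 1, 1),        # sample of each coding
  y_true = c(1, 1, -1, -1, -1), # gold coding where available, else -1
  alpha = alpha, beta = beta);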

data {
  int<lower=2> K; // number of categories
  int<lower=1> J; // number of coders
  vector<lower=0>[K] beta[J, K]; // Dirichlet prior on coder j's response distribution given true category k
  int<lower=1> I; // number of coding units
  int<lower=1> N; // number of codings
  int<lower=1> S; // number of independent samples
  int<lower=1,upper=K> y[N]; // codings
  int<lower=1,upper=I> ii[N]; // index coding units in data
  int<lower=1,upper=J> jj[N]; // index coder in data
  int<lower=1,upper=S> ss[N]; // index sample in data
  int<lower=-1,upper=K> y_true[N]; // gold codings per coding; -1 if no gold coding available
  // int<lower=-1,upper=K> y_true_I[I]; // gold codings per unit
  // int<lower=1,upper=S> sss[I]; // index sample to coding units
  vector<lower=0>[K] alpha[S]; // dirichlet prior on category prevalence in studies
}
transformed data {
  // super awkward way of aggregating gold codings to unit level
  // at least it works and renders another data input unnecessary...
  int<lower=1,upper=S> sss[I]; // index sample to coding units
  int<lower=-1,upper=K> y_true_I[I]; // gold codings per unit
  int<lower=-1,upper=K> any_gold;
  matrix<lower=0>[I, S] ss_table = rep_matrix(0, I, S);
  matrix<lower=0>[I, K] yt_table = rep_matrix(0, I, K);

  for (n in 1:N)
    ss_table[ii[n], ss[n]] += 1;
  for (i in 1:I)
    sss[i] = sort_indices_desc(ss_table[i])[1];

  for (n in 1:N)
    if(y_true[n] != -1)
      yt_table[ii[n], y_true[n]] += 1;
  for (i in 1:I) {
    y_true_I[i] = sum(yt_table[i]) == 0 ? -1 : sort_indices_desc(yt_table[i])[1];
    for (k in 1:K)
      yt_table[i, k] = yt_table[i, k] == 0 ? 0 : 1;
  }

  any_gold = max(y_true_I);
}
parameters {
  simplex[K] pi[S]; // category prevalence in independent samples (with non-differential error assumption)
  simplex[K] theta[J, K]; // misclassification probabilities for each coder (no pooling)
}
transformed parameters {
  vector[K] log_q_z[I];
  vector[K] log_theta[J, K] = log(theta);
  vector[K] log_pi[S] = log(pi);

  for (i in 1:I)
    log_q_z[i] = log_pi[sss[i]];
  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] += log_theta[jj[n], k, y[n]];
}
model {
  for (s in 1:S)
    pi[s] ~ dirichlet(alpha[s]);

  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(beta[j, k]);

  // no gold codings
  if (any_gold == -1) {
    for (i in 1:I)
      target += log_sum_exp(log_q_z[i]);
  } else {
  // units without gold coding
    for (i in 1:I) {
      if (y_true_I[i] == -1) {
        target += log_sum_exp(log_q_z[i]);
      }
    }
  // units with gold coding
    for (n in 1:N) {
      if (y_true[n] != -1) {
        y[n] ~ categorical(theta[jj[n], y_true[n]]);
        y_true[n] ~ categorical(pi[ss[n]]);
      }
    }
  }
}
generated quantities {
  vector[K] softmax_lqz[I]; // probabilities of each k for each coding unit
  int y_imp[I];

  for (i in 1:I)
    softmax_lqz[i] = y_true_I[i] == -1 ? softmax(log_q_z[i]) : to_vector(yt_table[i]);

  for (i in 1:I)
    y_imp[i] = categorical_rng(softmax_lqz[i]);
}

I just updated the syntax for the one I had lying around.
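
And since it needs initialization like Bob’s model above, here is a hypothetical init in the same style (diagonal-heavy theta, uniform pi per sample; J, K, and S as in the data):

theta_init <- array(0.2 / (K - 1), c(J, K, K));
for (j in 1:J)
  for (k in 1:K)
    theta_init[j, k, k] <- 0.8;
pi_init <- matrix(1 / K, S, K);
init_fun <- function(n) list(theta = theta_init, pi = pi_init);

Pass it to stan() as init = init_fun.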