This is another example of a latent discrete parameter model where the discrete parameters are marginalized out and the marginalized model is fit with Stan. In the generated quantities, the posterior for the latent discrete parameter (here the true category of each item) is calculated to much higher accuracy than would be provided by MCMC sampling with a discrete parameter.
Be careful, though—this model is based on what is usually a false independence assumption: that each rater's label for an item is chosen independently. In reality, some items are harder to categorize than others and raters tend to make correlated mistakes. As a result, posteriors will be sharper than they should be, which will show up in real-world calibration tests.
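In symbols, the model marginalizes the true category z_i of item i out of the likelihood, and the posterior over z_i is just the renormalized summands, which is what the generated quantities block below computes:

\[
p(y_i \mid \pi, \theta) = \sum_{k=1}^K \pi_k \prod_{j=1}^J \theta_{j,k,y_{i,j}},
\qquad
\Pr[z_i = k \mid y_i, \pi, \theta]
  = \frac{\pi_k \prod_{j=1}^J \theta_{j,k,y_{i,j}}}
         {\sum_{k'=1}^K \pi_{k'} \prod_{j=1}^J \theta_{j,k',y_{i,j}}},
\]

where \theta_{j,k,y} is the probability that annotator j assigns label y to an item whose true category is k.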
Stan program
This should go in the file dawid-skene.stan:
/*
 * Implementation of Dawid and Skene's model for measurement error
 * among categorical raters. The model is extended with fixed Bayesian
 * priors. The discrete parameters for the category of each item are
 * marginalized out, with their posterior probability returned on the
 * log scale in the derived quantity log_Pr_z.
 *
 * Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation
 * of observer error-rates using the EM algorithm. Applied Statistics,
 * 28(1), 20--28.
 */
data {
  int<lower=2> K;               // number of categories
  int<lower=1> I;               // number of items
  int<lower=1> J;               // number of coders
  int<lower=1,upper=K> y[I,J];  // label from coder j for item i
  vector<lower=0>[K] alpha;     // prior for prevalence (positive)
  vector<lower=0>[K] beta[K];   // prior for coder responses (positive)
}
parameters {
  simplex[K] pi;          // prevalence of categories
  simplex[K] theta[J,K];  // response of annotator j to category k
}
model {
  // log prior: log p(theta, pi)
  pi ~ dirichlet(alpha);
  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(beta[k]);

  // log marginal likelihood: log p(y | theta, pi)
  for (i in 1:I) {
    vector[K] log_q = log(pi);
    for (j in 1:J)
      log_q += to_vector(log(theta[j, , y[i, j]]));
    target += log_sum_exp(log_q);
  }
}
generated quantities {
  // normalized log posterior: log Pr[z | y]
  vector[K] log_Pr_z[I];
  for (i in 1:I) {
    vector[K] log_q = log(pi);
    for (j in 1:J)
      log_q += to_vector(log(theta[j, , y[i, j]]));
    log_Pr_z[i] = log_q - log_sum_exp(log_q);
  }
}
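Exponentiating log_Pr_z gives per-draw posterior probabilities for each item's category; averaging those over the draws yields the posterior Pr[z_i = k | y]. A minimal R sketch, assuming fit_ss is the extracted fit and z the simulated true categories from the script in the next section:

Pr_z <- apply(exp(fit_ss$log_Pr_z), c(2, 3), mean);  # I x K matrix of category probabilities
z_hat <- apply(Pr_z, 1, which.max);                  # MAP category per item
mean(z_hat == z);                                    # accuracy against the simulated truth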
R simulation and fit code
Note that a partial initialization is provided for the prevalence and annotator response parameters. Without it, the model will label switch among solutions in which annotators are really accurate and solutions in which they are really inaccurate. There's a discussion in the problematic posteriors chapter of the user's guide.
After the fit, the script prints the posterior mean of the annotator response parameter theta against the true value. I'm too lazy to figure out a way to plot the results, though a minimal plotting sketch follows the script. You should probably run multiple chains in parallel, as the instructions printed on loading rstan indicate, but I just set it up with a vanilla call.
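For the record, the parallel setup that rstan recommends when it loads is just:

options(mc.cores = parallel::detectCores());
rstan_options(auto_write = TRUE);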
library(rstan);
library(MCMCpack);

rcat <- function(n, theta)
  sample(length(theta), n, replace = TRUE, prob = theta);

K <- 3;
I <- 500;
J <- 5;

y <- array(0, c(I, J));
alpha <- rep(3, K);       # modest regularization
beta <- matrix(1, K, K);
for (k in 1:K)
  beta[k, k] <- 2.5 * K;  # 75%-ish accuracy

pi <- rdirichlet(1, alpha);
theta <- array(0, c(J, K, K));
for (j in 1:J)
  for (k in 1:K)
    theta[j, k, ] <- rdirichlet(1, beta[k, ]);
z <- rcat(I, pi);
for (i in 1:I)
  for (j in 1:J)
    y[i, j] <- rcat(1, theta[j, z[i], ]);

theta_init <- array(0.2 / (K - 1), c(J, K, K));
for (j in 1:J)
  for (k in 1:K)
    theta_init[j, k, k] <- 0.8;
pi_init <- rep(1 / K, K);

fit <- stan("dawid-skene.stan",
            data = c("K", "I", "J", "y", "alpha", "beta"),
            init = function(n) list(theta = theta_init, pi = pi_init),
            chains = 4, iter = 2000);

fit_ss <- extract(fit);
for (j in 1:J)
  for (k_ref in 1:K)
    for (k_resp in 1:K)
      print(sprintf("(%d, %d, %d) theta=%5.3f hat-theta=%5.3f",
                    j, k_ref, k_resp, theta[j, k_ref, k_resp],
                    mean(fit_ss$theta[, j, k_ref, k_resp])),
            quote = FALSE);
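If you do want a picture, a minimal base R sketch (assuming theta and fit_ss from the script above) scatters the true response probabilities against their posterior means:

theta_hat <- apply(fit_ss$theta, c(2, 3, 4), mean);  # posterior means, J x K x K
plot(as.vector(theta), as.vector(theta_hat),
     xlab = "true theta", ylab = "posterior mean theta",
     xlim = c(0, 1), ylim = c(0, 1));
abline(0, 1, lty = 2);  # identity line: points on it are perfectly recovered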
People asked me about licensing, so I'm releasing this code, like the rest of Stan, under the BSD 3-clause license, with Columbia University in the City of New York as the copyright holder.