I have a set of binary variables for which I know an approximate misclassification error. Each data point comes from a single annotator, but there are many annotators. We also know that in a task for which many annotators had to annotate the same series of data points, annotators would agree about 90% of the time. The original data for the experiment is not available. My question is: how would one go about modelling this type of misclassification error? Ideally, I’d like to include this error term in a multiprobit model, for each binary variable.

You will need to figure out something like what proportion of true zeros are misclassified as ones and what proportion of true ones are misclassified as zeros. In the absence of additional information, you might consider making the (quite strong) assumptions that misclassification errors in both directions are equally likely (i.e. that getting a rating of one conditional on a true zero is just as probable as getting a rating of zero conditional on a true one), are invariant across raters, and are invariant across items. If that’s true, then given some true probability \rho of misclassifying any item, the probability that two raters will disagree is 2\rho(1-\rho), and you could plug in 0.1 and solve for \rho, presumably taking the solution where both raters are usually right rather than the one where both raters are usually wrong.
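Numerically, that amounts to solving the quadratic 2\rho(1-\rho) = 0.1. A quick sketch (the 0.1 disagreement rate is the one from the question):

```python
import math

# Solve 2*rho*(1 - rho) = d for rho, i.e. rho^2 - rho + d/2 = 0.
disagreement = 0.1
roots = [(1 - math.sqrt(1 - 2 * disagreement)) / 2,
         (1 + math.sqrt(1 - 2 * disagreement)) / 2]
rho = min(roots)  # the "raters are usually right" solution
print(round(rho, 4))  # 0.0528
```

So a 10% pairwise disagreement rate corresponds, under these assumptions, to roughly a 5.3% per-rating misclassification probability.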

Then the log-likelihood of observing a 1 is

log_sum_exp(L(1) + log1m(rho), L(0) + log(rho))

where L(1) is the log-likelihood of the true state being 1 according to the model, and L(0) is the log-likelihood of the true state being 0 according to the model.

And likewise the log-likelihood of observing a 0:

log_sum_exp(L(0) + log1m(rho), L(1) + log(rho))
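These two expressions can be checked numerically. In the sketch below, the values of L(1) and L(0) are made-up placeholders standing in for the (multi)probit model's log-likelihoods, and rho is an assumed misclassification probability:

```python
import math

def log_sum_exp(a, b):
    # numerically stable log(exp(a) + exp(b))
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# Hypothetical model log-likelihoods for the true state being 1 or 0.
L1 = math.log(0.7)   # L(1)
L0 = math.log(0.3)   # L(0)
rho = 0.05           # assumed misclassification probability

# Observing a 1: either the true state is 1 and it was recorded
# correctly, or the true state is 0 and it was flipped.
ll_obs1 = log_sum_exp(L1 + math.log1p(-rho), L0 + math.log(rho))
# And likewise for observing a 0.
ll_obs0 = log_sum_exp(L0 + math.log1p(-rho), L1 + math.log(rho))

# Sanity check: the two observation probabilities sum to 1.
print(round(math.exp(ll_obs1) + math.exp(ll_obs0), 6))  # 1.0
```

`math.log1p(-rho)` plays the role of Stan's `log1m(rho)`.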

Note that the assumptions that we used to get here are really aggressive and might not be reasonable to make!

This is very useful, thanks! Assuming I was able to find out the overall misclassification probabilities for when the true value is 1 and when the true value is 0, I would then have to estimate two different rhos, one for the first line and one for the second, correct?


In the absence of more data on interrater disagreement and overall annotator error, do you know of alternative approaches to incorporating uncertainty in a situation like this?

Maybe to expand a bit on the motivation of the question: there are many claims about the correlations of these variables and about how one variable can predict another, and conclusions are then drawn from those coefficients. I’m interested in seeing how much the model’s uncertainty about the estimates increases once we take into account that there is already considerable uncertainty in the annotation of the variables themselves.

If you know these probabilities, then you don’t have to estimate them at all, but yes, you would need to use two different rhos.

To incorporate the uncertainty, we need a model for the uncertainty. That model is going to boil down to a model for the probability of misclassification when the true state is 1, and the probability of misclassification when the true state is 0. If you want to relax the assumptions, you might need to treat these misclassification probabilities as unknown parameters to be estimated, presumably subject to the constraint that the overall dataset-wide disagreement rate is roughly 0.1. Some thought might be required to select a good prior to place on these parameters.
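With two asymmetric error rates, the dataset-wide disagreement rate between two raters sharing those rates also depends on the base rate of true ones. A rough numeric sketch of the constraint (the base rate and both error rates below are made-up placeholders, not estimates):

```python
# Under the earlier invariance assumptions, two raters disagree on a
# true-1 item with probability 2*rho_1*(1 - rho_1) and on a true-0 item
# with probability 2*rho_0*(1 - rho_0); the overall disagreement rate
# mixes these by the base rate of true ones.
pi_1  = 0.4    # assumed base rate of true ones
rho_1 = 0.08   # assumed P(recorded as 0 | true 1)
rho_0 = 0.04   # assumed P(recorded as 1 | true 0)

disagree = (pi_1 * 2 * rho_1 * (1 - rho_1)
            + (1 - pi_1) * 2 * rho_0 * (1 - rho_0))
print(round(disagree, 4))  # should come out near the observed 0.1
```

In a Bayesian fit you would treat rho_0 and rho_1 (and possibly pi_1) as parameters and let a prior plus this kind of soft constraint on the observed 0.1 disagreement rate identify them.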

You’re in luck. This is the problem that got me into Bayesian stats!

You want to look into the Dawid-Skene model. There’s a simple version in the Stan User’s Guide:

The basic idea is that there’s a latent true value, and the label provided by a coder (annotator, rater, labeler, etc.) is a noisy measurement. It’s critical that you don’t do this by weighted voting, but in a way that can adjust for each annotator’s accuracy. See this recent discussion on @andrewgelman’s blog:
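For a concrete sense of the mechanics, here is a bare-bones two-class Dawid-Skene fit via EM in Python. This is a sketch only: the function name and majority-vote initialization are mine, and the Stan User’s Guide version is a full Bayesian model rather than EM.

```python
import numpy as np

def dawid_skene_em(labels, n_iter=50):
    """Two-class Dawid-Skene via EM.

    labels[i][j] in {0, 1, None}: label given by annotator j to item i
    (None = item not annotated by j). Returns P(true class = 1) per item.
    """
    labels = np.array(labels, dtype=float)  # None becomes nan
    n_items, n_ann = labels.shape
    observed = ~np.isnan(labels)

    # Initialize item posteriors by soft majority vote; this keeps EM in
    # the "annotators are mostly right" mode rather than its mirror image.
    p1 = np.nanmean(labels, axis=1)

    for _ in range(n_iter):
        # M-step: class prior and each annotator's emission rates,
        # i.e. P(says 1 | true 1) and P(says 1 | true 0).
        prior1 = p1.mean()
        emit1 = np.zeros(n_ann)
        emit0 = np.zeros(n_ann)
        for j in range(n_ann):
            obs = observed[:, j]
            y, w1 = labels[obs, j], p1[obs]
            emit1[j] = (w1 * y).sum() / max(w1.sum(), 1e-12)
            emit0[j] = ((1 - w1) * y).sum() / max((1 - w1).sum(), 1e-12)

        # E-step: posterior over the latent true class for each item.
        log1 = np.full(n_items, np.log(max(prior1, 1e-12)))
        log0 = np.full(n_items, np.log(max(1 - prior1, 1e-12)))
        for j in range(n_ann):
            obs = observed[:, j]
            y = labels[obs, j]
            e1 = np.clip(emit1[j], 1e-6, 1 - 1e-6)
            e0 = np.clip(emit0[j], 1e-6, 1 - 1e-6)
            log1[obs] += y * np.log(e1) + (1 - y) * np.log(1 - e1)
            log0[obs] += y * np.log(e0) + (1 - y) * np.log(1 - e0)
        m = np.maximum(log1, log0)
        p1 = np.exp(log1 - m) / (np.exp(log1 - m) + np.exp(log0 - m))

    return p1
```

Called on a small matrix of multiply-annotated items, this downweights less reliable annotators when inferring each item’s latent class, which is exactly what plain (weighted) voting cannot do.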

I can also point to this application I did with Becky Passonneau for word sense in NLP, where I go over the basics of why inter-annotator agreement isn’t a great idea:

Easy to make these Bayesian as we showed in a later paper. The other paper I’d highly recommend is:

I took this to mean that you do not have the annotations from multiple annotators for a single item, but rather just one annotation, plus the knowledge that this annotation arose from a group of annotators who agree with one another approximately 90% of the time. If instead you have multiple ratings for at least some of the data points, then by all means the approach suggested by @Bob_Carpenter is the way to go (but then I’m not sure what you mean when you say the original data aren’t available).


Correct. Reading through what Bob posted, it seems that approach only applies to cases with multiple annotators for the same item?