Dawid-Skene annotation model with partial pooling of annotators

I want to implement a version of the Dawid-Skene model which is described in the manual (p. 217). Bob has pointed me to the model in an earlier thread (https://groups.google.com/d/topic/stan-users/G1nPD2TbZ08/discussion).

The manual states “because there are J coders, it would be natural to extend the model to include a hierarchical prior for and to partially pool the estimates of coder accuracy and bias.” My goal is to implement such a partial pooling model. I started with a modification of the no pooling model in the manual (the full model code is pasted below; new users are unfortunately not allowed to upload files). My implementation differs somewhat from the manual because I wanted to allow for missing data (not every unit is seen by each annotator) and include data from independent studies with different (but unknown) prevalences of the true categories.
I then followed the partial pooling vignette (http://mc-stan.org/documentation/case-studies/pool-binary-trials.html):

The common misclassification probabilities eta are sampled from a Dirichlet prior:

  for (k in 1:K)
    eta[k] ~ dirichlet(beta[k]);

The weight of the common misclassification matrix kappa relative to the data is sampled from a Pareto prior:

  kappa ~ pareto(1, 1.5);

The misclassification matrices for all coders theta[j] are obtained by combining both:

  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(eta[k] * kappa);

The parameters are recovered well from simulated data. However, in smallish (but typical for my field) data sets (for example, with I = 50 test units which are all coded by the same J = 7 coders), there are almost always a small number of divergent transitions. If I understood the warning guide correctly, even a small number of divergent transitions invalidates the result - even if the true parameters are generally recovered quite well in different simulations. The problem with small data makes sense to me - there is just not much information from which to estimate a common misclassification matrix.

So my questions are

  1. Does this approach to partial pooling of coder abilities make sense in general or are the divergent transitions an indicator for a fundamental problem?
  2. Is there an (easy) solution to the divergent transitions problem, or is it a sign that there is just not enough data to estimate the partial pooling model? In the latter case, I would fall back on the no pooling model (with larger uncertainty in many setups) or on the interchangeable coder assumption of a complete pooling model (with less uncertainty, but probably larger bias).

Thank you in advance for any advice!

P.S.: Bob already suggested that the code would be more efficient with vectorization. This is next on my list after I have a grasp on the partial pooling problem.

Model code here (I think it would be helpful to allow new users to upload files; that would be an easier solution, and I would then also provide example data…):

Partial pooling:

data {
  int<lower=2> K; // number of categories
  vector<lower=0>[K] beta[K]; // dirichlet prior on probability that coder j assigns category k to coding unit i
  int<lower=1> J; // number of coders
  int<lower=1> I; // number of coding units
  int<lower=1> N; // number of codings
  int<lower=1> S; // number of independent samples
  int<lower=1,upper=K> y[N]; // codings
  int<lower=1,upper=I> ii[N]; // index coding units in data
  int<lower=1,upper=J> jj[N]; // index coder in data
  int<lower=1,upper=S> ss[N]; // index sample in data
  int<lower=1,upper=I> iii[I]; // index coding units to sample (not necessary, for completeness)
  int<lower=1,upper=S> sss[I]; // index sample to coding units
  vector<lower=0>[K] alpha[S]; // dirichlet prior on category prevalence in studies
}
parameters {
  simplex[K] pi[S]; // category prevalence in independent samples (with non-differential error assumption) 
  simplex[K] theta[J, K]; // misclassification probabilities for each coder (partial pooling)
  simplex[K] eta[K]; // overall misclassification probabilities (partial pooling)
  real<lower=1> kappa; // amount of pooling (see http://mc-stan.org/documentation/case-studies/pool-binary-trials.html)
}
transformed parameters {
  vector[K] log_q_z[I];

  for (i in 1:I)
    log_q_z[i] = log(pi[sss[i]]);
  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] = log_q_z[ii[n], k] + log(theta[jj[n], k, y[n]]);
}
model {
  for (s in 1:S)
    pi[s] ~ dirichlet(alpha[s]);

  kappa ~ pareto(1, 1.5);
  
  for (k in 1:K)
    eta[k] ~ dirichlet(beta[k]);

  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(eta[k] * kappa);
    
  for (i in 1:I)
    target += log_sum_exp(log_q_z[i]);
}
generated quantities {
  vector[K] softmax_lqz[I]; // probabilities of each k for each coding unit
  for (i in 1:I)
    softmax_lqz[i] = softmax(log_q_z[i, ]);
}

No pooling (adapted from the manual):

data {
  int<lower=2> K; // number of categories
  vector<lower=0>[K] beta[K]; // dirichlet prior on probability that coder j assigns category k to coding unit i
  int<lower=1> J; // number of coders
  int<lower=1> I; // number of coding units
  int<lower=1> N; // number of codings
  int<lower=1> S; // number of independent samples
  int<lower=1,upper=K> y[N]; // codings
  int<lower=1,upper=I> ii[N]; // index coding units in data
  int<lower=1,upper=J> jj[N]; // index coder in data
  int<lower=1,upper=S> ss[N]; // index sample in data
  int<lower=1,upper=I> iii[I]; // index coding units to sample (not necessary, for completeness)
  int<lower=1,upper=S> sss[I]; // index sample to coding units
  vector<lower=0>[K] alpha[S]; // dirichlet prior on category prevalence in studies
}
parameters {
  simplex[K] pi[S]; // category prevalence in independent samples (with non-differential error assumption) 
  simplex[K] theta[J, K]; // misclassification probabilities for each coder (no pooling)
}
transformed parameters {
  vector[K] log_q_z[I];

  for (i in 1:I)
    log_q_z[i] = log(pi[sss[i]]);
  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] = log_q_z[ii[n], k] + log(theta[jj[n], k, y[n]]);
}
model {
  for (s in 1:S)
    pi[s] ~ dirichlet(alpha[s]);

  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(beta[k]);
    
  for (i in 1:I)
    target += log_sum_exp(log_q_z[i]);
}
generated quantities {
  vector[K] softmax_lqz[I]; // probabilities of each k for each coding unit
  for (i in 1:I)
    softmax_lqz[i] = softmax(log_q_z[i, ]);
}

Nice!

(1) Yes, you can have a common kappa like that, but you could also have it vary by k.

(2) Have you tried upping the adapt_delta parameter to something like 0.95 or 0.99? If you only have a few divergent transitions, that should get rid of them. You may also need to start with a smaller stepsize or run warmup longer to get better adaptation.

A second approach would be stronger priors.

A third approach would be to move over to a logistic scale rather than using Dirichlets. That also lets you model covariance if you use a multivariate logistic normal (log odds from a multi-normal then soft-maxed).

P.S. I prefer the models pasted into the message. Much easier for me than opening in another app.

Bob, thank you for the quick reply.

That’s a great suggestion, especially for larger data sets, thanks. Did I understand correctly that you meant something like

parameters {
   ...
  real<lower=1> kappa[K];
}
model {
   ...
  kappa ~ pareto(1, 1.5);
  ...
  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(eta[k] * kappa[k]);
   ...
}

This worked great in some first informal simulations with balanced data sets with about I >= 100 units and J >= 10 coders. For most applied problems (manual content analysis with a few research assistants as coders in communication research), I assume that I will have to stick with the common kappa, because there is often too little data even for the simple model. Or maybe put a prior on the kappas to keep them closer together if there is little empirical information (just documenting some thoughts for later here; a possible sketch is below).
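
One way such a "keep the kappas together" prior could look (just a sketch of the thought above, nothing I have tested; log_kappa, mu_kappa, and tau_kappa are made-up names):

parameters {
  vector[K] log_kappa;       // per-category concentration, on the log scale
  real mu_kappa;             // shared location that pulls the kappas together
  real<lower=0> tau_kappa;   // how tightly they are pulled together
}
transformed parameters {
  vector<lower=1>[K] kappa = 1 + exp(log_kappa);  // kappa > 1 as in the Pareto version
}
model {
  log_kappa ~ normal(mu_kappa, tau_kappa);
  mu_kappa ~ normal(0, 2);   // assumed weakly informative hyperpriors
  tau_kappa ~ normal(0, 1);  // half-normal via the <lower=0> constraint
}

The prior theta[j, k] ~ dirichlet(eta[k] * kappa[k]); would stay as above.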

Sorry, I forgot to mention that I already increased adapt_delta to 0.99; this reduced the number of divergent transitions, but some remained in most cases. Stronger priors (+ longer warmup in some instances) work quite well.
At the moment, I started with the mildly informative priors from the manual (2.5 * K in the diagonal, 1 in the off-diagonal). I think that multiplying beta in the Dirichlet input by some integer (e.g., beta * 10) is an intuitive (albeit probably somewhat simplistic) way to communicate this strategy to an audience which is not at all familiar with this kind of model. The simple advice would then be “You have to use stronger priors until the divergent transitions go away, because the data alone does not provide enough information”. Is this at all reasonable advice to give to applied users, or would it put them on the wrong track?
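
In Stan, this "multiply beta" idea could be as simple as scaling the prior counts once in transformed data (a sketch; prior_scale is a made-up name for the multiplier):

transformed data {
  real<lower=0> prior_scale = 10;     // assumed strength multiplier
  vector<lower=0>[K] beta_strong[K];  // strengthened dirichlet prior counts
  for (k in 1:K)
    beta_strong[k] = prior_scale * beta[k];
}

The prior would then read theta[j, k] ~ dirichlet(beta_strong[k]); instead of dirichlet(beta[k]);.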

Do you mean something like in the example in the manual on pp. 270-271? I fear that this transformation would be extremely difficult to communicate to my intended audience. Having a model of the coding process at all instead of just analysing the raw coding data is already a big step, and using informative priors is somewhat controversial. What I like about the Dirichlet parametrization is that it translates quite directly to the simplex misclassification matrix. A plot brings across the intuitive idea quite well without going into the specifics (see below for an example I have in mind to explain the suggested 2.5*K, 1 prior from the manual).
Are there any similarly straightforward ways to communicate the priors in the multivariate logistic normal model?

Thanks again!

I just wrote an answer like this; we should just make people a flow chart!

Yes, exactly what I meant for kappa.

The way to motivate priors is to show that they work. Hierarchical priors are easier to motivate a priori—they’re just random effects models. The hyperpriors above those are usually not very sensitive. I go through both this and the logistic approach in my case study on repeated binary trials (very related to all this work).

You can sometimes remove divergences with priors, especially if they are due to non-identifiabilities. See the problematic posteriors chapter of the manual. In these cases, the divergences often correspond to regions of the posterior you don’t want to explore anyway based on prior knowledge. Sometimes you can reparameterize—again we discuss this in the manual (I think in the optimization chapter).

For priors for multivariate logistic normal, you can include covariance. You still get a distribution over simplexes—it’s just richer. You can communicate with logistic models directly on the log odds scale (nicely additively symmetric around zero as opposed to odds being multiplicatively symmetric around 1), or you can convert back to probabilities and look at the implied simplexes.
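
For example, a minimal prior-predictive sketch (run with algorithm=fixed_param; mu and sigma stand for whatever log odds hyperparameters you want to illustrate) that maps multivariate normal draws to the implied simplexes:

data {
  int<lower=2> K;          // number of categories
  vector[K] mu[K];         // prior means on the log odds scale
  cov_matrix[K] sigma[K];  // prior covariances on the log odds scale
}
generated quantities {
  simplex[K] theta_implied[K];  // one draw of the implied misclassification rows
  for (k in 1:K)
    theta_implied[k] = softmax(multi_normal_rng(mu[k], sigma[k]));
}

Plotting many draws of theta_implied gives the same kind of intuitive picture as the Dirichlet plot described above.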

They may not be very sensitive to variations in scale, but you have to have something at least weakly informative! In particular, uniform priors on the hypervariance strongly bias the model towards the non-hierarchical limit!

How much does the group size matter? I believe Andrew still has improper priors littered throughout BDA, so we should warn people more strongly if it’s always an issue.

In the annotation models, there are typically hundreds if not thousands of items, and dozens if not hundreds of annotators. I tried more and less diffuse priors, but nothing improper.

Dear Bob - thanks again for the advice.

I tried to change the parametrization of the misclassification probabilities from dirichlet to multivariate logistic normal, but without success. I started with the simple no pooling model (see below for the full model). Instead of

model {
  ...
  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(beta[k]);
  ...

and the respective data and parameter declarations, the relevant parts of the model are

data {
  ...
  vector[K] mu[K]; // mean vectors for multi_normal
  cov_matrix[K] sigma[K]; // covariance matrices for multi_normal
  ...
}
parameters {
  ...
  vector[K] theta_raw[J, K]; // misclassification probabilities for each coder on log odds scale
}
transformed parameters {
  ...
  simplex[K] theta[J, K]; // misclassification probabilities for each coder on probability scale

  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] = softmax(theta_raw[j, k]);
  ...        
}
model {
  ...
  for (j in 1:J)
    for (k in 1:K)
      theta_raw[j, k] ~ multi_normal(mu[k], sigma[k]);
}

Everything else remained as in the first no pooling model adapted from the manual. The new model does not converge at all, with many Rhat > 1 and n_eff < 100 from 8,000 post-warmup draws. I tried several values for mu and sigma (e.g., mu with 3 in the diagonal and 1 in the off-diagonal and sigma with 1 in the diagonal and 0s for covariances, which is quite similar to the dirichlet(2.5*K, 1) after softmax). I also used large simulated data sets so the priors would not be as important, but again without success.

It seems like I am missing something obvious, but I cannot figure it out.

Full model (see second model at the bottom of this post for comparison: Dawid-Skene annotation model with partial pooling of annotators)

data {
  int<lower=2> K; // number of categories
  vector[K] mu[K]; // mean vectors for multi_normal
  cov_matrix[K] sigma[K]; // covariance matrices for multi_normal
  int<lower=1> J; // number of coders
  int<lower=1> I; // number of coding units
  int<lower=1> N; // number of codings
  int<lower=1> S; // number of independent samples
  int<lower=1,upper=K> y[N]; // codings
  int<lower=1,upper=I> ii[N]; // index coding units in data
  int<lower=1,upper=J> jj[N]; // index coder in data
  int<lower=1,upper=S> ss[N]; // index sample in data
  int<lower=1,upper=I> iii[I]; // index coding units to sample (not necessary, for completeness)
  int<lower=1,upper=S> sss[I]; // index sample to coding units
  vector<lower=0>[K] alpha[S]; // dirichlet prior on category prevalence in studies
}
parameters {
  simplex[K] pi[S]; // category prevalence in independent samples (with non-differential error assumption) 
  vector[K] theta_raw[J, K]; // misclassification probabilities for each coder on log odds scale
}
transformed parameters {
  vector[K] log_q_z[I];
  simplex[K] theta[J, K]; // misclassification probabilities for each coder on probability scale

  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] = softmax(theta_raw[j, k]);
    
  for (i in 1:I)
    log_q_z[i] = log(pi[sss[i]]);
  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] = log_q_z[ii[n], k] + log(theta[jj[n], k, y[n]]);
}
model {
  for (s in 1:S)
    pi[s] ~ dirichlet(alpha[s]);

  for (j in 1:J)
    for (k in 1:K)
      theta_raw[j, k] ~ multi_normal(mu[k], sigma[k]);
    
  for (i in 1:I)
    target += log_sum_exp(log_q_z[i]);
}
generated quantities {
  vector[K] softmax_lqz[I]; // probabilities of each k for each coding unit
  for (i in 1:I)
    softmax_lqz[i] = softmax(log_q_z[i, ]);
}

There are a bunch of potential problems here.

You don’t have a prior on your covariance matrix. You can look at our suggestions in the manual for using a scaled Cholesky factor of a correlation matrix (it’s in the regression chapter). It’s better statistically in most cases and also better computationally.
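
A minimal sketch of that suggestion in the context of the model above (names are illustrative; the covariance structure is now estimated instead of being passed in as sigma):

parameters {
  cholesky_factor_corr[K] L_Omega[K];  // correlation Cholesky factor per true category
  vector<lower=0>[K] tau[K];           // scales per true category
  vector[K] theta_raw[J, K];           // log odds, as before
}
model {
  for (k in 1:K) {
    L_Omega[k] ~ lkj_corr_cholesky(2);  // weakly favors modest correlations
    tau[k] ~ normal(0, 2);              // half-normal via the <lower=0> constraint
  }
  for (j in 1:J)
    for (k in 1:K)
      theta_raw[j, k] ~ multi_normal_cholesky(
          mu[k], diag_pre_multiply(tau[k], L_Omega[k]));
}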

There are identifiability issues with softmax which favor a K - 1 parameter version.
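
A sketch of the K - 1 version: only K - 1 log odds are free parameters per row, and the last component is pinned to zero before the softmax:

parameters {
  vector[K - 1] theta_free[J, K];  // K - 1 free log odds per coder and true category
}
transformed parameters {
  simplex[K] theta[J, K];
  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] = softmax(append_row(theta_free[j, k], 0));  // last component fixed at 0
}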

I think you may be summing on the log scale when you want the log-sum-exp scale, but I’d have to work through the math.

Bob, thanks again so much. I realize now that the change of the parametrization is more complex than I had first naively thought. I will work through the relevant chapters of the manual and the case studies, but this will most likely take a while. I am learning about Stan and probabilistic modelling in a work-in-progress fashion, trying to pick up what is needed when problems occur. Many things are very intuitive, but obviously I am also missing some basics… I sincerely hope it does not bother you to give feedback during this somewhat unstructured process and I really appreciate the advice.

In this spirit, I hope to get a sanity check on another modification of the Dawid-Skene model. I started with the no pooling model, but the approach would work identically with the partial or complete pooling models. The idea is to incorporate the possibility that some coding units have known gold labels into the model. My current approach is the following (full model at the end of the post):

The simplex array gold contains the known probabilities for each k and each unit with a known label. In my applied cases, each simplex will usually contain one 1 and K-1 0s, but soft labels are also possible in principle. The vector ggg indexes each coding unit to the gold array (-1 for units without a gold label), and the vectors ss and sss, which index each coding or coding unit, respectively, to independent samples, are 0 if the coding unit comes from the sample of units for which the label is known. In Stan:

data {
  ...
  int<lower=0,upper=S> ss[N]; // index sample in data, 0 = sample with gold units
  int<lower=0,upper=S> sss[I]; // index sample to coding units, 0 = sample with gold units
  int<lower=-1,upper=G> ggg[I]; // index coding units to gold labels
  simplex[K] gold[G]; // gold standard probabilities
}

The model (in the transformed parameters block) checks whether the label for a unit is known. If so, it takes the log of the known label probabilities. If not, it takes the log of the estimated category prevalence pi (which has a dirichlet(alpha) prior), as before.

transformed parameters {
  vector[K] log_q_z[I];

  for (i in 1:I)
    log_q_z[i] = sss[i] == 0 ? log(gold[ggg[i]]) : log(pi[sss[i]]);
  ...
}

The model recovers the parameters from simulated data well. The comparison with the fit of a model without the additional gold labels looks as expected: the estimates of theta are more precise.

Does this approach make sense? Another idea I had was to include the gold labels as codings from an additional annotator who has a known perfect accuracy (i.e., 1 in the diagonal of theta), but I did not know how to implement this in a Stan program. Would this also be feasible?

Full model (see second model at the bottom of this post for comparison: Dawid-Skene annotation model with partial pooling of annotators)

data {
  int<lower=2> K; // number of categories
  vector<lower=0>[K] beta[K]; // dirichlet prior on probability that coder j assigns category k to coding unit i
  int<lower=1> J; // number of coders
  int<lower=1> I; // number of coding units
  int<lower=1> N; // number of codings
  int<lower=1> S; // number of independent samples
  int<lower=0> G; // number of units with gold label
  int<lower=1,upper=K> y[N]; // codings
  int<lower=1,upper=I> ii[N]; // index coding units in data
  int<lower=1,upper=J> jj[N]; // index coder in data
  int<lower=0,upper=S> ss[N]; // index sample in data
  int<lower=1,upper=I> iii[I]; // index coding units to sample (not necessary, for completeness)
  int<lower=0,upper=S> sss[I]; // index sample to coding units, 0 = sample with gold units
  int<lower=-1,upper=G> ggg[I]; // index coding units to gold labels
  vector<lower=0>[K] alpha[S]; // dirichlet prior on category prevalence in studies
  simplex[K] gold[G]; // gold standard probabilities
}
parameters {
  simplex[K] pi[S]; // category prevalence in independent samples (with non-differential error assumption) 
  simplex[K] theta[J, K]; // misclassification probabilities for each coder (no pooling)
}
transformed parameters {
  vector[K] log_q_z[I];

  for (i in 1:I)
    log_q_z[i] = sss[i] == 0 ? log(gold[ggg[i]]) : log(pi[sss[i]]);
  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] = log_q_z[ii[n], k] + log(theta[jj[n], k, y[n]]);
}
model {
  for (s in 1:S)
    pi[s] ~ dirichlet(alpha[s]);

  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(beta[k]);
    
  for (i in 1:I)
    target += log_sum_exp(log_q_z[i]);
}
generated quantities {
  vector[K] softmax_lqz[I]; // probabilities of each k for each coding unit
  for (i in 1:I)
    softmax_lqz[i] = softmax(log_q_z[i, ]);
}

Yes, known gold standards are easy to incorporate. I would be reluctant to do that, because I don’t trust the golden annotators (I’ve done too many of these in the field!). An alternative is to treat them like other annotations, but give the golden annotators high fixed accuracies or give them high accuracy and concentrated priors.
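
One reading of the “high fixed accuracy” option is to treat the golden annotator’s misclassification matrix as a constant rather than a parameter (a sketch; the 0.95 / 0.05 split is an assumption, not a recommendation):

transformed data {
  vector[K] theta_gold[K];  // fixed misclassification matrix for the golden annotator
  for (k in 1:K) {
    theta_gold[k] = rep_vector(0.05 / (K - 1), K);  // small error mass off the diagonal
    theta_gold[k, k] = 0.95;                        // high fixed accuracy on the diagonal
  }
}

Codings from the golden annotator would then use theta_gold in the likelihood instead of a row of theta.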

This kind of loop is awkward in the code:

for (i in 1:I)
    log_q_z[i] = sss[i] == 0 ? log(gold[ggg[i]]) : log(pi[sss[i]]);

I also don’t know what gold[] is supposed to be here. You aren’t going to be able to pin the result this way (unless you take my suggestion above). What you need to do is fix z[i] in the cases where the gold standard for item i is observed.

I’m going to write up a case study for this in the next month or so—I’m going to cover all the questions you asked there, but it’s not going to be soon enough if you need answers today.

That would be great :) I am not under any time pressure and will continue to work on my model attempts in the meantime - really looking forward to seeing how it should be done correctly.

Just for clarification: My model does indeed fix z[i], although probably in a somewhat awkward way. gold[] is a two-dimensional GxK array where G is the number of units with known (gold) labels and K is the number of categories. Each row contains the probabilities that a unit is of label k. Most of the time, there will be one 1 and K-1 0s, because the unit is known to have a certain label - but soft labels (like in the last row of the example) would also be possible.

Example of gold with G = 5 and K = 3:

 0  1  0
 1  0  0
 1  0  0
 0  0  1
.5 .5  0

The awkward line does the following:

For each unit 1:I
  check if this unit has a known (gold) label (as indicated by sss == 0)
  if true: take the log of the respective row of gold[] (as indexed by ggg[i])
  else: take the log of the estimated category prevalence in the respective sample (as indexed by sss[i])

I am sure that there are more elegant ways to achieve this, but this way also seems to work according to all my tests (e.g., behavior of the estimates of theta, values of softmax_lqz in generated quantities equal the gold label with zero-width intervals).

I hope this makes my specification more comprehensible.

I still don’t understand that line for 1:I — why would you take the log of a constant and then what do you do with it?

What you need to do with a gold standard is use it to pin the z[i] for the relevant category. But then I don’t know how you do that with “soft” labels.

In general, there’s no point adding the log density of a constant, which still seems to be what you’re doing here.

Sorry for the delay, I was traveling with infrequent Internet access.

Yes, the for (i in 1:I) loop only adds the log of a constant if the gold label for unit i exists. This does not help, but (as far as my limited understanding goes) it also does not hurt. In my setup, it is necessary for the following for (n in 1:N) loop, where the category prevalence and the misclassification probabilities theta come together: a value of log_q_z must exist for every unit i as the first part of the sum.

  for (i in 1:I)
    log_q_z[i] = sss[i] == 0 ? log(gold[ggg[i]]) : log(pi[sss[i]]);
  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] = log_q_z[ii[n], k] + log(theta[jj[n], k, y[n]]);

I guess this approach is rather inefficient, but (at least for me) it seemed to be the most straightforward way to incorporate the known z[i] within the current setup and also allow for “soft” gold labels.

If you truly have a bunch of cases, say 1:K, with known values z[i], then all you need to do is this:

for (k in 1:K)
  y[k] ~ categorical(theta[annotator[k], z[item[k]]]);

where y[k] is the k-th annotation of an item with known value, annotator[k] is the annotator ID and item[k] is the item ID.
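
Spelled out a little more in the context of the models above (a sketch with assumed data names: N_gold codings y_gold of gold-labeled items, their coder indices jj_gold, and the known categories z_gold):

model {
  // codings of gold-labeled items: the true category is observed,
  // so there is nothing to marginalize
  for (n in 1:N_gold)
    y_gold[n] ~ categorical(theta[jj_gold[n], z_gold[n]]);

  // codings of the remaining items: marginalize as before
  // (log_q_z built from the non-gold codings only)
  for (i in 1:I)
    target += log_sum_exp(log_q_z[i]);
}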

It has been a while since I was able to work on the open problems, namely the inclusion of “gold standard” annotations and the implementation of a multivariate normal prior on the logistic scale instead of the dirichlet over simplexes. I have now come up with some quite simple solutions that work alright for my immediate purposes. I want to document them here in case anyone googles a related problem in the future - and I hope they make sense at all.

“Golden” annotators
I decided to include “golden” annotations just like any other annotations but with a much stronger prior on the diagonal of the misclassification matrix. This is simple in the no pooling model where all annotators are supposed to be independent anyway. In the original model with dirichlet priors, this looks like this (full model at the end of the post).

data {
  int<lower=2> K; // number of categories
  int<lower=1> J; // number of coders
  vector<lower=0>[K] beta[J, K]; // dirichlet priors on probability that coder j assigns category k to coding unit i
  ...

Each annotator j could now, in principle, have a different prior. In practice, I followed the suggestion from the manual for “ordinary” annotators (K*2.5 in the diagonal, 1 in the off-diagonal) and something like K*25 for the “golden” annotator diagonal.

The assignment in the model block must now index to the respective prior of each annotator:

model {
  ...
  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(beta[j, k]);
  ...

In the complete and partial pooling models, I did not want the “golden” annotator’s ability estimates pooled together with the other annotators. I therefore assigned the annotators to groups for which separate joint abilities are estimated. For example, in the complete pooling model with dirichlet priors:

data {
  int<lower=2> K; // number of categories
  int<lower=1> C; // number of independent groups of coders
  vector<lower=0>[K] beta[C, K]; // dirichlet prior on probability that coder from group c assigns category k to coding unit i
  ...
  int<lower=1,upper=C> cc[N]; // index coder group in data
  ...
  int<lower=1,upper=C> ccc[J]; // index coders to coder groups
  ...
transformed parameters {
  ...
  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] = log_q_z[ii[n], k] + log(theta[cc[n], k, y[n]]);
}
model {
  ...
  for (j in 1:J)
    for (k in 1:K)
      theta[ccc[j], k] ~ dirichlet(beta[ccc[j], k]);

There are C (usually 2) different priors beta and, similarly, C (usually 2) different misclassification matrices theta. The assignment is done via the cc and ccc indices.

The solution with “gold priors” works well for me and has the advantage that several “ordinary” annotators can overrule the “golden” annotator. It is also flexible enough to include more than one “golden” annotator. In my field these could be, for example, PhD students who supervise the student assistants.

Multivariate normal prior on logistic scale and partial pooling
I used multi_normal_cholesky() as advised by @Bob_Carpenter. The amount of pooling (i.e., the deviations of the individual annotators from the common misclassification probabilities) is regulated with sigma.

model {
  for (s in 1:S)
    pi[s] ~ dirichlet(alpha[s]);

    
  for (j in 1:J)
    for(k in 1:K) {
      eta_raw[ccc[j], k] ~ multi_normal_cholesky(mu[ccc[j], k], L_Sigma[ccc[j], k]);
      kappa[j, k] ~ normal(0, sigma);
    }

In the transformed parameter block, the individual deviations of each annotator (kappa[j, ]) and the common misclassification matrix eta_raw are added on the log scale and the softmax function is then used to get the misclassification probabilities:

transformed parameters {
  simplex[K] theta[J, K];
  ...
  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] = softmax(eta_raw[ccc[j], k] + kappa[j, k]);
  ...

For K = 3 categories, a vector of means of the multivariate normal mu with 2 or 2.5 for the diagonal (correct annotations) and 0 for the off-diagonal and the Cholesky factor of

     [,1] [,2] [,3]
[1,]  1.0  0.5  0.5
[2,]  0.5  1.0  0.0
[3,]  0.5  0.0  1.0

for L_Sigma perform quite similarly to dirichlet(7.5, 1, 1).

It would maybe make more sense to translate the whole model to the logistic scale, but this adaptation of the original Dawid-Skene model from the manual seems to be good enough for my current purposes. No divergent transitions occur any more :)

Hopefully my ideas are not totally wrong and will be of help for others who find this thread. A huge thank you again to @Bob_Carpenter!
Any feedback, alerts to problems, or suggestions for improvements are of course welcome.


Appendix
Full model: No pooling, dirichlet prior

data {
  int<lower=2> K; // number of categories
  int<lower=1> J; // number of coders
  vector<lower=0>[K] beta[J, K]; // dirichlet prior on probability that coder j assigns category k to coding unit i
  int<lower=1> I; // number of coding units
  int<lower=1> N; // number of codings
  int<lower=1> S; // number of independent samples
  int<lower=1,upper=K> y[N]; // codings
  int<lower=1,upper=I> ii[N]; // index coding units in data
  int<lower=1,upper=J> jj[N]; // index coder in data
  int<lower=1,upper=S> ss[N]; // index sample in data
  int<lower=1,upper=I> iii[I]; // index coding units to sample (not necessary, for completeness)
  int<lower=1,upper=S> sss[I]; // index sample to coding units
  vector<lower=0>[K] alpha[S]; // dirichlet prior on category prevalence in studies
}
parameters {
  simplex[K] pi[S]; // category prevalence in independent samples (with non-differential error assumption) 
  simplex[K] theta[J, K]; // misclassification probabilities for each coder (no pooling)
}
transformed parameters {
  vector[K] log_q_z[I];

  for (i in 1:I)
    log_q_z[i] = log(pi[sss[i]]);
  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] = log_q_z[ii[n], k] + log(theta[jj[n], k, y[n]]);
}
model {
  for (s in 1:S)
    pi[s] ~ dirichlet(alpha[s]);

  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(beta[j, k]);
    
  for (i in 1:I)
    target += log_sum_exp(log_q_z[i]);
}
generated quantities {
  vector[K] softmax_lqz[I]; // probabilities of each k for each coding unit
  for (i in 1:I)
    softmax_lqz[i] = softmax(log_q_z[i, ]);
}

Full model: Complete pooling with annotator groups, dirichlet prior

data {
  int<lower=2> K; // number of categories
  int<lower=1> C; // number of independent groups of coders
  vector<lower=0>[K] beta[C, K]; // dirichlet prior on probability that coder from group c assigns category k to coding unit i
  int<lower=1> I; // number of coding units
  int<lower=1> J; // number of coders
  int<lower=1> N; // number of codings
  int<lower=1> S; // number of independent samples
  int<lower=1,upper=K> y[N]; // codings
  int<lower=1,upper=I> ii[N]; // index coding units in data
  int<lower=1,upper=S> ss[N]; // index sample in data
  int<lower=1,upper=C> cc[N]; // index coder group in data
  int<lower=1,upper=I> iii[I]; // index coding units to sample (not necessary, for completeness)
  int<lower=1,upper=S> sss[I]; // index sample to coding units
  int<lower=1,upper=C> ccc[J]; // index coders to coder groups
  vector<lower=0>[K] alpha[S]; // dirichlet prior on category prevalence in studies
}
parameters {
  simplex[K] pi[S]; // category prevalence in independent samples (with non-differential error assumption) 
  simplex[K] theta[C, K]; // pooled misclassification probabilities (complete pooling per group)
}
transformed parameters {
  vector[K] log_q_z[I];

  for (i in 1:I)
    log_q_z[i] = log(pi[sss[i]]);
  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] = log_q_z[ii[n], k] + log(theta[cc[n], k, y[n]]);
}
model {
  for (s in 1:S)
    pi[s] ~ dirichlet(alpha[s]);

  for (j in 1:J)
    for (k in 1:K)
      theta[ccc[j], k] ~ dirichlet(beta[ccc[j], k]);

  for (i in 1:I)
    target += log_sum_exp(log_q_z[i]);
}
generated quantities {
  vector[K] softmax_lqz[I]; // probabilities of each k for each coding unit
  for (i in 1:I)
    softmax_lqz[i] = softmax(log_q_z[i, ]);
}

Full model: Partial pooling with annotator groups, multivariate normal prior on log scale

data {
  int<lower=2> K; // number of categories
  int<lower=1> C; // number of independent groups of coders
  vector[K] mu[C, K]; // means for multivariate normal
  matrix[K, K] L_Sigma[C, K]; // Cholesky factor of scale matrix;
  real<lower=0> sigma; // partial pooling SD
  int<lower=1> J; // number of coders
  int<lower=1> I; // number of coding units
  int<lower=1> N; // number of codings
  int<lower=1> S; // number of independent samples
  int<lower=1,upper=K> y[N]; // codings
  int<lower=1,upper=I> ii[N]; // index coding units in data
  int<lower=1,upper=J> jj[N]; // index coder in data
  int<lower=1,upper=S> ss[N]; // index sample in data
  int<lower=1,upper=I> iii[I]; // index coding units to sample (not necessary, for completeness)
  int<lower=1,upper=S> sss[I]; // index sample to coding units
  int<lower=1,upper=C> ccc[J]; // index coders to coder groups
  vector<lower=0>[K] alpha[S]; // dirichlet prior on category prevalence in studies
}
parameters {
  simplex[K] pi[S]; // category prevalence in independent samples (with non-differential error assumption)
  vector[K] eta_raw[C, K]; // overall misclassification probabilities on log scale (partial pooling)
  vector[K] kappa[J, K]; // coder deviation from overall misclassification probabilities on log scale (partial pooling)
}
transformed parameters {
  simplex[K] theta[J, K];
  vector[K] log_q_z[I];

  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] = softmax(eta_raw[ccc[j], k] + kappa[j, k]);

  for (i in 1:I)
    log_q_z[i] = log(pi[sss[i]]);

  for (n in 1:N)
    for (k in 1:K)
      log_q_z[ii[n], k] = log_q_z[ii[n], k] + log(theta[jj[n], k, y[n]]);
}
model {
  for (s in 1:S)
    pi[s] ~ dirichlet(alpha[s]);


  for (j in 1:J)
    for(k in 1:K) {
      eta_raw[ccc[j], k] ~ multi_normal_cholesky(mu[ccc[j], k], L_Sigma[ccc[j], k]);
      kappa[j, k] ~ normal(0, sigma);
    }

  for (i in 1:I)
    target += log_sum_exp(log_q_z[i]);
}
generated quantities {
  vector[K] softmax_lqz[I]; // probabilities of each k for each coding unit

  for (i in 1:I)
    softmax_lqz[i] = softmax(log_q_z[i, ]);
}

Edit: Fixed typo in model code (again).

Thanks for following up. That all sounds sensible.

Why is alpha S x K? I’d have thought you’d only have a single prevalence. But then I didn’t know what the doc on S meant, “number of independent samples” or the doc on pi, “non-differential error assumption”.

The alternative, of course, is to take the “golden annotators” as truth and not include any uncertainty at all. The way you have it should be more robust, though.

This also highlights just how much we need to vectorize the Dirichlet!

This is because of a common design of content analytical studies in communication research. There are often two separate samples: In the coder test (often called “intercoder reliability test”), all annotators (or coders, as they are called in comm research) annotate the same units. We then check whether the annotations of the same unit agree to an acceptable degree, often quantified with an intercoder reliability metric such as Krippendorff’s \alpha. If agreement is sufficient, the annotators start with the sample of the main study, where each unit is most often annotated by only one annotator (Note: The quite obvious shortcomings of this way of diagnosing annotation quality were actually one of the reasons I became interested in Dawid-Skene type models). The “non-differential error assumption” then means that the researchers assume the annotation quality to be the same in both the test and the main sample - there is only one theta (for each coder or coder group).

There are several recommendations on how to select the sample for the intercoder test. One recommendation is to select a random subset of all units. In this case, the category prevalence of both studies reflects the same population prevalence and only one pi (and, consequently, alpha) is necessary. However, the more common way to select units for the test sample is deliberate selection by the researchers. They basically try to select “good examples” for each category with as many different varieties as possible. In this process, units with categories which have low prevalence in the overall population are included in the test sample disproportionately more often. The category prevalence in the test sample does not indicate anything about the category prevalence in the study sample - this is what I meant by “independent samples” in the code comments. Consequently, I want to estimate pi separately for the samples and be able to specify different alpha - hence alpha[S] and pi[S]. Most of the time, only the pi for the main sample is of substantial interest.

Cool. Thanks for explaining. This makes a lot of sense if your final annotation task doesn’t match the original design in distribution. It’s one of the things I hadn’t thought about doing!

Me, too! Andrew and Jennifer Hill helped me formulate a model for my discrete annotation task, and we wound up rediscovering (re-inventing?) the Dawid and Skene model. I wrote a million blog entries on the topic back when I was a natural language processing engineer:

https://lingpipe-blog.com/category/data-annotation/

(some of them may list Breck Baldwin as the author because the changeover to his owning the domain also reassigned credit for the posts, which we can only reverse one post at a time).

Later, Becky Passonneau and I consolidated some of this thinking into a paper with an explanation of the Dawid and Skene model:

Passonneau and Carpenter. 2014. The Benefits of a Model of Annotation. TACL.

Cohen’s kappa and Krippendorff’s alpha don’t tell you anything about the quality of the annotations; they treat agreement in error (two annotators making the same mistake) as agreement (surprisingly common with hard corpora), and they give you no way to take all the ratings/codings/annotations and create a good corpus from them by correcting for annotator accuracy and biases.

Although it should be a hierarchical model fitted with full Bayes, I use a penalized MLE in the paper to make it easier to digest for NLP researchers, who aren’t the best statisticians in the world and say things like “You need to justify your use of Bayesian inference” while being totally OK with penalized MLE (the thing that really needs asymptotic justification) because it’s what everyone else does.

The paper also explains how (estimated) entropy can be used to measure corpus uncertainty and how mutual information can be used for active learning to pick annotators and/or items to annotate.

Lots of work that can still be done here. The thing to read next is Raykar et al.'s beautiful paper:

http://jmlr.csail.mit.edu/papers/v11/raykar10a.html

I have a Stan implementation, but it’s really just a matter of replacing the prevalence model with a logistic regression that’s estimated along with everything else.
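
In outline, that replacement looks like this in the models above (a sketch with assumed names: D item-level predictors x[i] and a coefficient matrix B; this is not the code from the paper):

data {
  int<lower=1> D;   // number of item-level predictors
  vector[D] x[I];   // predictors for each coding unit
}
parameters {
  matrix[K, D] B;   // multinomial logistic regression coefficients
}
transformed parameters {
  vector[K] log_q_z[I];
  for (i in 1:I)
    log_q_z[i] = log_softmax(B * x[i]);  // item-specific prevalence replaces log(pi)
  // ... then add log(theta[jj[n], k, y[n]]) over the codings exactly as before
}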

I also have a lot more information on things like how to do cross-validation and actually measure these things like an epidemiologist, along with a bunch of simpler models in this technical report from 2008 (wow, that seems like a long time ago now):

Carpenter. 2008. Multilevel Bayesian Models of Categorical Data Annotation

This was my “learning Bayes” paper! It was all done with BUGS, as there was no Stan. But it was very brittle—the models would take 24 hours to run and often crash. I keep meaning to take all this and turn it into a Stan case study.

This would make a great submission for StanCon by the way!

Thank you so much for the input. I actually read your TACL paper some time ago and really liked it. I also tried out your implementation of the Dawid Skene model in R on GitHub, but the penalized MLE version did not work so well for the typical data sets with rather small test samples in comm research, so I abandoned the approach and then forgot about it…
I recently re-discovered the repository and used some functions from there to get the initial values for the Stan model, but somehow never made the connection between the authors of the paper and the Stan forum. The “Multilevel Bayesian Models of Categorical Data Annotation” paper looks great; I look forward to reading it, as well as the Raykar et al. paper.

My next step will be to try to write a basic introductory paper on the approach for communication researchers who often conduct content analyses. This will be a challenge for reasons similar to the ones you described above for NLP researchers. Also, Krippendorff himself is almost always a reviewer for articles about reliability in content analysis in comm research, and he is very protective (to say the least) of his reliability measures. We had this experience a few months ago when we tried to publish (and in the end succeeded, thanks to the editor) a paper on simple methods to account for measurement error in content analysis. So we will see how this advances… I will share a working paper here when I find the time to write it up…