Specifying a cumulative ordinal probit model with heterogeneous thresholds

I am trying to fit a multiple-rater cumulative ordinal probit model to data where raters judge the quality of student essays on a 1 (abysmal) to 6 (outstanding) scale. The dataset is cross-classified, such that raters see different essays and essays are judged by different raters.

In this model, I am making the following assumptions:

  1. Essays have an underlying “goodness” that is measured on the probit scale, and raters’ estimates of this “goodness” vary only due to individual differences across raters.
  2. Raters exhibit different levels of precision in their ratings (i.e., some are more spot-on than others in estimating how good the essays are).
  3. Thresholds differ across raters, such that the distance between thresholds varies across raters (i.e., the difference in thresholds across raters is not just a collective shift of the thresholds).

Given these assumptions, my goal for this model is to obtain estimates of the following:

  • the underlying “goodness” of each essay i on the probit scale (zeta_i)
  • the variance term that governs the precision of rater j’s estimates of the essays’ “goodness” (sigma2_j)
  • the thresholds applied by each rater j (tau_j)
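
In equation form (writing sigma_j for sqrt(sigma2_j), and with the usual conventions tau_{j,0} = -infinity and tau_{j,K} = +infinity), the model I have in mind is:

```latex
P(Y_n = k \mid \zeta_i, \sigma_j, \tau_j)
  = \Phi\!\left(\frac{\tau_{j,k} - \zeta_i}{\sigma_j}\right)
  - \Phi\!\left(\frac{\tau_{j,k-1} - \zeta_i}{\sigma_j}\right),
  \qquad k = 1, \dots, K
```

where i and j index the essay and rater attached to observation n.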

I initially sought to fit this model in brms, but it would only allow for Assumption 3 (through cs(1) | Rater) if I modeled my data with an adjacent-category ordinal model. Because I’m trying to model my data as a cumulative probit model for theoretical reasons, I adapted the Stan code that brms generated and attempted to hard-code the model I want in Stan with the following syntax:

functions {
  /* Adapted from brms:
   *   - I changed disc here to refer to variance
   *
   * cumulative-probit log-PMF for a single response
   * Args:
   *   y: response category
   *   mu: latent mean parameter
   *   disc: variance parameter
   *   thres: ordinal thresholds
   * Returns:
   *   a scalar to be added to the log posterior
   */
  real cumulative_probit_lpmf(int y, real mu, real disc, row_vector thres) {
    int nthres = num_elements(thres);
    real p;
    if (y == 1) {
      p = Phi((1 / sqrt(disc)) * (thres[1] - mu));
    } else if (y == nthres + 1) {
      p = 1 - Phi((1 / sqrt(disc)) * (thres[nthres] - mu));
    } else {
      p = Phi((1 / sqrt(disc)) * (thres[y] - mu)) -
          Phi((1 / sqrt(disc)) * (thres[y - 1] - mu));
    }
    return log(p);
  }
}
data {
  int K;               // number of categories
  int N;               // number of ratings
  int y[N];            // response variable
  int n_raters;        // number of unique raters
  int n_essays;        // number of unique essays
  int index_raters[N]; // index of raters
  int index_essays[N]; // index of essays
}
parameters {
  // latent goodness of each essay
  real zeta_i[n_essays];

  // variance for each rater; index of precision heterogeneity
  real<lower = 0.001> sigma2_j[n_raters];

  // vector of threshold placeholders for each rater
  row_vector<lower = -10, upper = 10>[K - 1] tau_raw[n_raters];
}
transformed parameters {
  // vector of thresholds for each rater
  row_vector[K - 1] tau_j[n_raters];

  for (j in 1:n_raters) {
    // sort tau_raw to meet the constraint that lower category thresholds
    // are smaller than upper category thresholds
    tau_j[j] = sort_asc(tau_raw[j]);
  }
}
model {
  // Priors
  target += normal_lpdf(zeta_i | 0, 1);
  target += inv_gamma_lpdf(sigma2_j | 1.5, 1.5);

  for (j in 1:n_raters) {
    target += uniform_lpdf(tau_raw[j] | -10, 10);
  }

  // Likelihood
  for (n in 1:N) {
    target += cumulative_probit_lpmf(y[n] | zeta_i[index_essays[n]],
                                     sigma2_j[index_raters[n]],
                                     tau_j[index_raters[n]]);
  }
}
For the most part, chains converged and R-hats seemed acceptable for my parameters after running this model with Stan defaults in R on a trial dataset of 58 raters judging 58 essays (~3000 observations), although the ESS values are much smaller for the heterogeneous thresholds.

My primary issue is that running this model took 5 hours on a computer with 16 GB RAM and 4 cores (despite gradient evaluation taking only 0.003 sec). In contrast, running the same model with old MATLAB code from a reference book that uses Metropolis-Hastings takes only a minute or two. The results are similar between the Stan model and the MATLAB model, except for the thresholds (presumably because of the small ESS).

I anticipated some runtime difference from the MATLAB code because of Stan’s workhorse sampler (NUTS is costlier per iteration than Metropolis-Hastings), but the fact that it takes 5 hours to run and that the time differential is this massive suggests to me that I’ve specified my model to do something inefficient, if not flat-out wrong. I don’t quite know how to diagnose the inefficiency in my model, which is why I’m turning here for advice.

I thought that maybe one of the reasons is that I’m forcing the model to estimate more things than I need, e.g., tau_raw. My results indicate problems estimating tau_raw (minuscule ESS and massive R-hats), but I only really need it as a placeholder so its values can be sorted into tau_j. Is there a way to bypass estimating tau_raw, or at least to avoid keeping tau_raw’s estimates, so that I only get tau_j estimates? I tried to parameterize tau_j as type ordered, but that made results bad for all the parameters I’m estimating (i.e., ESS in the 20s and big R-hats)…
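
For concreteness, the ordered-type variant I tried replaces tau_raw and the sort with declarations along these lines (only the blocks that change are shown; the normal(0, 5) prior is an illustrative stand-in, since ordered vectors cannot take the lower/upper bounds I put on tau_raw):

```stan
parameters {
  // one ordered threshold vector per rater; Stan enforces
  // tau_j[j][1] < tau_j[j][2] < ... automatically, so no
  // tau_raw placeholder and no sort_asc() are needed
  ordered[K - 1] tau_j[n_raters];
}
model {
  for (j in 1:n_raters) {
    target += normal_lpdf(tau_j[j] | 0, 5); // placeholder prior
  }
}
```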

Hey there! Welcome to the Stan forum! Sorry you did not get a reply earlier…

There is probably no better person to ask than @paul.buerkner. Hopefully he is not too busy. :)


I don’t have time to check the Stan code in detail, but with brms 2.12.0+ you can have heterogeneous thresholds even in the cumulative model via

brm(y | thres(..., gr = Rater) ~ ..., family = cumulative("probit"))

I recommend installing the latest GitHub version of brms rather than the CRAN version, since the latter still has a bug in the grouped-threshold Stan code.


Thanks for the tip, @paul.buerkner! I installed the latest GitHub version and ran my model using the following syntax. I’m trying to model both heterogeneous thresholds and heterogeneous disc (or SD) parameters across the raters (as they’re theoretically motivated), so my syntax is:

brm(bf(Rating | thres(gr = Rater) ~ 1 + (1 | Rater) + (1 | Essay)) +
      lf(disc ~ 0 + (1 | Rater)),
    family = cumulative("probit"), data = essay,
    prior = c(set_prior("uniform(-10,10)", class = "Intercept"),
              set_prior("normal(0,1)", class = "sd", group = "Essay")),
    control = list(adapt_delta = 0.98),
    chains = 4,
    warmup = 2000,
    init_r = 0.1,
    cores = parallel::detectCores(),
    iter = 4000)

The brm model ran much more quickly than the one I hard-coded. But I got the following warning messages (among others) from the brm model that I did not get from my hard-coded model:

Warning messages:
2: There were 7415 divergent transitions after warmup. Increasing adapt_delta above 0.98 may help. See ...
3: There were 585 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See ... 

I’m acquainted with the problems associated with divergent transitions, and that is a lot of divergent transitions given 8000 (2000 x 4) post-warmup samples. I’ve already incrementally increased adapt_delta from the 0.8 default, but I have still persistently gotten these warnings with the brm model. The ESS (both Bulk and Tail) are also very small (only in the tens!) compared to my hard-coded model, probably because of the divergent transitions.

I don’t really know where things are going wrong here, and I’d very much appreciate input on how to resolve the issue!

It looks like there is a fundamental issue that leads to so many divergent transitions. What you can try is to replace probit with logit, which is usually more numerically stable.

Also, I would start with a simpler model and gradually expand toward more complicated ones to diagnose at which point the problem occurs.