I am trying to fit a multiple-rater cumulative ordinal probit model to data where raters judge the quality of student essays on a 1 (abysmal) to 6 (outstanding) scale. The dataset is cross-classified, such that raters see different essays and essays are judged by different raters.

In this model, I am making the following assumptions:

- Essays have underlying “goodness” that is measured in the probit scale and that raters’ estimates of this “goodness” vary only due to individual differences across raters.
- Raters exhibit different levels of precision in their ratings (i.e., some are more spot-on in estimating how good the essays are)
- Thresholds differ across raters, such that the
*distance between thresholds*vary across raters (i.e., the difference in thresholds across raters is not just a collective shift in the thresholds).

Given these assumptions, my goal for this model is to obtain estimates of the following:

- the underlying “goodness” of each essay i in the probit scale (
`zeta_i`

) - the variance term that governs the precision of rater j‘s estimates of the essays’ “goodness” (
`sigma2_j`

) - the thresholds applied by each rater j (
`tau_j`

)

I initially sought to fit this model in `brms`

but it would only allow for Assumption 3 (through `cs(1) | Rater`

) if I modeled my data using an *adjacent-category* ordinal model. Because I’m trying to model my data as a cumulative probit model for theoretical reasons, I leveraged Stan code that I understood from what `brms`

generated and attempted to hard-code the model I want in Stan with the following syntax:

```
functions{
/* Adapted from brms:
* - I changed disc here to refer to variance
* cumulative-probit log-PDF for a single response
* Args:
* y: response category
* mu: latent mean parameter
* disc: variance parameter
* thres: ordinal thresholds
* Returns:
* a scalar to be added to the log posterior
*/
real cumulative_probit_lpmf(int y, real mu, real disc, row_vector thres) {
int nthres = num_elements(thres);
real p;
if (y == 1) {
p = Phi((1/sqrt(disc)) * (thres[1] - mu));
} else if (y == nthres + 1) {
p = 1 - Phi((1/sqrt(disc)) * (thres[nthres] - mu));
} else {
p = Phi((1/sqrt(disc)) * (thres[y] - mu)) -
Phi((1/sqrt(disc)) * (thres[y - 1] - mu));
}
return log(p);
}
}
data {
int K; // number of categories
int N; // number of ratings
int y[N]; // response variable
int n_raters; // number of unique raters
int n_essays; // number of unique essays
int index_raters[N]; // index of raters
int index_essays[N]; // index of essay
}
parameters {
// latent goodness of essay
real zeta_i[n_essays];
// variance for each rater; index of precision heterogeneity
real<lower = 0.001> sigma2_j[n_raters];
// vector of threshold placeholders for each rater
row_vector<lower = -10, upper = 10>[K-1] tau_raw[n_raters];
}
transformed parameters {
// vector of thresholds for each rater
row_vector<lower = -10, upper = 10>[K-1] tau_j[n_raters];
for (j in 1:n_raters) {
// sort tau_raw to meet constraint that lower category thresholds are
// smaller than upper category thresholds
tau_j[j] = sort_asc(tau_raw[j]);
}
}
model {
// Priors
target += normal_lpdf(zeta_i | 0, 1);
target += inv_gamma_lpdf(sigma2_j | 1.5, 1.5);
for (j in 1:n_raters) {
target += uniform_lpdf(tau_raw[j] | -10, 10);
}
// Likelihood
for (n in 1:N) {
target += cumulative_probit_lpmf(y[n] | zeta_i[index_essays[n]], sigma2_j[index_raters[n]], tau_j[index_raters[n]]);
}
}
```

For the most part, chains converged and R-hats seemed acceptable for my parameters after running this model with Stan defaults in R on a trial dataset of 58 raters judging 58 essays (~3000 observations), although the ESS are way smaller for the heterogenous thresholds.

My primary issue is that running this model took **5 hours** on a computer that has 16 GB RAM and 4 cores (despite gradient evaluation taking only 0.003 sec). In contrast, running the same model with an old MATLAB code from a reference book that uses Metropolis-Hastings only takes a minute or two to run. The results are similar between the Stan model and the MATLAB model, except for the thresholds (presumably because of the small ESS).

I anticipated some runtime differences with the MATLAB code maybe because of Stan’s workhorse, but the fact that it takes 5 hours to run and that the time differential is massive suggests to me that I’ve specified my model to do something inefficiently, if not flat out wrong. I don’t quite know how to diagnose the inefficiency in my model, which is why I’m turning here for advice.

I thought that maybe one of the reasons is because I’m forcing the model to estimate more things that I need, e.g. `tau_raw`

. My results indicate problems estimating `tau_raw`

(miniscule ESS and massive R-hats), but I only really need it as a placeholder for the threshold estimates so it could be sorted in `tau_j`

. Is there a way to bypass the estimation of `tau_raw`

or bypass keeping `tau_raw`

's estimates so that I only get `tau_j`

estimates? I tried to parameterize tau_j as type `ordered`

but that made results bad for all the parameters I’m estimating (i.e., ESS ~ 20s and big R-hats)…