So, I’m attempting to estimate “true” ratings for 10,000 pieces of media, based on 1-5 scores from >40k users, each of whom has rated 10-100 items. I’m using a basic graded item response theory model without a discrimination parameter. To my dismay, despite all of the fancy reduce_sum and vectorization magic I’ve implemented but don’t fully understand, and despite my 40-core Xeon virtual server, I can’t get it to run at a workable pace. The best it manages, after limiting the data to 2000 users and 200 books, is 1 hour for 2k samples. That seems absurdly slow to me. Any ideas?

functions {
  real partial_sum(int[] gradeSlice,
                   int start, int end,
                   vector quality,
                   vector[] difficulty,
                   int[] gradeStorySlice,
                   int[] gradeAuthorSlice) {
    return ordered_logistic_lpmf(gradeSlice | quality[gradeStorySlice[start:end]],
                                 difficulty[gradeAuthorSlice[start:end]]);
  }
}
data {
  int<lower=1> numStories;
  int<lower=1> numAuthors;
  int<lower=1, upper=(numAuthors*numStories)> numGrades;
  int<lower=3> maxGrade;
  int<lower=1, upper=numAuthors> gradeAuthor[numGrades];
  int<lower=1, upper=numStories> gradeStory[numGrades];
  int<lower=1, upper=maxGrade> grade[numGrades];
}
parameters {
  ordered[maxGrade-1] difficulty[numAuthors];
  vector[numStories] quality;
  real muEasiestGradeDiff;
  real<lower=0> sgEasiestGradeDiff;  // used as an sd below, so it must be constrained positive
  vector<lower=0>[maxGrade-2] muDifficultyStepSize;
  vector<lower=0>[maxGrade-2] sgDifficultyStepSize;
}
model {
  quality ~ std_normal();
  muEasiestGradeDiff ~ normal(-1, 1);
  sgEasiestGradeDiff ~ lognormal(0.5, 1);
  difficulty[1:numAuthors, 1] ~ normal(muEasiestGradeDiff, sgEasiestGradeDiff);
  muDifficultyStepSize ~ lognormal(0.5, 1);
  sgDifficultyStepSize ~ lognormal(0.5, 0.5);
  for (j in 2:(maxGrade-1)) {
    to_vector(difficulty[1:numAuthors, j]) - to_vector(difficulty[1:numAuthors, j-1])
      ~ lognormal(muDifficultyStepSize[j-1], sgDifficultyStepSize[j-1]);
  }
  target += reduce_sum(partial_sum, grade, 1, quality, difficulty, gradeStory, gradeAuthor);
}

I agree that something seems off with that performance. Is it possible that you aren’t using the ordered_logistic properly? That is, it takes two arguments, one that specifies the mean of the latent normal and one that specifies a vector of latent cutpoints defining bins on the scale of the latent normal. The naming (“difficulty”) of the variable that you pass as the cutpoints leads me to wonder if you might not realize the purpose of the cutpoints (though it’s equally possible I’m not understanding your intent for that parameter).
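For concreteness, here is a small Python sketch (illustrative values only, not your actual parameters) of how the two arguments combine: ordered_logistic treats the first argument as the location η of a latent logistic variable, and the cutpoints as thresholds partitioning the latent scale into K bins, with P(y = k) = logit⁻¹(c_k − η) − logit⁻¹(c_{k−1} − η):

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordered_logistic_pmf(eta, cutpoints):
    """Category probabilities for an ordered-logistic outcome.

    eta: location of the latent logistic variable (e.g. quality[story]).
    cutpoints: increasing vector of K-1 thresholds (e.g. difficulty[author]).
    Returns a length-K vector of probabilities for categories 1..K.
    """
    # Latent CDF evaluated at each cutpoint, padded with 0 and 1
    # for the two open-ended bins at the ends of the scale.
    cdf = np.concatenate(([0.0], inv_logit(np.asarray(cutpoints) - eta), [1.0]))
    return np.diff(cdf)

# A 1-5 scale needs 4 ordered cutpoints (maxGrade - 1).
probs = ordered_logistic_pmf(eta=0.3, cutpoints=[-2.0, -0.5, 0.8, 2.2])
```

Note that raising the cutpoints shifts probability mass toward lower grades, which is why “difficulty” is a reasonable name for a per-user threshold vector, if that is indeed what you intend.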

Also, ordinal models typically need some constraint to ensure identifiability, and I’m not seeing one anywhere in your code. Maybe see here and here?
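To make the identifiability concern concrete: adding the same constant to the latent location and to every cutpoint leaves the ordered-logistic probabilities unchanged, so nothing in the likelihood pins down the overall location of quality versus difficulty. A quick numerical check (hypothetical values):

```python
import numpy as np

def ordered_logistic_pmf(eta, cutpoints):
    # P(y = k) via differences of the latent logistic CDF at the cutpoints.
    cdf = np.concatenate(([0.0], 1 / (1 + np.exp(-(np.asarray(cutpoints) - eta))), [1.0]))
    return np.diff(cdf)

eta = 0.3
cuts = np.array([-2.0, -0.5, 0.8, 2.2])
shift = 1.7  # any constant

p0 = ordered_logistic_pmf(eta, cuts)
p1 = ordered_logistic_pmf(eta + shift, cuts + shift)  # identical probabilities
```

Only the priors are left to (weakly) resolve that translation, which can make the posterior hard to sample.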

Finally, if you intend to index the full range of a dimension of a variable, you can just omit the index: instead of difficulty[1:numAuthors, 1] you can simply write difficulty[, 1]. I don’t think that’s likely to cause much (if any) performance degradation, but it is easier to read and less error-prone to write.

That model has about 8200 parameters. That shouldn’t take more than a few minutes if the model is well-identified. More than an hour sounds like the model parametrization is poorly-identified and the sampler may be hitting the treedepth limit.

So the first question is how large is the treedepth__? And what is the effective sample size?

Identifiability aside, reduce_sum is most effective when (1) the partial_sum function does a lot of computation and (2) you’re slicing an array of parameters. Here (1) does not hold because there’s only one statement in the function; the overhead of copying data around may well cost more time than the parallel calculation saves. You also have a problem with (2): the first argument, gradeSlice, is data, not a parameter. That means quality and difficulty must be copied in their entirety to every thread (even the parts that are not used, because the program is not smart enough to predict which entries each thread needs), and unfortunately parameters are more expensive to copy than data. A better approach would be to order the grades by user and then make difficulty the primary sliced argument to reduce_sum.
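Here is a sketch of what that reordering buys you, written in Python with toy data rather than Stan (so the names and sizes are hypothetical, and this is only the arithmetic, not a drop-in replacement): once grades are sorted by user, slicing over users means each chunk only ever touches its own cutpoint vectors, and the chunked log-likelihoods still sum to the full one.

```python
import numpy as np

def ordered_logistic_lpmf(y, eta, cutpoints):
    # Log-probability of grade y (1..K) given latent location and cutpoints.
    cdf = np.concatenate(([0.0], 1 / (1 + np.exp(-(cutpoints - eta))), [1.0]))
    return np.log(cdf[y] - cdf[y - 1])

rng = np.random.default_rng(0)
n_authors, n_stories = 6, 4
difficulty = np.sort(rng.normal(size=(n_authors, 4)), axis=1)  # per-author cutpoints
quality = rng.normal(size=n_stories)

# Grades sorted by author, so each author's grades are contiguous.
author = np.repeat(np.arange(n_authors), 3)
story = rng.integers(0, n_stories, size=author.size)
grade = rng.integers(1, 6, size=author.size)

def partial_sum(author_slice):
    """One reduce_sum 'chunk': only this slice's difficulty rows are needed."""
    total = 0.0
    for a in author_slice:
        for i in np.flatnonzero(author == a):  # this author's grades only
            total += ordered_logistic_lpmf(grade[i], quality[story[i]], difficulty[a])
    return total

full = sum(ordered_logistic_lpmf(grade[i], quality[story[i]], difficulty[author[i]])
           for i in range(author.size))
chunks = partial_sum(range(0, 3)) + partial_sum(range(3, 6))  # matches full
```

The point is that when the sliced argument is the parameter array itself, each worker receives only its slice of difficulty instead of a copy of the whole thing.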

The number of effective samples is 1-5k for each of the difficulty parameters and ~200 for the books; the diagnostics don’t raise any warnings and say the treedepth is fine. I’ve inspected the actual results for the 2k-user/200-book mini-dataset and they seem accurate, inasmuch as they’re close enough to the raw average ratings to be plausible. I have also tried simplifying the model by removing some of the hyperpriors and just fixing the standard deviations at 0.5, 0.75, etc. - that barely changes the results or the speed.

The number of ordinal categories is indeed 5. I actually underestimated the number of observations for the 2000 & 200 dataset; it’s around 60k.

I’m mostly going off of this: IRT simulation notebook FINAL. The model you see is similar to that one with a few adjustments. The first parameter to ordered_logistic (quality), representing the quality of the media item, is analogous to user ability in a standard IRT model. The second parameter (difficulty) represents the user’s standards for each score on the 1-5 scale. I think I’m using them correctly?