Possible confusion around 'grainsize' argument for 'reduce_sum'

Hi all,

In a post of mine a few weeks ago, others suggested I use reduce_sum via cmdstanr to enable scaling of an IRT model I had built. I was able to implement reduce_sum, and the model compiles with multithreading support and runs properly. However, it is now vastly slower (contrary to the goal!). After a little debugging and playing with the model, I discovered that the 'grainsize' was at fault; adjusting it recovered performance, but only back to speeds equivalent to my original model, which had not used reduce_sum.

Initially, I had declared 'grainsize' within the data block and passed 'grainsize = 1' as part of my data list for the model. The choice of 1 was meant to enable the automatic grainsize calibration suggested in the documentation. However, I do not believe this is happening, given the massive increase in computation time, as well as the speedup I see from setting a more reasonable grainsize of roughly 400.
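For context, the usual pattern for passing grainsize as data looks like the sketch below. The partial-sum function and the parameter names (mu, disc, Intercept) are illustrative, not taken from the actual model; parameter declarations are elided.

```stan
functions {
  // Partial sum over a contiguous slice of responses, where
  // cumulative_logit_lpmf is the brms-generated ordinal likelihood.
  real partial_log_lik_lpmf(array[] int y_slice, int start, int end,
                            vector mu, vector disc, vector thres) {
    real lp = 0;
    for (n in start:end) {
      lp += cumulative_logit_lpmf(y_slice[n - start + 1] | mu[n], disc[n], thres);
    }
    return lp;
  }
}
data {
  int<lower=1> N;
  array[N] int Y;
  int<lower=1> grainsize;  // grainsize = 1 requests automatic scheduling
}
model {
  // parameter declarations elided
  target += reduce_sum(partial_log_lik_lpmf, Y, grainsize, mu, disc, Intercept);
}
```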

My question is, what would cause the automatic calibration to break/otherwise not trigger (for lack of a better word), and do others have suggestions for how I might find a reasonable grainsize on my own to work around this issue? Also, is it more likely that this would be an error on my end in how I have constructed my implementation of reduce_sum?

Edit 1: My problems may also stem from what I believe is a non-vectorized function used in the likelihood calculations of my model. The Stan code was generated using brms::make_stancode(). The function and likelihood bits are as follows:

functions {
  /* cumulative-logit log-PDF for a single response
   * Args:
   *   y: response category
   *   mu: latent mean parameter
   *   disc: discrimination parameter
   *   thres: ordinal thresholds
   * Returns:
   *   a scalar to be added to the log posterior
   */
  real cumulative_logit_lpmf(int y, real mu, real disc, vector thres) {
    int nthres = num_elements(thres);
    real p;
    if (y == 1) {
      p = inv_logit(disc * (thres[1] - mu));
    } else if (y == nthres + 1) {
      p = 1 - inv_logit(disc * (thres[nthres] - mu));
    } else {
      p = inv_logit(disc * (thres[y] - mu)) -
          inv_logit(disc * (thres[y - 1] - mu));
    }
    return log(p);
  }
}
model {
  for (n in 1:N) {
    target += cumulative_logit_lpmf(Y[n] | mu[n], disc[n], Intercept);
  }
}

Is it possible that my issues with 'grainsize' are actually a result of the non-vectorized nature of the bulk of my model's calculations?
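If disc is constant across observations (the brms default unless a disc formula is specified), the per-observation loop could in principle be replaced by Stan's built-in vectorized ordinal likelihood. This is only a sketch under that assumption, not a drop-in replacement for models with varying disc:

```stan
model {
  // When disc[n] == 1 for all n, this is equivalent to the loop above:
  // P(Y <= k) = inv_logit(Intercept[k] - mu[n]).
  // With per-observation disc, the effective cutpoints disc[n] * Intercept
  // differ by row, so this vectorized form does not apply.
  target += ordered_logistic_lpmf(Y | mu, Intercept);
}
```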

Hi, sorry, but I think your post was lost in the general confusion here :) I haven't worked with reduce_sum (yet) myself, but the grainsize expert, @wds15, could definitely help you :)

How does the full reduce_sum version look?

Is your data sorted, by any chance? It should not be sorted.

You may use the static version, reduce_sum_static, and increase the number of chunks (by decreasing the grainsize) to see the overhead due to chunking.
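To probe that overhead, one can swap reduce_sum for reduce_sum_static and time runs at several hand-set grainsize values. The sketch below assumes a partial-sum function partial_log_lik_lpmf with the standard reduce_sum signature; the names are illustrative:

```stan
model {
  // reduce_sum_static partitions the data deterministically into chunks
  // of at most grainsize elements, so halving grainsize roughly doubles
  // the chunk count. Comparing wall times across grainsize values exposes
  // the per-chunk overhead.
  target += reduce_sum_static(partial_log_lik_lpmf, Y, grainsize,
                              mu, disc, Intercept);
}
```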