Possible confusion around 'grainsize' argument for 'reduce_sum'

Hi all,

In a post of mine a few weeks ago, others suggested I use reduce_sum via cmdstanr to scale an IRT model I had built. I was able to implement reduce_sum, and the model compiles with multithreading support and runs properly. However, it became vastly slower (contrary to the goal!). After some debugging and experimenting with the model, I discovered that the ‘grainsize’ was at fault and was able to recover the lost speed, but only back to the level of my original model, which had not used reduce_sum.

Initially, I had declared ‘grainsize’ within the data block and passed ‘grainsize = 1’ as part of my data list for the model. The choice of 1 was meant to enable the automatic grainsize calibration suggested in the documentation. However, I do not believe this is happening, given both the massive increase in computation time and the speedup I recover by setting a more reasonable grainsize of roughly 400.
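Concretely, the relevant pieces look roughly like this (a minimal sketch rather than my exact code; partial_log_lik and seq stand in for the brms-generated partial sum function and its 1:N index array):

data {
  // read in as data so it can be tuned without recompiling;
  // grainsize = 1 is meant to leave the chunking to the scheduler
  int<lower=1> grainsize;
}
model {
  // seq is the sliced argument; everything after grainsize is shared
  target += reduce_sum(partial_log_lik, seq, grainsize,
                       Y, mu, disc, Intercept);
}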

My question is: what would cause the automatic calibration to break or otherwise not trigger (for lack of a better word), and do others have suggestions for how I might find a reasonable grainsize on my own to work around this issue? Or is it more likely that the error is on my end, in how I have constructed my implementation of reduce_sum?

Edit 1: My problems may also stem from what I believe is a non-vectorized function used in the likelihood calculations of my model. The Stan code was generated using brms::make_stancode(). The function and likelihood bits are as follows:

functions {
  /* cumulative-logit log-PDF for a single response
   * Args:
   *   y: response category
   *   mu: latent mean parameter
   *   disc: discrimination parameter
   *   thres: ordinal thresholds
   * Returns:
   *   a scalar to be added to the log posterior
   */
  real cumulative_logit_lpmf(int y, real mu, real disc, vector thres) {
    int nthres = num_elements(thres);
    real p;
    if (y == 1) {
      p = inv_logit(disc * (thres[1] - mu));
    } else if (y == nthres + 1) {
      p = 1 - inv_logit(disc * (thres[nthres] - mu));
    } else {
      p = inv_logit(disc * (thres[y] - mu)) -
          inv_logit(disc * (thres[y - 1] - mu));
    }
    return log(p);
  }
}
model {
  for (n in 1:N) {
    target += cumulative_logit_lpmf(Y[n] | mu[n], disc[n], Intercept);
  }
}

Is it possible that my issues with ‘grainsize’ are actually a result of the non-vectorized nature of the bulk of my model’s calculations?
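For what it’s worth, the threaded code that brms generates wraps this same loop in a partial sum function, so each reduce_sum slice still calls the scalar lpmf once per observation. Roughly (paraphrasing with simplified names like partial_log_lik, so not the exact generated code):

functions {
  // partial sum over a slice of the index sequence; reduce_sum
  // supplies start and end for each chunk it schedules
  real partial_log_lik(array[] int seq, int start, int end,
                       array[] int Y, vector mu, vector disc,
                       vector Intercept) {
    real ptarget = 0;
    for (n in start:end) {
      ptarget += cumulative_logit_lpmf(Y[n] | mu[n], disc[n], Intercept);
    }
    return ptarget;
  }
}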

Hi, sorry, but I think your post was lost in the general confusion here :) I haven’t worked with reduce_sum (yet) myself, but the grainsize expert, @wds15, could definitely help you :)

How does the full reduce_sum version look?

Is your data sorted, by any chance? It should not be sorted.

You may use the static version, reduce_sum_static, and increase the number of chunks (by lowering the grainsize) to see the overhead due to chunking.
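Something like this (a sketch reusing the assumed names from above): with reduce_sum_static the partitioning depends only on the grainsize, so timing the model at several grainsizes shows how runtime grows with the number of chunks.

model {
  // deterministic chunking: halving grainsize roughly doubles the
  // number of chunks, so runtime differences expose the overhead
  target += reduce_sum_static(partial_log_lik, seq, grainsize,
                              Y, mu, disc, Intercept);
}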