So, my real model is a time series with ~1000 observations (3 years with daily data) and ~25 predictors.
You can imagine this as a 1000x25 matrix with each cell representing the contribution of the ith predictor on the jth observation. To get the final prediction, sum across the rows.
The columns are independent, whereas the rows are not. For each of the 25 predictors, there is a fairly expensive set of operations to fill in the column.
It takes about ~2 days to fit the model as-is. And I’d like to run it for longer! We expect to have to make regular tweaks with new data so bringing the runtime down is a major priority.
Without multithreading, the gradient evaluation is about ~0.35 seconds.
Since the unit of parallelization has to be the column,
reduce_sum is out of the question, since we need to return a
vector, and not a
map_rect is very much in play.
However, when I “repacked” the model and used
map_rect, with 8 threads, the gradient evaluation jumped to 6.2 seconds – so about a ~20x jump. Within this setup, more threads is better (with 4 threads, gradient eval was ~8 seconds), but not approaching the efficiency the non-multithreaded model. Aside from passing the data back and forth, there are no additional operations in the model between the non-multithreaded and the multithreaded versions.
To test how much of an impact the passing of data back and forth had, I wrote a toy model to illustrate the issue. It’s all in the notebook here.
This is on a Mac. I used the release candidate for
cmdstanr to compile the model. I haven’t done anything with MPI and I don’t know how to! (Sidenote, the stan wiki for MPI parallelism which is linked-to in several places seems to be dead. I have not found an alternative).
I am wondering if there is anything that can be done!