So, my real model is a time series with ~1000 observations (3 years of daily data) and ~25 predictors.

You can imagine this as a 1000x25 matrix, with cell (i, j) holding the contribution of the jth predictor to the ith observation. To get the final prediction for each observation, sum across its row.
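
To make that concrete, here is a minimal sketch of the structure in Stan (the names are placeholders and the per-column computation is a trivial stand-in, not the real model):

```stan
data {
  int<lower=1> N;    // ~1000 observations
  int<lower=1> K;    // ~25 predictors
  matrix[N, K] X;    // predictor inputs
  vector[N] y;
}
parameters {
  vector[K] theta;
  real<lower=0> sigma;
}
transformed parameters {
  matrix[N, K] contrib;  // contrib[i, k]: predictor k's contribution to observation i
  vector[N] mu;
  for (k in 1:K)
    contrib[1:N, k] = col(X, k) * theta[k];  // trivial stand-in for the expensive per-column work
  mu = contrib * rep_vector(1.0, K);         // row sums give each observation's prediction
}
model {
  theta ~ std_normal();
  sigma ~ exponential(1);
  y ~ normal(mu, sigma);
}
```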

The *columns* are independent, whereas the rows are not. For each of the 25 predictors, there is a fairly expensive set of operations to fill in its column.

It takes ~2 days to fit the model as-is, and I’d like to run it for longer! We expect to have to make regular tweaks as new data come in, so bringing the runtime down is a major priority.

Without multithreading, a single gradient evaluation takes ~0.35 seconds.
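
As an aside, per-section timings like this can be collected with Stan’s `profile` statements (available since Stan 2.26), which report forward- and reverse-pass times for a named section. For example, wrapping the expensive loop from the sketch above (the profile name is arbitrary):

```stan
transformed parameters {
  matrix[N, K] contrib;
  vector[N] mu;
  profile("per_column_work") {
    for (k in 1:K)
      contrib[1:N, k] = col(X, k) * theta[k];  // the part being timed
  }
  mu = contrib * rep_vector(1.0, K);
}
```

`cmdstanr` exposes the results via the fit’s `$profiles()` method.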

Since the unit of parallelization has to be the column, `reduce_sum` is out of the question: we need to return a `vector`, not a `real`. But `map_rect` is very much in play.
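
For reference, this is the shape `map_rect` requires of its mapped function, which is what makes per-column parallelism workable here (`shard_fn` is a hypothetical name and the body is a trivial placeholder):

```stan
functions {
  // Each shard returns a vector (here, one column of contributions);
  // map_rect concatenates the shard outputs in shard order.
  vector shard_fn(vector phi, vector theta,
                  data array[] real x_r, data array[] int x_i) {
    return to_vector(x_r) * theta[1];  // trivial placeholder body
  }
}
```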

However, when I “repacked” the model and used `map_rect` with 8 threads, the gradient evaluation jumped to 6.2 seconds, roughly an 18x slowdown. Within this setup, more threads *is* better (with 4 threads, gradient eval was ~8 seconds), but it comes nowhere near the efficiency of the non-multithreaded model. Aside from passing the data back and forth, there are no additional operations in the multithreaded version relative to the non-multithreaded one.
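
To make the repacking concrete, here is roughly the shape of the `map_rect` version, as a minimal self-contained sketch (the names `col_contrib`, `theta_shard`, etc. are mine, and the per-column computation is again a trivial stand-in, not the real model):

```stan
functions {
  // One shard = one predictor's column. phi carries shared parameters
  // (none in this sketch), theta this shard's parameter, x_r/x_i its data.
  vector col_contrib(vector phi, vector theta,
                     data array[] real x_r, data array[] int x_i) {
    return to_vector(x_r) * theta[1];  // stand-in for the expensive work
  }
}
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y;
}
transformed data {
  vector[0] phi;               // no shared parameters in this sketch
  array[K, N] real x_r;        // one shard (row) per column of X
  array[K, 0] int x_i;         // no integer data needed here
  for (k in 1:K)
    x_r[k] = to_array_1d(col(X, k));
}
parameters {
  vector[K] theta;
  real<lower=0> sigma;
}
transformed parameters {
  array[K] vector[1] theta_shard;
  vector[N * K] flat;
  vector[N] mu;
  for (k in 1:K)
    theta_shard[k] = [theta[k]]';
  // map_rect concatenates the K returned columns into one long vector.
  flat = map_rect(col_contrib, phi, theta_shard, x_r, x_i);
  mu = rep_vector(0, N);
  for (k in 1:K)
    mu += flat[((k - 1) * N + 1):(k * N)];
}
model {
  theta ~ std_normal();
  sigma ~ exponential(1);
  y ~ normal(mu, sigma);
}
```

Note the size of the round trip in this layout: every gradient evaluation sends each shard’s parameters out and brings an N * K-element result back, which is exactly the back-and-forth traffic in question.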

To test how much of an impact passing the data back and forth has, I wrote a toy model to illustrate the issue. It’s all in the notebook here.

This is on a Mac. I used the release candidate of `cmdstanr` to compile the model. I haven’t done anything with MPI and I don’t know how to! (Side note: the Stan wiki page on MPI parallelism that is linked to in several places seems to be dead, and I have not found an alternative.)

I am wondering if there is anything that can be done!