I want to use the nice reduce_sum implementation to parallelize the likelihood evaluation of my model. In this process, I recognized that two different implementations, that give the same result, have very different run times when many threads are used. Unluckily the slower implementation is much nicer and in some cases the only possible implementation without moving part of the likelihood evaluation outside of the partial_sum function. I wonder why the one implementation is so much faster and if there is a way to get the other implementation to the same speed.

Below is the code for the redcard example (as given in https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html). I will refer to this in the following as â€śCase Aâ€ť. Here the vector â€śbeta[1] + beta[2] * rating[start:end]â€ť is inserted directly into the binomial_logit_lpmf function. Another way would be to define a vector res, set it to beta[1] + beta[2] * rating[start:end] and insert this res vector in the binomial_logit_lpm function (â€śCase Bâ€ť from here on)

**Redcard Example Case A**

```
functions {
real partial_sum(int[] slice_n_redcards,
int start, int end,
int[] n_games,
vector rating,
vector beta) {
return binomial_logit_lpmf(slice_n_redcards |
n_games[start:end],
beta[1] + beta[2] * rating[start:end]); }
}
data {
int<lower=0> N;
int<lower=0> n_redcards[N];
int<lower=0> n_games[N];
vector[N] rating;
int<lower=1> grainsize;
}
parameters {
vector[2] beta;
}
model {
beta[1] ~ normal(0, 10);
beta[2] ~ normal(0, 1);
target += reduce_sum(partial_sum, n_redcards, grainsize,
n_games, rating, beta);
}
```

**Redcard Example Case B**

```
functions {
real partial_sum(int[] slice_n_redcards,
int start, int end,
int[] n_games,
vector rating,
vector beta) {
vector[end-start+1] res;
res = beta[1]+beta[2]*rating[start:end];
return binomial_logit_lpmf(slice_n_redcards | n_games[start:end], res);
}
}
```

Where in Case B only the partial sum function is different. The rest is the same like in Case A.

When I test the run time for both cases, I see that when I am using one thread per chain, both cases take about 2:40 for 1000 warm-ups and 1000 samples, but if I use a lot of threads (namely 48 in this example), the run time for Case A is about 10 seconds and for Case B it is about 45 seconds. So defining and setting the res vector before entering it into the binomial_logit_lpmf slows down the calculation significantly when many threads are used.

In the above redcard example it is of course no problem to just use Case A, but in the model I want to fit, the total model is a superposition of different sources. In a very simple case of this model, the total model is simply given like this:

M_{tot}=\Sigma_i c_i \cdot V_i

Where V_i are precalculated vectors for the individual sources and I am only fitting for the normalizations c_i. The number of sources can be up to 30 and then it starts to get impractical to write everything like in Case A for the redcard example. I did the run time comparison also for a simple stan model with 5 sources, with some simulated mock data. So the sources code could be either

**Simple Model Case A**

```
functions {
real partial_sum(int[] counts, int start, int stop,
real[] c, vector[] base_counts, int num_bases){
return poisson_propto_lpmf(counts |
c[1]*base_counts[1][start:stop]+
c[2]*base_counts[2][start:stop]+
c[3]*base_counts[3][start:stop]+
c[4]*base_counts[4][start:stop]+
c[5]*base_counts[5][start:stop]
);
}
}
data {
int<lower=0> datapoints;
int counts[datapoints];
int num_bases;
vector[datapoints] base_counts[num_bases];
int grainsize;
}
parameters {
real<lower=0> c[num_bases];
}
model{
c ~ lognormal(0,1);
target += reduce_sum(partial_sum, counts, grainsize, c, base_counts, num_bases);
}
generated quantities {
int ppc[datapoints];
vector[datapoints] tot;
tot = rep_vector(0.0, datapoints);
for (i in 1:num_bases){
tot += c[i]*base_counts[i];
}
ppc = poisson_rng(tot);
}
```

or **Simple Model Case B**

```
functions {
real partial_sum(int[] counts, int start, int stop,
real[] c, vector[] base_counts, int num_bases){
vector[stop-start+1] model_counts;
model_counts = rep_vector(0.0, stop-start+1);
for (i in 1:num_bases){
model_counts += c[i]*base_counts[i][start:stop];
}
return poisson_propto_lpmf(counts | model_counts);
}
}
```

The runtime analysis is similar compared to the redcard example. For one thread per chain, the runtime is nearly the same (about 8:30 minutes for 500 warm-ups and 300 samples), but if I use 48 threads the run time for case A is 1:30 and for Case B 9:00. So case B with 48 threads is slower than with one thread. In a real fit I have up to 30 of these sources and not always the same number of sources, so using a for loop like in Case B would be much nicer. Also in a more realistic model, I have more complicated sources, for which I do not only fit a normalization. Try to add them directly to the poisson_propto_lpmf function like in Case A is not always possible (or at least very difficult), which means I probably would need to move parts of the calculation outside of the partial_sum function, to avoid defining a vector in the partial_sum function, which is used to sum all the different source contributions together.

Is this large run time difference, when many threads are used, due to memory allocation for the model_counts vector, that has to be done by every thread?

And is there a way to make Case B as fast as Case A, even for many threads?