Vectorisation vs. parallelisation

I would like to understand at what point vectorisation stops giving a performance advantage.

Some examples:

  • This clearly gives a great speedup:

int[N] ~ neg_binomial(real, real);

  • What about this? (where real[N] is some transformed parameter of a parameter space of dimension < N)

int[N] ~ neg_binomial(real[N], real);

  • And what about this? (where real[N] is some transformed parameter of a parameter space of dimension < N)

int[N] ~ neg_binomial(real[N], real[N]);

All of them give a speedup compared to calling neg_binomial N times in a loop, because each call entails memory allocation and adds nodes to the autodiff expression tree.
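For example, assuming data int y[N], a parameter vector alpha of length N, and a scalar beta (names chosen here for illustration), the two forms look like this:

// Looped: N separate calls, each allocating memory and adding
// its own nodes to the autodiff expression tree.
for (n in 1:N)
  y[n] ~ neg_binomial(alpha[n], beta);

// Vectorized: a single call over the whole array, so shared work
// is done once and far fewer autodiff nodes are created.
y ~ neg_binomial(alpha, beta);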


Thanks Ben,

For parallelization, @wds15, do you think that as a general rule (if one exists), with N very large (~10,000), it is better to use:

  • This, with a good grouping of the M groups into shards (a fuller sketch follows after this list)
int y[M, N];
vector[M] mu;

...

map_rect( {
  ...
  for (m in 1:M)
    y[m] ~ neg_binomial(mu[m], sigma);
  ...
});
  • Or this:
int y_linearised[M * N];
vector[M * N] mu_mapped_on_y_linearised;
...

y_linearised ~ neg_binomial(mu_mapped_on_y_linearised, sigma);
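To make the first option concrete, here is a minimal sketch of how the sharded likelihood could be written with map_rect, assuming one mean per group and illustrative names (shard_lp, theta, x_r, x_i) that are not from the thread:

functions {
  // Log-likelihood of one shard: x_i holds that shard's N counts,
  // theta[1] is the shard-specific mu, phi[1] is the shared sigma.
  vector shard_lp(vector phi, vector theta, real[] x_r, int[] x_i) {
    return [neg_binomial_lpmf(x_i | theta[1], phi[1])]';
  }
}
data {
  int<lower=1> M;
  int<lower=1> N;
  int y[M, N];
}
transformed data {
  real x_r[M, 0];           // no real-valued data per shard
  int x_i[M, N] = y;        // one row of y per shard
}
parameters {
  vector<lower=0>[M] mu;    // one mean per group (assumed)
  real<lower=0> sigma;
}
transformed parameters {
  vector[1] theta[M];       // shard-specific parameter packs
  for (m in 1:M)
    theta[m] = [mu[m]]';
}
model {
  // phi = [sigma]' is shared by all shards; map_rect evaluates the
  // shards and returns one log-likelihood term per shard.
  target += sum(map_rect(shard_lp, [sigma]', theta, x_r, x_i));
}

Note that map_rect only runs shards in parallel when the model is built with STAN_THREADS or MPI support; otherwise it evaluates them serially.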

My guess is that with N = 10^4 you will still be faster with the second approach, without map_rect, but with larger N the map_rect solution will at some point become faster.

The parallel reduce I am working on would speed this kind of thing up fully automatically (you would then get automatic sharding and vectorization), but it will take a while until that is ready (after Stan 3 at the earliest).
