Parallel v3 map

Hi!

@bbbales2 what do you think about the map function I proposed on the other PR with this signature:

return_t mapped_function(int i, elem_t element_i, args...);

std::vector<return_t> map(mapped_function, std::vector<elem_t> large_container, args...);

I have started to look more into brms and it dawns on me that brms stuff would benefit a lot from a map facility. So what do you think of this signature? Are your tuple utilities matured to the extent that I can use them? Should I branch off the ODE tuple prototype to code this up in a first go, or do you have any other suggestions?

I am relatively sure that the building blocks we have in place should allow for a map, which is more handy to have in a few situations. The key to making it efficient is to let the mapped large_container be an array of an arbitrary data structure of any complexity. That way there do not need to be many shared arguments, which are costly since they get copied quite a bit.
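
To make this a bit more concrete, here is a minimal serial sketch of that signature (illustrative names only, no TBB scheduling and no tuple machinery for the shared arguments yet; a parallel version would split the index range over the TBB instead of using the plain loop):

#include <cstddef>
#include <vector>

// Serial reference sketch of the proposed map: apply
// f(i, element_i, args...) to every element of large_container and collect
// the results into a std::vector.
template <typename F, typename Elem, typename... Args>
auto map(F&& f, const std::vector<Elem>& large_container, const Args&... args) {
  using return_t = decltype(f(0, large_container[0], args...));
  std::vector<return_t> results;
  results.reserve(large_container.size());
  for (std::size_t i = 0; i < large_container.size(); ++i) {
    // each call sees its index, its own element, and the shared arguments
    results.push_back(f(static_cast<int>(i), large_container[i], args...));
  }
  return results;
}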

BTW, just to make sure, we are right now waiting for parser gurus to add reduce_sum to the stanc3 world, right?

Sebastian

Would it be possible to get a hypothetical example of a Stan model with a parallel reduce call and what the underlying hpp would look like (preferably a working hpp file, but the C++ signature might suffice)?

The wiki contains an example at the bottom (stan + C++).

Here is an example which is in line with the latest code, so use that:

poisson-hierarchical-scale-v3F.stan (3.8 KB)

parallel_reduce-v3F.hpp (3.0 KB)


Thanks! Didn't follow everything closely, so I missed this.

No worries… thanks a lot for taking this up. I am really keen on moving this forward (but I don’t want to stress anyone here, of course).


The example you posted here is just for reduce, no?

And yeah, wrt brms, it feels like if we can cover a lot of generic brms models, we'll cover a lot of the standard models Stan is used for (and then the derivatives of those).

Yes. The example posted is for reduce_sum. The map function I suggest above exists only in my mind right now… sorry for muddling this…

If return_t is not a scalar, then things aren’t efficient unless we parallelize reverse mode.

Aki wanted something like this for gp functors.

I guess what is the motivation in this case?

The signature above is meant to return a C++ std::vector, so it should be fine efficiency-wise (but I would not bother with that anyway, given that we want to parallelise big chunks of work).

The motivation is driven by the structure of brms programs. These usually first compute the mean mu for each data item and then pass that into the log-likelihood. However, in my programs it is exorbitantly expensive to calculate the mean in the first place (which can mean solving an ODE for every patient in my data set). So given the brms structure, my programs would benefit a lot from being able to calculate the mean vector mu using a parallel map.

If we had a map, then we could instruct brms to call a custom-written non-linear function to compute the mean. That function would then fire off the work onto different cores with a map.

I don’t quite see how you can combine computing mu and the log-likelihood in one go with brms - and I do think that there will be cases where a map is simply easier to deal with.

Well Paul was hesitant to throw mu into the calculations the last time we asked him, but specifically the type of model I want to parallelize looks like: https://discourse.mc-stan.org/uploads/short-url/lZnGSFzTlTcrfNIIfuRn1Rpu2vq.stan

This is sort of the simplest type of brms model so I could understand if this wouldn’t generalize, but it’s the thing I’m interested in.

The code in question is:

model {
  vector[N] mu = temp_Intercept + Xc * b;
  for (n in 1:N) {
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] +
      r_2_1[J_2[n]] * Z_2_1[n] +
      r_3_1[J_3[n]] * Z_3_1[n] + r_3_2[J_3[n]] * Z_3_2[n] +
      r_4_1[J_4[n]] * Z_4_1[n] +
      r_5_1[J_5[n]] * Z_5_1[n] +
      r_6_1[J_6[n]] * Z_6_1[n] + r_6_2[J_6[n]] * Z_6_2[n] +
      r_7_1[J_7[n]] * Z_7_1[n] + r_7_2[J_7[n]] * Z_7_2[n] +
      r_8_1[J_8[n]] * Z_8_1[n] +
      r_9_1[J_9[n]] * Z_9_1[n] +
      r_10_1[J_10[n]] * Z_10_1[n] +
      r_11_1[J_11[n]] * Z_11_1[n] + r_11_2[J_11[n]] * Z_11_2[n];
  }

  ... // I removed the priors cause I don't think those parallelize

  // likelihood including all constants
  if (!prior_only) {
    target += binomial_logit_lpmf(Y | trials, mu);
  }
}

I want to try replacing it with:

target += reduce_sum(custom_parallel_lpmf, 5000, Y, trials, temp_Intercept,
                     Xc, b,
                     J_1, J_2, ...,
                     r_1_1, Z_1_1, r_1_2, Z_1_2, ...);

Or maybe, if that matrix-vector multiply is fast enough, I can just do:

target += reduce_sum(custom_parallel_lpmf, 5000, temp_Intercept + Xc * b, Y, trials, 
                     J_1, J_2, ...,
                     r_1_1, Z_1_1, r_1_2, Z_1_2, ...);

Once I know how this works then I’ll be able to say if I need anything more.
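
For orientation, here is a rough sketch of the kind of partial-sum functor I imagine custom_parallel_lpmf being in the hpp. Treat it as a guess only: the exact reduce_sum functor interface is still being worked out, the 0-based start convention is mine, and it uses the second variant where mu is precomputed and shared.

#include <stan/math.hpp>
#include <vector>

// Guessed shape, not generated code: each worker gets a slice of Y that
// starts at 0-based position `start` in the full data; trials and the
// precomputed mu are shared arguments.
struct custom_parallel_lpmf {
  template <typename T>
  T operator()(const std::vector<int>& y_slice, std::size_t start,
               const std::vector<int>& trials,
               const Eigen::Matrix<T, Eigen::Dynamic, 1>& mu) const {
    T lp = 0;
    for (std::size_t i = 0; i < y_slice.size(); ++i) {
      lp += stan::math::binomial_logit_lpmf(y_slice[i], trials[start + i],
                                            mu(start + i));
    }
    return lp;
  }
};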

Adding this map thing would mean we could write programs like:

a = parallel_stuff();
b = stuff_that_has_to_be_serial(a);
c = more_parallel_stuff(b);
...

But I don’t need that so I’m hesitant to jump on board with another parallelization mechanism.

Well… I would be interested as well… I guess the short-term solution is to modify the brms source code by hand, right? But what’s the long-term goal? I mean, is the idea to convince @paul.buerkner to go down this route and extend brms so that it eventually auto-generates code that uses reduce_sum?

The fastest way will always be the reduce_sum thing. One of my intentions is also to let people write their code the way they naturally come up with it… and some things are much easier expressed with a map. My thought was to just give it a try and see how far I get. So should I branch off the ODE thing in case I do?

I am happy to support reduce_sum directly in brms; I just expressed my concern that the most efficient way of writing this (i.e., putting everything into the reduce function) will likely not make it into brms.

@andrewgelman has 3 large hierarchical models he wants to be fast. So my goal is to evaluate to what extent we can make those faster.

I don’t really have long-term goals on this. I might get some once I see how good/bad reduce_sum works for those three models, but we’ll see.

I’ll defer to @paul.buerkner on brms. Really I’m speaking about those three models, one of which is code-gen’d with brms.

Without the reduction I don’t think this would work so well. The downside is that you then have to store a huge Jacobian. In the regression cases, I might have 50000 log density terms to be accumulated and 500 parameters.
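
To put rough numbers on that (my arithmetic, double precision assumed): a dense 50000 × 500 Jacobian is 50000 × 500 × 8 bytes ≈ 200 MB per evaluation, whereas summing inside reduce_sum only ever needs roughly 500 sensitivities per partial sum.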

Ofc. there might be other applications, but I think we should dig them up before we jump on anything.

Other things down this alley:

  1. Can reduce_sum work with higher order autodiff?
  2. Parallel reverse mode

edit: Fixed sentence to make it coherent

Also I totally made up that 500 number. Practically somewhere between 50 and 1000 though, more on the low side.

For Ben’s example here

If we first get “vectorized ops” faster and have brms do

  mu += r_1_1[J_1] .* Z_1_1 + r_1_2[J_1] .* Z_1_2 +
    r_2_1[J_2] .* Z_2_1 +
    r_3_1[J_3] .* Z_3_1 + r_3_2[J_3] .* Z_3_2 +
    r_4_1[J_4] .* Z_4_1 +
    r_5_1[J_5] .* Z_5_1 +
    r_6_1[J_6] .* Z_6_1 + r_6_2[J_6] .* Z_6_2 +
    r_7_1[J_7] .* Z_7_1 + r_7_2[J_7] .* Z_7_2 +
    r_8_1[J_8] .* Z_8_1 +
    r_9_1[J_9] .* Z_9_1 +
    r_10_1[J_10] .* Z_10_1 +
    r_11_1[J_11] .* Z_11_1 + r_11_2[J_11] .* Z_11_2;

Could we restrict map to just replacing for loops with something like:


  // Imagine in C++ that elewise_mul_add recursively multiplies vectors
  // elementwise and accumulates the products:
  template <typename Vec1, typename Vec2>
  auto elewise_mul_add(const Vec1& v1, const Vec2& v2) {
    return v1.array() * v2.array();  // base case: just multiply the last pair
  }
  template <typename Vec1, typename Vec2, typename... Args>
  auto elewise_mul_add(const Vec1& v1, const Vec2& v2, const Args&... args) {
    return v1.array() * v2.array() + elewise_mul_add(args...);
  }

Then use map like:

  mu += map(elewise_mul_add, r_1_1[J_1], Z_1_1, r_1_2[J_1], Z_1_2,
    r_2_1[J_2], Z_2_1,
    r_3_1[J_3], Z_3_1, r_3_2[J_3], Z_3_2,
    r_4_1[J_4], Z_4_1,
    r_5_1[J_5], Z_5_1,
    r_6_1[J_6], Z_6_1, r_6_2[J_6], Z_6_2,
    r_7_1[J_7], Z_7_1, r_7_2[J_7], Z_7_2,
    r_8_1[J_8], Z_8_1,
    r_9_1[J_9], Z_9_1,
    r_10_1[J_10], Z_10_1,
    r_11_1[J_11], Z_11_1, r_11_2[J_11], Z_11_2);

Is that possible? Another option here is that, instead of using the TBB in map for elementwise functions, we could expose GPU functions. We’re still working on the reverse-mode elementwise operations for vectors and matrices, but we could probably do something like the below in map:

#include <utility>

template <typename Func, typename... Args>
auto map(Func&& func, Args&&... args) {
#ifdef STAN_OPENCL
  // defer to the matrix_cl functions/constructors and run on the GPU
  return map_impl(func, to_matrix_cl(std::forward<Args>(args))...);
#else
  return map_impl(func, std::forward<Args>(args)...);
#endif
}

Sounds all cool… I think I found out what slows down my brms program, and a map won’t help. I will likely end up with lots of stanvar calls and unfortunately bypass quite a bit of the brms facilities. Hopefully that can help get this pattern into brms at some point.

I would still think a map will be useful (if used with few shared arguments).

Anything in particular? It feels like a map over loops and reduce_sum over the lpdf would probs help brms stuff while being generic-ish enough for a good lot of models.

Well, brms has the bad habit of copying all the parameters into vectors whose size corresponds to the data set you analyse. That is not very efficient when many of these are redundant copies, as is the case in my application. Avoiding that should give a decent speedup… I thought… but right now I am a little confused.

But yeah… for loops should be replaced with maps, of course.

In this model I don’t think it’s obvious why you’d break it into a map and a reduce. reduce_sum already has a map stage.

I’d redirect this as a question to @paul.buerkner: would it be easier to add generic code with parallelism if you could break up some sections into just maps and others into reduce_sum? An alternative question is: what sort of signature would you be looking for to add parallelism to brms?