Stanc3 parallel reduce_sum

I am opening another thread as the parallel autodiff v3 one seems long enough already. This one is oriented towards trying to add stanc3 support for reduce_sum. I am far from a stanc3 guru, but love to learn new stuff, especially with a real task like this.

I took the model from https://github.com/stan-dev/math/wiki/Parallel-reduce_sum-project-page and trimmed it down to a basic example https://gist.github.com/rok-cesnovar/832f9752fc2a8ba7ae7b3da3eb5f2743 (EDIT: see revision 1)

So the first request would be to someone to double check that I trimmed it correctly. Is this how this was intended for use?

For now I managed to get stanc3 to output the following hpp file https://gist.github.com/rok-cesnovar/3fd4f2326a667cd9e31c7e14a6ee32f5 (EDIT: see revision 1)

I havent setup cmdstan with the reduce_sum branch yet to try if it compiles and all (will do that tomorrow).

If someone skimms through the hpp to let me know if this is close to what was expected or if I completely missed the point.

Thanks.

cc @wds15, @bbbales2, @stevebronder

2 Likes

Right now the arguments go:

shared_array, grainsize, stream, ...

Like:

template <typename ReduceFunction, typename Vec, require_vector_like_t<Vec>...,
          typename... Args>
auto reduce_sum(const Vec& vmapped, std::size_t grainsize, std::ostream* msgs,
                const Args&... args)

The code you generated looks fine, just the arguments look like:

grainsize, y, ..., stream

I don’t mind if grainsize and y get swapped around, but stream needs to go before the variadic bit for the C++ type deduction to work properly.

1 Like

Also this looks awesome I am so excited to try this thanks!

1 Like

Thanks. Good call on the stream and variadic stuff.
I am almost sure this is missing one of the functors from https://discourse.mc-stan.org/uploads/short-url/5hKzxFfjOZG9cyX3Jz09olvCPjO.hpp. My first goal here is to see what is missing in this initial generated code to get it to compile and run.

Also is the example Stan code here correct then?

that example goes grainsize, y ...

So this Stan code:

  real lpmf = reduce_sum(hierarchical_reduce, grainsize, y, log_lambda_group, gidx);

Compiles to:

        lpmf = reduce_sum<hierarchical_reduce_functor__>(grainsize, y,
                 log_lambda_group, gidx, pstream__);

And this:

real hierarchical_reduce(int start, int end, int[] y_slice, real[] log_lambda_group, int[] gidx) {
    return poisson_log_lpmf(y_slice| log_lambda_group[gidx[start:end]]);   
  }

Compiles to:

template <typename T3__>
typename boost::math::tools::promote_args<T3__>::type
hierarchical_reduce(const int& start, const int& end,
                    const std::vector<int>& y_slice,
                    const std::vector<T3__>& log_lambda_group,
                    const std::vector<int>& gidx, std::ostream* pstream__) {
  using local_scalar_t__ = typename boost::math::tools::promote_args<T3__>::type;
  const static bool propto__ = true;
  (void) propto__;
  
  try {
    current_statement__ = 43;
    return poisson_log_lpmf<false>(y_slice,
             rvalue(log_lambda_group,
               cons_list(
                 index_multi(rvalue(gidx,
                               cons_list(index_min_max(start, end),
                                 nil_index_list()), "gidx")),
                 nil_index_list()), "log_lambda_group"));
  } catch (const std::exception& e) {
    stan::lang::rethrow_located(e, locations_array__[current_statement__]);
      // Next line prevents compiler griping about no return
      throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***"); 
  }
  
}

struct hierarchical_reduce_functor__ {
template <typename T3__>
typename boost::math::tools::promote_args<T3__>::type
operator()(const int& start, const int& end, const std::vector<int>& y_slice,
           const std::vector<T3__>& log_lambda_group,
           const std::vector<int>& gidx, std::ostream* pstream__)  const 
{
return hierarchical_reduce(start, end, y_slice, log_lambda_group, gidx,
         pstream__);
}
};

I think that’s all you should need.

I think that Stan code is from earlier prototypes. I’ve only been working with the C++.

1 Like

But other than the order of arguments the Stan model looks good?

I assume you’ll get a bajillion syntax errors cause that’s what compilers do, but if I was measuring this code in inches, I’d say yes, it’s likely all there.

Ok, take 2:

model: https://gist.github.com/rok-cesnovar/832f9752fc2a8ba7ae7b3da3eb5f2743
hpp: https://gist.github.com/rok-cesnovar/3fd4f2326a667cd9e31c7e14a6ee32f5

changes:

  • order of shared_array, grainsize
  • fixed stream positioning

Something like this should also compile:

functions {
  real my_func(int start, int end, real[] y_slice, real mu, real sigma) {
    return normal_lpdf(y_slice | mu, sigma);
  }
}

parameters {
  real a[5];
}

model {
  target += reduce_sum(my_func, a, 1, 0.0, 1.0));
}

And be directly comparable to:

parameters {
  real a[5];
}

model {
  a ~ normal(0, 1);
}

Thanks for the simpler model.

That produces https://gist.github.com/rok-cesnovar/6588718f07d63c0ecba25ad7f3fe900d

I will get this up on a branch tomorrow, sorry for these awful gist links.

Move the pstream argument inside the functor arguments as well (but don’t move in the actual call to my_func!). Change:

struct my_func_functor__ {
template <typename T2__, typename T3__, typename T4__>
typename boost::math::tools::promote_args<T2__, T3__,
T4__>::type
operator()(const int& start, const int& end,
           const std::vector<T2__>& y_slice, const T3__& mu,
           const T4__& sigma, std::ostream* pstream__)  const 
{
return my_func(start, end, y_slice, mu, sigma, pstream__);
}
};

To:

struct my_func_functor__ {
template <typename T2__, typename T3__, typename T4__>
typename boost::math::tools::promote_args<T2__, T3__,
T4__>::type
operator()(const int& start, const int& end,
           const std::vector<T2__>& y_slice, std::ostream* pstream__, const T3__& mu,
           const T4__& sigma)  const 
{
return my_func(start, end, y_slice, mu, sigma, pstream__);
}
};

But yeah just start compiling this tomorrow and sorting through the errors. Looks all there to me but I could be wrong.

1 Like

I manually fixed the arguments in the functor like you proposed above and it compiled on first try :) might as well go to the lottery today haha

Great start to this.

3 Likes

I have this running now for up to 3 additional arguments (+ container) of any type. Will make a WIP PR for stanc3 later in the afternoon. Will also post a linux binary so you can test this out in cmdstan.

Will probably need some assistance to make this signature applicable to any number of argument. But its a start I guess.

Did you even manage to make the 3 arguments optional or do we have to use 3 arguments?

Awesome!

Thanks so much… this will make it a lot easier to benchmark this with a few more models.

It supports 0 (not sure its a use case), 1, 2 or 3 arguments. I am not supporting all types currently (like vector[, , , , ,]) but it should cover most.

1 Like

Update, WIP PR open https://github.com/stan-dev/stanc3/pull/451 with the linux stanc3 binary posted in a comment.

The normal model compiles, the cleaned up example doesnt yet and I am not sure exactly why. Any comments welcome.

Also, if anyone has some additional example models I could test this that would be fantastic.

Whow! Thanks so much for this! I was able to make the poisson benchmark model use the new compiler. poisson-hierarchical-scale-v3-stanc3-A.stan (4.0 KB)

It is so cool not having to fiddle around with C++, but just compile the stan model with the magic new stanc3 compiler.

I used docker on my Mac to do the stanc3 compilation and then everything was run on my MacOS Catalina.

We should start thinking about how we merge this into the repositories. So finalize the interface, possibly split up the math PRs into two stages (tuple tools + reduce_sum), and then the stanc3 parser.

The results are very nice:

grainsize = 0 speedup

grainsize=0 walltime

grainsize >0 speedup

grainsize > 0 walltime

2 Likes