Parallel autodiff v4

I’m afraid not. The plan for the future is to add versions of _lpdf that allow dropping the normalizing constants.

I wanted to test the new parallel stuff on one of my “production” models, the one in the R package OncoBayes2. Inference for this model easily gets slow when users switch on the so-called EXNEX functionality, which means the model has to integrate over a large mixture model space. The runtimes for this model are critical to me, as we run it with SBC and do operating-characteristics runs (per scenario we have 1k simulated trials and per trial 5-10 model runs… and easily 5-10 scenarios). Since we throw this onto our cluster with >1.5k cores I don’t care about perfect efficiency, and the results are really promising (500 warmup, 500 iterations):

  • 1 thread: 44s
  • 2 threads: 36s = 1.2x faster
  • 3 threads: 23s = 1.9x faster

As I only have 3 groups defined in the toy example it only makes sense to use 3 cores, not more.

Porting the very complex program to use reduce_sum was really easy (big thanks to @rok_cesnovar for supporting any number of arguments…this is what we need). So overall, I am really happy with it.
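
To make concrete what “any number of arguments” buys you, here is a toy, self-contained sketch of the pattern in plain standard C++ (this is not the actual stan::math::reduce_sum signature; the names are just illustrative): the partial-sum callable receives its slice plus whatever shared arguments you pass through, so nothing has to be packed into one fixed-shape container.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Toy stand-in for the reduce_sum idea: chop the sliced argument into
    // chunks and hand each chunk, together with any number of shared
    // arguments, to the user's partial-sum callable. The real thing runs
    // the chunks in parallel; this sketch just loops over them.
    template <typename PartialSum, typename T, typename... Args>
    double toy_reduce_sum(PartialSum&& partial_sum, const std::vector<T>& sliced,
                          std::size_t grainsize, const Args&... shared_args) {
      double total = 0;
      for (std::size_t start = 0; start < sliced.size(); start += grainsize) {
        const std::size_t end = std::min(start + grainsize, sliced.size());
        std::vector<T> slice(sliced.begin() + start, sliced.begin() + end);
        total += partial_sum(slice, start, end, shared_args...);
      }
      return total;
    }

    int main() {
      std::vector<double> y(12, 1.0);
      const double mu = 0.5;
      std::vector<double> weights(12, 2.0);

      // The partial sum sees its slice, the start/end offsets into the full
      // container, and however many shared arguments were passed along.
      auto partial = [](const std::vector<double>& y_slice, std::size_t start,
                        std::size_t end, double mu_,
                        const std::vector<double>& w) {
        double s = 0;
        for (std::size_t i = start; i < end; ++i)
          s += w[i] * (y_slice[i - start] - mu_);
        return s;
      };

      std::cout << toy_reduce_sum(partial, y, 4, mu, weights) << "\n";  // prints 12
    }

In the real implementation the chunking and scheduling are of course handled by the TBB scheduler and driven by the grainsize; the sketch is only about the variadic pass-through of the shared arguments.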

Here is the code for it: blrm_exnex_rs.stan (29.1 KB) combo3_complex.data.R (3.9 KB) combo3.R (2.8 KB)

@bbbales2 which example bothers you the most? I can have a look at it… but I really think we should push this into develop now. It’s absolutely ready.

The above was run on macOS… as before, I used Docker to make use of the Linux stanc3 binary from @rok_cesnovar. This time I also worked out a small bash script that automates the Docker call:

    #!/bin/bash

    # Stan file to compile, passed as the first argument.
    STAN="$1"

    echo "$STAN"
    echo "$PWD"

    # Mount the current directory into the container and run the Linux
    # stanc binary (sitting next to this script) on the Stan file.
    docker run --mount type=bind,source="$(pwd)",target=/app ubuntu /app/stanc "/app/$STAN"

This is super convenient: you only have to copy the stanc binary into the same directory as the script above, and then you can use it with any Stan program lying in one of the sub-directories. I named the script compile-stan.sh and keep the stanc binary in the same directory, so I can do

    ./compile-stan.sh blrm_exnex_rs.stan

on my Mac. Docker is really magic! The ubuntu image is just the latest official Ubuntu Docker image.


Thanks!

Agreed. I’m really happy with how the interface usability is turning out.

Any of them. The amount of work in all of them can scale to be very large.

The one here (base.stan) probably can scale the most. It’s a basic hierarchical model. I forget how many parameters it has.

The nbme problem (in the first post) has a large number of shared parameters (like 800).

The bpl problem Andre posted (here) can scale arbitrarily. It has a small number of parameters but more computation.

(The first two are pretty realistic examples of the types of models I hoped we could accelerate with this framework)

Edit: And the third seems somehow easier since there’s less memory access – though there’s always memory being used up by autodiff ops.

@bbbales2 I’m running your perf tests right now and wanted to sanity-check that the below is the right way to get the time for count_lpdf in your benchmark with no threading. Or would it be better to just set the number of threads to zero?

Asking because I ran it and, relative to one thread, it looks pretty good! It scales surprisingly well with the number of threads I throw at it.


Yeah that test scales well.

Could be a bug in the test (not testing what I think it is).

Presumably we could write a test that does something that scales badly (like the actual models do) and we’d have some clue about what’s going wrong in the actual models.

I just set the number of threads to 1 to compare to single core.

So I ran the bpl example under perf (on bpl_parallel) and did a head tilt when it showed the below in the call stack:

    17.71%     0.62%  bpl_parallel  bpl_parallel         [.] std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::~vector
            |          
            |--17.09%--std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::~vector
            |          |          
            |          |--13.43%--_int_free
            |          |          
            |           --2.97%--__GI___libc_free (inlined)

So 17% of the time is spent creating and destroying that std::vector<std::vector<int>>, since it goes through deep_copy, which walks the outer std::vector and copies (and later frees) each of its N inner vectors.

    STAN_NUM_THREADS=32 sudo STAN_NUM_THREADS=32 perf record -g --freq=max --call-graph dwarf -d --phys-data --per-thread ./examples/bpl/bpl_parallel data file='./examples/bpl/bpl.data.R' sample

    sudo perf report --call-graph --stdio -G


I’ll have a branch soon that does some better memory stuff, but to be honest I think the majority of the fix is doing

    template <typename T, typename = require_arithmetic_t<scalar_type_t<T>>>
    inline decltype(auto) deep_copy(T&& arg) {
      return std::forward<T>(arg);
    }

instead of

    template <typename T, typename = require_arithmetic_t<scalar_type_t<T>>>
    inline T deep_copy(const T& arg) {
      return arg;
    }

in reduce_sum

At least on the bpl example, it went from 2x faster to 4x faster.


Niiiiiiice! Gogogogogo

Edit: Actually, Amaaaaazing! Nice is an understatement

That’s really interesting. For a newbie to perfect forwarding, where did the big performance hit come from?

The example here is sort of a cheap trick to force return value optimization (RVO). Normally C++ doesn’t allow RVO for parameter values, but in this instance we know that if we put in a temp we want it moved out, so we just force it to move if it can and otherwise pass by reference.

    template <typename T, typename = require_arithmetic_t<scalar_type_t<T>>>
    inline decltype(auto) deep_copy(T&& arg) {
      return std::forward<T>(arg);
    }
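
For the perfect-forwarding newbies, here is a tiny self-contained sketch of what that buys you (the function names are mine, not the actual Stan Math ones): the by-value version deep-copies the nested vector on every call, while the forwarding version hands an lvalue straight back as a reference and lets an rvalue be moved.

    #include <cassert>
    #include <utility>
    #include <vector>

    // By-value version: every call copies the argument, including
    // every inner vector.
    template <typename T>
    inline T copy_version(const T& arg) {
      return arg;
    }

    // Forwarding version: an lvalue argument comes back as a reference
    // (no copy at all); an rvalue is forwarded so the caller can move
    // from it instead of copying.
    template <typename T>
    inline decltype(auto) forwarding_version(T&& arg) {
      return std::forward<T>(arg);
    }

    int main() {
      std::vector<std::vector<int>> idxs(1000, std::vector<int>(100));

      auto copied = copy_version(idxs);                   // 1000 inner allocations
      decltype(auto) aliased = forwarding_version(idxs);  // just a reference
      assert(&aliased == &idxs);
      assert(&copied != &idxs);

      // With an rvalue argument the buffers get moved rather than reallocated.
      auto moved = forwarding_version(std::move(idxs));
      (void)copied;
      (void)moved;
    }

Returning decltype(auto) is what makes the lvalue case a pure pass-through while still allowing the move for temporaries; the by-value version forces a copy either way.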

The branch I have does a lot of other stuff related to perfect forwarding, though I need to run perf etc. to see which parts are actually useful.

Huh, cool. Thanks for the explanation!


I will set aside some time this week to finalize and clean up the stanc3 branch. We have about a month until the next release, and I think we should be able to get this in.

Make sure to sync with @rybern. He said something about the variadic functions at the last group meeting.


OK, thanks. I wasn’t able to attend the last few weeks.

@rybern do you plan on working on a general approach for variadic functions? We can open a separate topic or GitHub issue if you feel there is something more to discuss here.

Great!

I also plan to spend some time on the math bits in order to make this ready for the next release if possible, but I don’t want to make any definite commitments on this at the moment.

This is really cool! Thanks a lot for looking into this. I am very curious about your new branch… and if you can boil down the number of changes to keep things simple, that would be awesome.

I had also forgotten, but we should get this moving in design-doc land. I’ll try to write something up tomorrow.