Using reduce_sum for ODE parallelisation

Philipp-Neubauer · March 24, 2021, 11:51pm

Hi wonderful Stan folks -

I’m looking to parallelize an (otherwise slow) ODE model. To this end I installed rstan 2.26.1 dev (& StanHeaders) according to this post; verified that the original (non-parallel) model compiled and ran under that config.

Next step; following the reduce_sum example here I set up the partial sum functions for the likelihoods including the ODE evaluation (there are >3k observation for which to run this). The parallel code is here growth_model_ode_update.stan (12.1 KB) .

Note there is one partial sum for a second likelihood that does not depend on the ODE and does not throw errors…so only the ODE config gives trouble. After a few hours of debugging I finally got to this error message:

Error in stanc(model_code = model_code, model_name = model_name, verbose = verbose,  : 
  0

Semantic error in 'string', line 199, column 12 to line 200, column 157:

Ill-typed arguments supplied to function 'reduce_sum'. Available arguments:
(T[], int, int, ...) => real, T[], int, ...
(T[,], int, int, ...) => real, T[,], int, ...
(T[,,], int, int, ...) => real, T[,,], int, ...
(T[,,,], int, int, ...) => real, T[,,,], int, ...
(T[,,,,], int, int, ...) => real, T[,,,,], int, ...
(T[,,,,,], int, int, ...) => real, T[,,,,,], int, ...
(T[,,,,,,], int, int, ...) => real, T[,,,,,,], int, ...
Where T is any one of int, real, vector, row_vector or matrix.
Instead supplied arguments of incompatible type: (vector, int, int, real, real, real, real, vector, vector, vector, data real[], data real[], data real[], data int[]) => real, vector, int, int, real, real, real, real, vector, vector, vector, real[], real[], real[], int[]

Calls: stan_model -> stanc_builder -> stanc
Execution halted

The error seems the same as discussed here:. Based on taht, I added data types to the partial sum function, but run into the rror that the “data” type is not maintained through the function call (I think), However, given subsequent benchmarks here, I wonder if this should work in some config other than the one I am trying here?

Wondering if @wds15 may have some insights?

Thanks lots in advance!

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C               LC_TIME=en_NZ.UTF-8        LC_COLLATE=en_NZ.UTF-8    
 [5] LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8    LC_PAPER=en_NZ.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.26.1       StanHeaders_2.26.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6         compiler_4.0.4     pillar_1.5.1       prettyunits_1.1.1  tools_4.0.4       
 [6] pkgbuild_1.2.0     jsonlite_1.7.2     lifecycle_1.0.0    tibble_3.1.0       gtable_0.3.0      
[11] pkgconfig_2.0.3    rlang_0.4.10       DBI_1.1.1          cli_2.3.1          rstudioapi_0.13   
[16] parallel_4.0.4     curl_4.3           loo_2.4.1          gridExtra_2.3      dplyr_1.0.5       
[21] generics_0.1.0     vctrs_0.3.6        tidyselect_1.1.0   stats4_4.0.4       grid_4.0.4        
[26] glue_1.4.2         inline_0.3.17      R6_2.4.1           processx_3.4.5     fansi_0.4.2       
[31] purrr_0.3.4        ggplot2_3.3.3      callr_3.5.1        magrittr_2.0.1     codetools_0.2-17  
[36] matrixStats_0.56.0 scales_1.1.1       ps_1.3.3           ellipsis_0.3.1     assertthat_0.2.1  
[41] colorspace_2.0-0   V8_3.4.0           utf8_1.1.4         RcppParallel_5.0.3 munsell_0.5.0     
[46] crayon_1.3.4

Funko_Unko · March 25, 2021, 7:33am

Without any experience whatsoever, the error msg makes it look like the first argument has to be an array. E.g. an array of reals or of vectors, but not a vector.

wds15 · March 25, 2021, 7:59am

Hi!

I just looked into the model and @Funko_Unko is correct. The first argument must a real array instead of a vector.

Sebastian

EDIT: And you should really upgrade to the new variadic ode interface which is 25% faster than the old one! It’s also a lot easier to write down your ODEs with the new interface. So use ode_rk45 instead of integrate_ode_rk45.

Philipp-Neubauer · March 25, 2021, 10:01am

Great - thank you both! that did the trick; checking out the new interface as well - it does look a lot cleaner.

Appreciate the quick help!

Philipp-Neubauer · March 25, 2021, 9:09pm

PS - though the translation in stanc now works, I am stuck with a compile error - are there specific downstream requirements for the new ODE functions?

I should add that this is with a fresh install of RcppEigen_0.3.3.9.1 before installing the rstan 2.26.1 as per post above.

    make cmd is
  make -f '/usr/lib/R/etc/Makeconf' -f '/usr/share/R/share/make/shlib.mk' CXX='$(CXX14) $(CXX14STD)' CXXFLAGS='$(CXX14FLAGS)' CXXPICFLAGS='$(CXX14PICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX14LDFLAGS)' SHLIB_LD='$(SHLIB_CXX14LD)' SHLIB='file8617bae0a63.so' OBJECTS='file8617bae0a63.o'

make would use
g++ -std=gnu++14 -I"/usr/share/R/include" -DNDEBUG   -I"/usr/local/lib/R/site-library/Rcpp/include/"  -I"/usr/local/lib/R/site-library/RcppEigen/include/"  -I"/usr/local/lib/R/site-library/RcppEigen/include/unsupported"  -I"/usr/local/lib/R/site-library/BH/include" -I"/usr/local/lib/R/site-library/StanHeaders/include/src/"  -I"/usr/local/lib/R/site-library/StanHeaders/include/"  -I"/usr/local/lib/R/site-library/RcppParallel/include/"  -I"/usr/local/lib/R/site-library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DUSE_STANC3 -DSTRICT_R_HEADERS  -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -DBOOST_NO_AUTO_PTR  -include '/usr/local/lib/R/site-library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1     -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-V28x5H/r-base-3.6.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c file8617bae0a63.cpp -o file8617bae0a63.o
if test  "zfile8617bae0a63.o" != "z"; then \
  echo g++ -std=gnu++14 -shared -L"/usr/lib/R/lib" -Wl,-Bsymbolic-functions -Wl,-z,relro -o file8617bae0a63.so file8617bae0a63.o  '/usr/local/lib/R/site-library/rstan/lib//libStanServices.a' -L'/usr/local/lib/R/site-library/StanHeaders/lib/' -lStanHeaders -L'/usr/local/lib/R/site-library/RcppParallel/lib/' -ltbb   -L"/usr/lib/R/lib" -lR; \
  g++ -std=gnu++14 -shared -L"/usr/lib/R/lib" -Wl,-Bsymbolic-functions -Wl,-z,relro -o file8617bae0a63.so file8617bae0a63.o  '/usr/local/lib/R/site-library/rstan/lib//libStanServices.a' -L'/usr/local/lib/R/site-library/StanHeaders/lib/' -lStanHeaders -L'/usr/local/lib/R/site-library/RcppParallel/lib/' -ltbb   -L"/usr/lib/R/lib" -lR; \
fi
Error in compileCode(f, code, language = language, verbose = verbose) : 
  /usr/local/lib/R/site-library/RcppEigen/include/Eigen/src/Core/ProductEvaluators.h:35:90:   required from ‘Eigen::internal::evaluator<Eigen::Product<Lhs, Rhs, Option> >::evaluator(const XprType&) [with Lhs = Eigen::Product<Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<double, double>, const Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<double>, const Eigen::Matrix<double, 1, -1> >, const Eigen::Transpose<Eigen::Matrix<double, -1, 1> > >, Eigen::Matrix<double, -1, -1>, 0>; Rhs = Eigen::Matrix<double, -1, 1>; int Options = 0; Eigen::internal::evaluator<Eigen::Product<Lhs, Rhs, Option> >::XprType = Eigen::Product<Eigen::Product<Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<double, double>, const Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<double>, const Eigen::Matrix<double, 1, -1> >, const Eigen::Transpose<Eigen::Matrix<double, -1, 1> > >, Eigen::Matrix<double, -1, -1>, 0>, Eigen::Matrix<double, -1, 1>, 0>]’/usr/local/lib/R/site
Calls: stan_model ... cxxfunctionplus -> <Anonymous> -> cxxfunction -> compileCode
Error in sink(type = "output") : invalid connection
Calls: stan_model -> cxxfunctionplus -> sink
Execution halted

wds15 · March 25, 2021, 9:27pm

That are rstan related issues…maybe @hsbadr has a clue?

hsbadr · March 25, 2021, 10:26pm

The first error is valid, from stanc3. The second isn’t clear, but it could be related to the TBB headers, if @rok_cesnovar hasn’t updated the packages. Check the solution posted at Threading with backend = "rstan".

Philipp-Neubauer · March 25, 2021, 10:28pm

Thanks - just tried to see if it works with cmdstanr; it only works when I compile without cpp_options = list(stan_threads = TRUE); when I turn the flag on the job fails with:

Chain 4 num_threads = 4 
Chain 4 growth_model_ode_update_par_threads: stan/lib/stan_math/lib/eigen_3.3.9/Eigen/src/Core/DenseCoeffsBase.h:408: Eigen::DenseCoeffsBase<Derived, 1>::Scalar& Eigen::DenseCoeffsBase<Derived, 1>::operator[](Eigen::Index) [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::DenseCoeffsBase<Derived, 1>::Scalar = double; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.
Chain 4 growth_model_ode_update_par_threads: stan/lib/stan_math/lib/eigen_3.3.9/Eigen/src/Core/DenseCoeffsBase.h:408: Eigen::DenseCoeffsBase<Derived, 1>::Scalar& Eigen::DenseCoeffsBase<Derived, 1>::operator[](Eigen::Index) [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::DenseCoeffsBase<Derived, 1>::Scalar = double; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.
Chain 4 growth_model_ode_update_par_threads: stan/lib/stan_math/lib/eigen_3.3.9/Eigen/src/Core/DenseCoeffsBase.h:408: Eigen::DenseCoeffsBase<Derived, 1>::Scalar& Eigen::DenseCoeffsBase<Derived, 1>::operator[](Eigen::Index) [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::DenseCoeffsBase<Derived, 1>::Scalar = double; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.

Philipp-Neubauer · March 25, 2021, 10:34pm

thanks! will try and follow that thread.

Philipp-Neubauer · March 25, 2021, 11:47pm

@hsbadr That worked, thanks! Followed the steps outlined in the other post and it now runs using rstan. It does seem, though, that the TBB install (or perhaps just the RcppParallel update?) are incompatible with cmdstan…just FYI

Philipp-Neubauer · March 26, 2021, 3:02am

PS - spoke too soon it seems; I can’t seem to get the multi-threading to work. The model runs fine, but using only nchains CPU rather than nchains*threads_per_chain. I verified that it works fine with brms and rstan backend, so the install is now capable, it just seems that I am not…

I set:

require(rstan)
rstan_options(threads_per_chain = 4)

gmod <- stan_model(stanc_ret = stanc_builder('growth_model_ode_update_par.stan'),auto_write = T,verbose = F)

growth_mod <- sampling(gmod,
                   data    = mod_data,
                   chains  = chains,
                   thin    = thin,
                   init    = inits_ud,
                   refresh = 10,

                   warmup  = warmup,
                   iter    = warmup + niter,
                   control  = list(max_treedepth = 13,
                                   adapt_delta   = 0.95))

From Sys.getenv("STAN_NUM_THREADS") it seems that the rstan_options get parsed OK, but I still only get one thread per chain running:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                     
 130727 philipp   20   0 4769948   2.6g  15120 R 100.0   8.3   0:34.20 R                                                                                                                                                                           
 130728 philipp   20   0 4769956   2.6g  15120 R 100.0   8.3   0:34.13 R                                                                                                                                                                           
 130729 philipp   20   0 4769964   2.6g  15120 R 100.0   8.3   0:34.12 R                                                                                                                                                                           
 130726 philipp   20   0 4769940   2.6g  15120 R  99.7   8.3   0:34.23 R

whereas with brms these go to 400% per proc. I checked the brms code but couldn’t see what I might be missing here…

I tried setting cores = 16 in the sampling call etc, but no luck so far…

my updated model is now: growth_model_ode_update_par.stan (11.1 KB)

hsbadr · March 30, 2021, 11:44am

The Stan code should use reduce_sum to utilize within-chain parallelism. For example, setting threads argument in brms::brm will correctly generate the code. rstan doesn’t automatically modify the code for this.

Philipp-Neubauer · March 30, 2021, 6:44pm

Thanks for the reply - yes I did that (the model attached in the previous post has the reduce_sum function etc. When I call optimising it seems to work fine (for threads_per_chain=4):

PID    USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
215776 philipp   20   0 1585504 407640  72168 R 386.7   1.2   4:32.46 R

But in sampling mode the chains don’t seem to use threading. I’ve verified that the same works fine with brms and the model from the threading example in the docs.

Philipp-Neubauer · April 2, 2021, 9:29pm

FWIW - I have figured out that its the generated quantities block that leads to threading not happening; when I comment it out, threading works just fine in sampling mode. So for now I am tempted to wrap those bits in a function that I can export and do the post-processing in R…

wds15 · April 2, 2021, 10:00pm

Can u explain a bit more as to when threading does not happen when it should. Is this a bug?

Philipp-Neubauer · April 2, 2021, 10:10pm

In growth_model_ode_update_par.stan (11.2 KB) - threading works fine with the rstan option

rstan_options(threads_per_chain = 8)

but only when I comment out the generated quantities block. When the block is present, the CPU only ever shows 100% use per chain, (whereas with the above option without the GQ block I get this for 2 chains):

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                      
128374 philipp   20   0 1962976 568508  15232 S 743.4   1.7 284:58.71 R                                            
128375 philipp   20   0 1966100 565028  15232 S 738.7   1.7 285:43.22 R

I was wondering if the non-parallel execution of the GQ block takes so much time that I never get to see the parallel execution of the reduce_sum given update intervals of the cpu-monitor. But that seems unlikely? Let me know if there’s any other specific bits of info that could be useful here…

wds15 · April 3, 2021, 7:43am

I see. In the Gq block you do not use parallelism and then the execution takes very long in the gq part. As a result parallelization in the model block is not very effective. So just run first the inference and then in another run the gq block. Maybe use the gq run facility for that.

Funko_Unko · April 3, 2021, 8:36am

Then it’s also possible to use multiple threads per (output of) chain, isn’t it?

wds15 · April 3, 2021, 8:50am

You can run different chains in parallel, of course. As reduce_sum only returns a scalar it’s not useful for most gq simulations where you usually want the output at each time point rather than just the log-lik value.

Funko_Unko · April 3, 2021, 8:51am

Ah, okay, so multiple threads per chain output would not simply divide the samples among them?

Topic		Replies	Views
Help with reduce_sum Modeling	32	1617	August 4, 2020
Proposed parallelism RFC - Stan language bits Developers	14	1106	July 9, 2019
Long message when compiling reduce_sum model Interfaces cmdstanr	0	255	December 14, 2023
Four chains vs four jobs General cmdstan	28	439	June 19, 2024
Stuck at Warmup iteration with no error : CmdStanR CmdStan techniques , fitting-issues	48	3534	April 21, 2020

Using reduce_sum for ODE parallelisation

Related topics