Using reduce_sum for ODE parallelisation

Hi wonderful Stan folks -

I’m looking to parallelize an (otherwise slow) ODE model. To this end I installed rstan 2.26.1 dev (& StanHeaders) according to this post; verified that the original (non-parallel) model compiled and ran under that config.

Next step; following the reduce_sum example here I set up the partial sum functions for the likelihoods including the ODE evaluation (there are >3k observation for which to run this). The parallel code is here growth_model_ode_update.stan (12.1 KB) .

Note there is one partial sum for a second likelihood that does not depend on the ODE and does not throw errors…so only the ODE config gives trouble. After a few hours of debugging I finally got to this error message:

Error in stanc(model_code = model_code, model_name = model_name, verbose = verbose,  : 
  0

Semantic error in 'string', line 199, column 12 to line 200, column 157:

Ill-typed arguments supplied to function 'reduce_sum'. Available arguments:
(T[], int, int, ...) => real, T[], int, ...
(T[,], int, int, ...) => real, T[,], int, ...
(T[,,], int, int, ...) => real, T[,,], int, ...
(T[,,,], int, int, ...) => real, T[,,,], int, ...
(T[,,,,], int, int, ...) => real, T[,,,,], int, ...
(T[,,,,,], int, int, ...) => real, T[,,,,,], int, ...
(T[,,,,,,], int, int, ...) => real, T[,,,,,,], int, ...
Where T is any one of int, real, vector, row_vector or matrix.
Instead supplied arguments of incompatible type: (vector, int, int, real, real, real, real, vector, vector, vector, data real[], data real[], data real[], data int[]) => real, vector, int, int, real, real, real, real, vector, vector, vector, real[], real[], real[], int[]

Calls: stan_model -> stanc_builder -> stanc
Execution halted

The error seems the same as discussed here:. Based on taht, I added data types to the partial sum function, but run into the rror that the “data” type is not maintained through the function call (I think), However, given subsequent benchmarks here, I wonder if this should work in some config other than the one I am trying here?

Wondering if @wds15 may have some insights?

Thanks lots in advance!


> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C               LC_TIME=en_NZ.UTF-8        LC_COLLATE=en_NZ.UTF-8    
 [5] LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8    LC_PAPER=en_NZ.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.26.1       StanHeaders_2.26.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6         compiler_4.0.4     pillar_1.5.1       prettyunits_1.1.1  tools_4.0.4       
 [6] pkgbuild_1.2.0     jsonlite_1.7.2     lifecycle_1.0.0    tibble_3.1.0       gtable_0.3.0      
[11] pkgconfig_2.0.3    rlang_0.4.10       DBI_1.1.1          cli_2.3.1          rstudioapi_0.13   
[16] parallel_4.0.4     curl_4.3           loo_2.4.1          gridExtra_2.3      dplyr_1.0.5       
[21] generics_0.1.0     vctrs_0.3.6        tidyselect_1.1.0   stats4_4.0.4       grid_4.0.4        
[26] glue_1.4.2         inline_0.3.17      R6_2.4.1           processx_3.4.5     fansi_0.4.2       
[31] purrr_0.3.4        ggplot2_3.3.3      callr_3.5.1        magrittr_2.0.1     codetools_0.2-17  
[36] matrixStats_0.56.0 scales_1.1.1       ps_1.3.3           ellipsis_0.3.1     assertthat_0.2.1  
[41] colorspace_2.0-0   V8_3.4.0           utf8_1.1.4         RcppParallel_5.0.3 munsell_0.5.0     
[46] crayon_1.3.4      

Without any experience whatsoever, the error msg makes it look like the first argument has to be an array. E.g. an array of reals or of vectors, but not a vector.

Hi!

I just looked into the model and @Funko_Unko is correct. The first argument must a real array instead of a vector.

Sebastian

EDIT: And you should really upgrade to the new variadic ode interface which is 25% faster than the old one! It’s also a lot easier to write down your ODEs with the new interface. So use ode_rk45 instead of integrate_ode_rk45.

Great - thank you both! that did the trick; checking out the new interface as well - it does look a lot cleaner.

Appreciate the quick help!

PS - though the translation in stanc now works, I am stuck with a compile error - are there specific downstream requirements for the new ODE functions?

I should add that this is with a fresh install of RcppEigen_0.3.3.9.1 before installing the rstan 2.26.1 as per post above.

    make cmd is
  make -f '/usr/lib/R/etc/Makeconf' -f '/usr/share/R/share/make/shlib.mk' CXX='$(CXX14) $(CXX14STD)' CXXFLAGS='$(CXX14FLAGS)' CXXPICFLAGS='$(CXX14PICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX14LDFLAGS)' SHLIB_LD='$(SHLIB_CXX14LD)' SHLIB='file8617bae0a63.so' OBJECTS='file8617bae0a63.o'

make would use
g++ -std=gnu++14 -I"/usr/share/R/include" -DNDEBUG   -I"/usr/local/lib/R/site-library/Rcpp/include/"  -I"/usr/local/lib/R/site-library/RcppEigen/include/"  -I"/usr/local/lib/R/site-library/RcppEigen/include/unsupported"  -I"/usr/local/lib/R/site-library/BH/include" -I"/usr/local/lib/R/site-library/StanHeaders/include/src/"  -I"/usr/local/lib/R/site-library/StanHeaders/include/"  -I"/usr/local/lib/R/site-library/RcppParallel/include/"  -I"/usr/local/lib/R/site-library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DUSE_STANC3 -DSTRICT_R_HEADERS  -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -DBOOST_NO_AUTO_PTR  -include '/usr/local/lib/R/site-library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1     -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-V28x5H/r-base-3.6.3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c file8617bae0a63.cpp -o file8617bae0a63.o
if test  "zfile8617bae0a63.o" != "z"; then \
  echo g++ -std=gnu++14 -shared -L"/usr/lib/R/lib" -Wl,-Bsymbolic-functions -Wl,-z,relro -o file8617bae0a63.so file8617bae0a63.o  '/usr/local/lib/R/site-library/rstan/lib//libStanServices.a' -L'/usr/local/lib/R/site-library/StanHeaders/lib/' -lStanHeaders -L'/usr/local/lib/R/site-library/RcppParallel/lib/' -ltbb   -L"/usr/lib/R/lib" -lR; \
  g++ -std=gnu++14 -shared -L"/usr/lib/R/lib" -Wl,-Bsymbolic-functions -Wl,-z,relro -o file8617bae0a63.so file8617bae0a63.o  '/usr/local/lib/R/site-library/rstan/lib//libStanServices.a' -L'/usr/local/lib/R/site-library/StanHeaders/lib/' -lStanHeaders -L'/usr/local/lib/R/site-library/RcppParallel/lib/' -ltbb   -L"/usr/lib/R/lib" -lR; \
fi
Error in compileCode(f, code, language = language, verbose = verbose) : 
  /usr/local/lib/R/site-library/RcppEigen/include/Eigen/src/Core/ProductEvaluators.h:35:90:   required from ‘Eigen::internal::evaluator<Eigen::Product<Lhs, Rhs, Option> >::evaluator(const XprType&) [with Lhs = Eigen::Product<Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<double, double>, const Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<double>, const Eigen::Matrix<double, 1, -1> >, const Eigen::Transpose<Eigen::Matrix<double, -1, 1> > >, Eigen::Matrix<double, -1, -1>, 0>; Rhs = Eigen::Matrix<double, -1, 1>; int Options = 0; Eigen::internal::evaluator<Eigen::Product<Lhs, Rhs, Option> >::XprType = Eigen::Product<Eigen::Product<Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<double, double>, const Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<double>, const Eigen::Matrix<double, 1, -1> >, const Eigen::Transpose<Eigen::Matrix<double, -1, 1> > >, Eigen::Matrix<double, -1, -1>, 0>, Eigen::Matrix<double, -1, 1>, 0>]’/usr/local/lib/R/site
Calls: stan_model ... cxxfunctionplus -> <Anonymous> -> cxxfunction -> compileCode
Error in sink(type = "output") : invalid connection
Calls: stan_model -> cxxfunctionplus -> sink
Execution halted

That are rstan related issues…maybe @hsbadr has a clue?

The first error is valid, from stanc3. The second isn’t clear, but it could be related to the TBB headers, if @rok_cesnovar hasn’t updated the packages. Check the solution posted at Threading with backend = "rstan".

Thanks - just tried to see if it works with cmdstanr; it only works when I compile without cpp_options = list(stan_threads = TRUE); when I turn the flag on the job fails with:

Chain 4 num_threads = 4 
Chain 4 growth_model_ode_update_par_threads: stan/lib/stan_math/lib/eigen_3.3.9/Eigen/src/Core/DenseCoeffsBase.h:408: Eigen::DenseCoeffsBase<Derived, 1>::Scalar& Eigen::DenseCoeffsBase<Derived, 1>::operator[](Eigen::Index) [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::DenseCoeffsBase<Derived, 1>::Scalar = double; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.
Chain 4 growth_model_ode_update_par_threads: stan/lib/stan_math/lib/eigen_3.3.9/Eigen/src/Core/DenseCoeffsBase.h:408: Eigen::DenseCoeffsBase<Derived, 1>::Scalar& Eigen::DenseCoeffsBase<Derived, 1>::operator[](Eigen::Index) [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::DenseCoeffsBase<Derived, 1>::Scalar = double; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.
Chain 4 growth_model_ode_update_par_threads: stan/lib/stan_math/lib/eigen_3.3.9/Eigen/src/Core/DenseCoeffsBase.h:408: Eigen::DenseCoeffsBase<Derived, 1>::Scalar& Eigen::DenseCoeffsBase<Derived, 1>::operator[](Eigen::Index) [with Derived = Eigen::Matrix<double, -1, 1>; Eigen::DenseCoeffsBase<Derived, 1>::Scalar = double; Eigen::Index = long int]: Assertion `index >= 0 && index < size()' failed.

thanks! will try and follow that thread.

@hsbadr That worked, thanks! Followed the steps outlined in the other post and it now runs using rstan. It does seem, though, that the TBB install (or perhaps just the RcppParallel update?) are incompatible with cmdstan…just FYI

PS - spoke too soon it seems; I can’t seem to get the multi-threading to work. The model runs fine, but using only nchains CPU rather than nchains*threads_per_chain. I verified that it works fine with brms and rstan backend, so the install is now capable, it just seems that I am not…

I set:

require(rstan)
rstan_options(threads_per_chain = 4)

gmod <- stan_model(stanc_ret = stanc_builder('growth_model_ode_update_par.stan'),auto_write = T,verbose = F)

growth_mod <- sampling(gmod,
                   data    = mod_data,
                   chains  = chains,
                   thin    = thin,
                   init    = inits_ud,
                   refresh = 10,

                   warmup  = warmup,
                   iter    = warmup + niter,
                   control  = list(max_treedepth = 13,
                                   adapt_delta   = 0.95))

From Sys.getenv("STAN_NUM_THREADS") it seems that the rstan_options get parsed OK, but I still only get one thread per chain running:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                     
 130727 philipp   20   0 4769948   2.6g  15120 R 100.0   8.3   0:34.20 R                                                                                                                                                                           
 130728 philipp   20   0 4769956   2.6g  15120 R 100.0   8.3   0:34.13 R                                                                                                                                                                           
 130729 philipp   20   0 4769964   2.6g  15120 R 100.0   8.3   0:34.12 R                                                                                                                                                                           
 130726 philipp   20   0 4769940   2.6g  15120 R  99.7   8.3   0:34.23 R  

whereas with brms these go to 400% per proc. I checked the brms code but couldn’t see what I might be missing here…

I tried setting cores = 16 in the sampling call etc, but no luck so far…

my updated model is now: growth_model_ode_update_par.stan (11.1 KB)

The Stan code should use reduce_sum to utilize within-chain parallelism. For example, setting threads argument in brms::brm will correctly generate the code. rstan doesn’t automatically modify the code for this.

Thanks for the reply - yes I did that (the model attached in the previous post has the reduce_sum function etc. When I call optimising it seems to work fine (for threads_per_chain=4):

PID    USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
215776 philipp   20   0 1585504 407640  72168 R 386.7   1.2   4:32.46 R  

But in sampling mode the chains don’t seem to use threading. I’ve verified that the same works fine with brms and the model from the threading example in the docs.

FWIW - I have figured out that its the generated quantities block that leads to threading not happening; when I comment it out, threading works just fine in sampling mode. So for now I am tempted to wrap those bits in a function that I can export and do the post-processing in R…

Can u explain a bit more as to when threading does not happen when it should. Is this a bug?

In growth_model_ode_update_par.stan (11.2 KB) - threading works fine with the rstan option

rstan_options(threads_per_chain = 8)
  • but only when I comment out the generated quantities block. When the block is present, the CPU only ever shows 100% use per chain, (whereas with the above option without the GQ block I get this for 2 chains):
PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                      
128374 philipp   20   0 1962976 568508  15232 S 743.4   1.7 284:58.71 R                                            
128375 philipp   20   0 1966100 565028  15232 S 738.7   1.7 285:43.22 R           

I was wondering if the non-parallel execution of the GQ block takes so much time that I never get to see the parallel execution of the reduce_sum given update intervals of the cpu-monitor. But that seems unlikely? Let me know if there’s any other specific bits of info that could be useful here…

I see. In the Gq block you do not use parallelism and then the execution takes very long in the gq part. As a result parallelization in the model block is not very effective. So just run first the inference and then in another run the gq block. Maybe use the gq run facility for that.

Then it’s also possible to use multiple threads per (output of) chain, isn’t it?

You can run different chains in parallel, of course. As reduce_sum only returns a scalar it’s not useful for most gq simulations where you usually want the output at each time point rather than just the log-lik value.

Ah, okay, so multiple threads per chain output would not simply divide the samples among them?