Cmdstan 2.18 MPI

What about the tweaking I did re mpic++? Is there a more elegant approach?

Does it make sense that:
3 nodes 20 cores each - 148m
5 nodes 20 cores each - 100m
10 nodes 20 cores each - 122m
15 nodes 20 cores each - 132m

Maybe… I don't know. What you did is what Boost recommends in case your system is not easily auto-detectable. That can mean many things.

Depends on your model and the data you are fitting… and the hardware this is running on. We have InfiniBand on our cluster, which should give very good scaling. On a standard Ethernet fabric, scaling will be much worse.
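Those timings (improving up to 5 nodes, then degrading) are the classic strong-scaling shape once per-node communication overhead is included. A toy sketch, with made-up constants rather than anything fitted to the numbers above:

```python
# Toy strong-scaling model: T(n) = serial + parallel/n + overhead*n.
# All constants are illustrative, not fitted to the timings in this thread.
def runtime(n, serial=10.0, parallel=400.0, overhead=16.0):
    """Predicted wall time (arbitrary units) on n nodes."""
    return serial + parallel / n + overhead * n

times = {n: runtime(n) for n in (3, 5, 10, 15)}
best = min(times, key=times.get)     # node count with minimum predicted time
```

Past the sweet spot the overhead term dominates, which would be consistent with 10 and 15 nodes being slower than 5.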

… but what is your 1 core running time without MPI?

I have a time on Win 10, which is 92345 s, with no map_rect. Unfortunately the Win 10 machine is faster, so I still have to do a proper comparison, but my guess is the improvement is obvious. I will reformulate my PBPK problem for MPI to see if it becomes solvable in less than a day. Fortunately I have 19 subjects, so one node with 20 cores will be enough. Maybe in this case just threading will be sufficient?

@linas thanks for sharing this model, as well as the original.

I'm trying to understand this. The original model is pretty straightforward; I'm just using this as an example to figure out the MPI stuff.

You’re introducing two new things, not included in the first model.

Can you please explain the bias corrections you included, mathematically?
Also, what do you mean by shards? Is this you chopping up the parameters to allocate to different cores, or what? I get nodes and layers, but now I'm thinking of broken glass or something.

thanks!

Threading will already get you very far. If you want the last bit of performance, then go with MPI… which doesn't work on Windows, though. MPI is more efficient than threading.


In general, Bayesian neural networks have bias terms which may differ for each input or for each neuron. The math is in Radford Neal's thesis, Sections A.1 - A.3. His formulation is very general and I just dropped some terms. I only added the MPI stuff to make it faster.
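For reference, the equations in Neal's thesis (A.1 - A.3) have roughly this shape for one hidden layer, with a_j and b_k the per-unit bias terms (my paraphrase of his notation):

      h_j(x) = \tanh\Big(a_j + \sum_i u_{ij}\, x_i\Big)
      f_k(x) = b_k + \sum_j v_{jk}\, h_j(x)

The "bias corrections" are just these a_j and b_k terms, each given its own prior, so they can differ per input and per neuron.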

Shards are the number of MPI-enabled work units, which may be executed in sequence or in parallel depending on available resources. Intuitively, they are the number of groups the data and parameters are split into. I borrowed the name from the Linear, parallel regression example. A node has some number of cores. I think (I may be wrong) that you want as many cores as there are shards, so sometimes you have to get a few nodes. Sorry, I am not an expert on this…
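Outside of Stan, the map_rect idea behind shards can be sketched like this (hypothetical helper names; under MPI or threading each shard would run on its own core, here we just loop):

```python
# Sketch of the map_rect idea: split per-subject data into shards, evaluate
# each shard's log-likelihood contribution independently, then sum.
def shard(data, n_shards):
    """Split a list of per-subject data into n_shards roughly equal groups."""
    return [data[i::n_shards] for i in range(n_shards)]

def shard_lp(subjects):
    """Stand-in for one shard's log-likelihood (here just a dummy sum)."""
    return sum(subjects)

data = list(range(1, 20))            # e.g. 19 subjects
shards = shard(data, 5)              # 5 shards -> ideally 5 cores
total = sum(shard_lp(s) for s in shards)
```

The key property is that the total is the same however the data are grouped; the grouping only controls how much parallelism is available.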


The timing for the toy example was with threading; the big BNN problem was with MPI. I'll take your word…

Well, I reworked the model to support MPI, but Stan started giving errors such as:

Third argument to integrate_ode_bdf (initial times) must be data only and not reference parameters.

Any ideas how to deal with this? The model is attached.
pbpkauto.stan (12.3 KB)

It seems that the solution is to hardcode the values. Not too elegant, but a solution… However, that didn't work either. Can somebody help?

pbpk/pbpkauto.hpp(518): error: no instance of function template "stan::math::integrate_ode_bdf" matches the argument list
argument types are: (pbpkauto_model_namespace::pbpk_functor__, std::vector<stan::math::var, std::allocator<stan::math::var>>, double, std::vector<stan::math::var, std::allocator<stan::math::var>>, std::vector<stan::math::var, std::allocator<stan::math::var>>, std::vector<stan::math::var, std::allocator<stan::math::var>>, std::vector<int, std::allocator<int>>, std::ostream *)
stan::math::assign(y_hat_ind, integrate_ode_bdf(pbpk_functor__(), y0, 0.0, static_cast<std::vector<local_scalar_t__> >(stan::math::array_builder<local_scalar_t__ >().add(0.25).add(1).add(0.75).add(1.0).add(1.5).add(2.0).add(3.0).add(4.0).add(6.0).add(8.0).add(12.0).add(18.0).add(24.0).add(30.0).add(36.0).array()), theta_ind, static_cast<std::vector<local_scalar_t__> >(stan::math::array_builder<local_scalar_t__ >().add(1.0).array()), static_cast<std::vector<int> >(stan::math::array_builder<int>().add(1).array()), pstream__));
^
pbpk/pbpkauto.hpp(518): note: this candidate was rejected because arguments do not match
stan::math::assign(y_hat_ind, integrate_ode_bdf(pbpk_functor__(), y0, 0.0, static_cast<std::vector<local_scalar_t__> >(stan::math::array_builder<local_scalar_t__ >().add(0.25).add(1).add(0.75).add(1.0).add(1.5).add(2.0).add(3.0).add(4.0).add(6.0).add(8.0).add(12.0).add(18.0).add(24.0).add(30.0).add(36.0).array()), theta_ind, static_cast<std::vector<local_scalar_t__> >(stan::math::array_builder<local_scalar_t__ >().add(1.0).array()), static_cast<std::vector<int> >(stan::math::array_builder<int>().add(1).array()), pstream__));
^
stan/lib/stan_math/stan/math/rev/mat/functor/integrate_ode_bdf.hpp(13): note: this candidate was rejected because arguments do not match

Temporaries such as {1} inside a function are not considered data inside functions, which causes Stan to reject your integrate call. Please try to replace this

      y_hat_ind = integrate_ode_bdf(pbpk, y0, 0., 
         {0.25, 1, 0.75, 1., 1.5, 2., 3., 4., 6., 8., 12., 18., 24., 30., 36.}, theta_ind, 
         {1.}, {1});

with this

      y_hat_ind = integrate_ode_bdf(pbpk, y0, 0., 
         {0.25, 1, 0.75, 1., 1.5, 2., 3., 4., 6., 8., 12., 18., 24., 30., 36.}, theta_ind, 
         xs[1:1], xi[1:1]);

Then it should work.

Unfortunately it doesn't. But I think I also have to pass the ts argument to integrate_ode via xs. Thanks for the great idea; hopefully it will work.

Right now it gives:
argument types are: (pbpkauto_model_namespace::pbpk_functor__, std::vector<stan::math::var, std::allocator<stan::math::var>>, double, std::vector<stan::math::var, std::allocator<stan::math::var>>, std::vector<stan::math::var, std::allocator<stan::math::var>>, std::vector<double, std::allocator<double>>, std::vector<int, std::allocator<int>>, std::ostream *)
stan::math::assign(y_hat_ind, integrate_ode_bdf(pbpk_functor__(), y0, 0.0, static_cast<std::vector<local_scalar_t__> >(stan::math::array_builder<local_scalar_t__ >().add(0.25).add(1).add(0.75).add(1.0).add(1.5).add(2.0).add(3.0).add(4.0).add(6.0).add(8.0).add(12.0).add(18.0).add(24.0).add(30.0).add(36.0).array()), theta_ind, stan::model::rvalue(xs, stan::model::cons_list(stan::model::index_min_max(1, 1), stan::model::nil_index_list()), "xs"), stan::model::rvalue(xi, stan::model::cons_list(stan::model::index_min_max(1, 1), stan::model::nil_index_list()), "xi"), pstream__));
^
pbpk/pbpkauto.hpp(518): note: this candidate was rejected because arguments do not match
stan::math::assign(y_hat_ind, integrate_ode_bdf(pbpk_functor__(), y0, 0.0, static_cast<std::vector<local_scalar_t__> >(stan::math::array_builder<local_scalar_t__ >().add(0.25).add(1).add(0.75).add(1.0).add(1.5).add(2.0).add(3.0).add(4.0).add(6.0).add(8.0).add(12.0).add(18.0).add(24.0).add(30.0).add(36.0).array()), theta_ind, stan::model::rvalue(xs, stan::model::cons_list(stan::model::index_min_max(1, 1), stan::model::nil_index_list()), "xs"), stan::model::rvalue(xi, stan::model::cons_list(stan::model::index_min_max(1, 1), stan::model::nil_index_list()), "xi"), pstream__));
^
stan/lib/stan_math/stan/math/rev/mat/functor/integrate_ode_bdf.hpp(13): note: this candidate was rejected because arguments do not match
integrate_ode_bdf(const F& f, const std::vector<T_initial>& y0, double t0,

Oh, I overlooked that… but yes, please pass the ts argument via xs; it has the same problem, indeed.

The system of ODEs is running.

I am trying to install cmdstan on a different cluster. After:
git clone https://github.com/stan-dev/cmdstan.git --recursive

I don't see user-config.jam in stan/lib/stan_math/lib/boost_1.66.0 and therefore can't set mpicxx to the correct path. Any suggestions?

When you build things the first time, the user-config.jam gets created. So do make build-mpi, then edit the .jam file, then do a make clean-all, and then do make build again.
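For reference, the mpicxx override in user-config.jam is a single Boost Build line of roughly this shape (the compiler name or path is whatever your cluster provides; treat this as a sketch, not something verified against your setup):

      using mpi : mpicxx ;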

… BTW, this user-config.jam is Boost-specific; there is more documentation at boost.org (we use their Boost Build).

make build-mpi gives:
mpiCC -Wall -I . -isystem stan/lib/stan_math/lib/eigen_3.3.3 -isystem stan/lib/stan_math/lib/boost_1.66.0 -isystem stan/lib/stan_math/lib/sundials_3.1.0/include -std=c++1y -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -Wno-unused-function -Wno-uninitialized -I src -isystem stan/src -isystem stan/lib/stan_math/ -DFUSION_MAX_VECTOR_SIZE=12 -Wno-unused-local-typedefs -DEIGEN_NO_DEBUG -DNO_FPRINTF_OUTPUT -pipe -c -c -O0 -o stan/lib/stan_math/bin/math/prim/arr/functor/mpi_cluster_inst.o -fPIC -DSTAN_MPI stan/lib/stan_math/stan/math/prim/arr/functor/mpi_cluster_inst.cpp
icpc: command line warning #10159: invalid argument for option ‘-std’
icpc: command line warning #10006: ignoring unknown option ‘-Wno-unused-local-typedefs’
/usr/include/c++/4.4.7/c++0x_warning.h(31): catastrophic error: #error directive: This file requires compiler and library support for the upcoming ISO C++ standard, C++0x. This support is currently experimental, and must be enabled with the -std=c++0x or -std=gnu++0x compiler options.
#error This file requires compiler and library support for the upcoming
^

compilation aborted for stan/lib/stan_math/stan/math/prim/arr/functor/mpi_cluster_inst.cpp (code 4)
make: *** [stan/lib/stan_math/bin/math/prim/arr/functor/mpi_cluster_inst.o] Error 4

It seems openMPI 1.6.3 causes a problem.

ODE models solved by numerical integration run very slowly even with MPI. On Win 10, without MPI but with the Jacobian provided, it took 22 h to complete. On Linux, with MPI support and 8 cores (8 subjects), only 10% was completed in 5 h. While MPI gives a significant boost for regular models, the time to solve a system of 30+ ODEs numerically is so large that it offsets the benefits MPI provides.

It would be awesome if you could share some pointers on how to provide Jy and Jtheta. I am writing a paper comparing MCMC (Stan) on a system of ODEs, a home-grown approximate algorithm for the same system of ODEs, and MCMC on an approximation of the ODE system by nonlinear algebraic equations using orthogonal collocation. So far I have code to provide Jy/Jtheta for a system of ODEs in CmdStan 2.17. Is it possible to get help on providing Jy/Jtheta for a system of nonlinear algebraic equations in CmdStan 2.17, and even better in CmdStan 2.18 for both systems of ODEs and nonlinear algebraic equations, where I could use MPI?

The other cluster, it seems, is a bit outdated, and we have to upgrade MPI on our own before trying to build CmdStan.

Hi!

I think your Intel compiler is giving you problems. You could try to find out with mpiCC --show-me (or the similar syntax for Intel) exactly what you need to compile and link against your installed Intel MPI installation. Then pass this to a more modern compiler, which you hopefully have on the system and which is hopefully ABI-compatible with the MPI installation (it probably is)… or use threading on the cluster (again with a newer compiler). Intel compilers are not well supported by Stan at the moment.

It is probably true that these large ODE systems gain a lot from an analytical Jacobian. Hopefully I will find the time to get the old code working again with the new 2.18 ODE code. These large ODE systems are really a numerical nightmare given how we solve them. Adjoint integration would be much, much better for this!

Thanks in advance for finding the time. I thought that somewhere Stan calls Jy/Jtheta, and when Jacobians are not provided (the default) it computes them itself; by overriding that code one could substitute analytically calculated Jy/Jtheta.
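For context, Jy and Jtheta enter through the forward sensitivity system that the solver integrates alongside the states. Writing the ODE as y' = f(t, y, theta) with sensitivities S = dy/dtheta, the standard forward-sensitivity equations are (my sketch of the textbook form, not Stan's internal code):

      \frac{dS}{dt} = J_y\, S + J_\theta, \qquad J_y = \frac{\partial f}{\partial y}, \qquad J_\theta = \frac{\partial f}{\partial \theta}

This is why an analytic Jy/Jtheta helps so much here: with 30+ states the coupled system is large, and the Jacobian is evaluated at every solver step.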

I wonder if an NCP (non-centered parameterization) formulation would improve the situation, i.e. instead of

      lp = normal_lpdf(theta | mu, sigma)

use

      lp = normal_lpdf(theta | 0, 1)
      theta_new = mu + theta * sigma
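The two parameterizations are distributionally equivalent; a quick numerical sanity check (illustrative Python, not Stan; the mu and sigma values are made up):

```python
import random
import statistics

# Non-centered parameterization: draw theta_raw ~ N(0, 1), then shift/scale.
# theta = mu + sigma * theta_raw is distributed as N(mu, sigma).
random.seed(1)
mu, sigma = 3.0, 2.0                 # made-up values for the check
theta_raw = [random.gauss(0.0, 1.0) for _ in range(200_000)]
theta = [mu + sigma * z for z in theta_raw]

m = statistics.fmean(theta)          # close to mu
s = statistics.stdev(theta)          # close to sigma
```

In hierarchical models this often helps because the sampler then explores a standard normal instead of a funnel-shaped geometry, though whether it speeds up the ODE model itself is a separate question.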

It took 43 h to solve with the numerical Jacobian. In case you need the code for CmdStan 2.17.1, it is attached.
pbpkauto_jacobianCmdStan.txt (2.8 KB)

I still don't understand whether finite differences are used when the Jacobian is not provided. Somewhere I saw that Stan's autodiff is used to get the Jacobian.

Hi,

I tried to follow the setup for cmdStan with MPI outlined here and ran into a few problems.

I did:

  • download and extract the cmdStan 2.18 release from https://github.com/stan-dev/cmdstan/releases

  • get current make folder from development git

  • under cmdstan/stan/make created a local file and added

    • STAN_MPI=true
      CC=mpicxx
  • make build-mpi

    • (including the user-config.jam, where I changed the compiler to mpicxx, because Intel)
  • make clean-all

  • make build -j16

  • make [path to model.stan]

  • ran the executable with the sample and data file provided in the mentioned thread

    • ./examples/mpi/model sample data file= …

    • it's hard to say if it is working, since I get no output about the number of threads used

    • getting warnings

      • Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
        Exception: Exception: normal_lpdf: Scale parameter is 0, but must be > 0! (in ‘examples/mpi/model.stan’ at line 13)
        (in ‘examples/mpi/model.stan’ at line 57)
  • if I run the same command with mpirun -n 3 in front, it just looks as if the program is started 3 times (which would be correct), but the instances are not communicating

    • the time is the same as without mpirun

Sooo, I have no idea what I'm doing wrong, or even how to check what is going wrong.

Edit:

  • fixed make clean