Cmdstan 2.18 MPI

Yes, I started with threading. Since I am not an expert I compiled the following notes (I have gcc version 4.8.5). Can you please glance at them in case I made a mistake?

I did the following steps to recompile cmdstan:
make clean-all
make/local contains CXXFLAGS += -DSTAN_THREADS -pthread
cd cmdstan/make
git fetch
git checkout develop models
cd …
make build
make mpi10/foo
cd mpi10

In the job submission script I have used:
export STAN_NUM_THREADS=10
time ./foo sample …

I have set shards to 10.

On the node that was assigned to me by PBS I found only one foo process. Only 10 cores were used. Is this as expected?
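
As far as I understand, with -DSTAN_THREADS the shards run as threads inside one process, so a single foo process would be what to expect. Here is a minimal sketch for checking the thread count on the node, assuming Linux with a /proc filesystem. The helper name thread_count is my own, and the reported number may be somewhat larger than STAN_NUM_THREADS because it includes the main thread.

```shell
# Thread count of one process, read from the Linux /proc filesystem.
# (Assumption: /proc/<pid>/status has a "Threads:" line, as on Linux.)
thread_count() {
  awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Print PID and thread count for every process whose command name is "foo".
for dir in /proc/[0-9]*; do
  if [ "$(cat "$dir/comm" 2>/dev/null)" = "foo" ]; then
    pid=${dir#/proc/}
    echo "foo pid=$pid threads=$(thread_count "$pid")"
  fi
done
```

Run on the compute node while the job is active, this should print one line per foo process.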

I ran the test from "Linear, parallell regression" and the run time scales almost linearly with the number of shards (I have omitted the seconds):
2 shards 34m
4 shards 16m
5 shards 14m
10 shards 6m
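
To make "almost linear" concrete, here is a small sketch that turns the timings above into speedup and parallel efficiency relative to the 2-shard run. The helper name report_scaling is made up for illustration. Because the seconds are dropped, efficiencies slightly above 100% should be read as rounding noise.

```shell
# Speedup and parallel efficiency relative to the first (baseline) run.
# Reads "shards minutes" pairs on stdin; minutes are the truncated timings.
report_scaling() {
  awk 'NR == 1 { s0 = $1; t0 = $2 }
       {
         speedup = t0 / $2   # measured speedup vs. baseline
         ideal   = $1 / s0   # speedup if scaling were perfectly linear
         printf "%d shards: speedup %.1f (ideal %.1f), efficiency %.0f%%\n",
                $1, speedup, ideal, 100 * speedup / ideal
       }'
}

report_scaling <<'EOF'
2 34
4 16
5 14
10 6
EOF
```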

I also added a separate folder for the MPI version of cmdstan (eventually I have to use multiple nodes). Since I was getting
MPI auto-detection failed: unknown wrapper compiler mpic++
I have assembled the following notes to build cmdstan:
git clone https://github.com/stan-dev/cmdstan.git --recursive
make clean-all

for threading support
make/local contains
CXXFLAGS += -DSTAN_THREADS -pthread

for MPI support
make/local contains
STAN_MPI=true
CC=mpicxx
stan/lib/stan_math/lib/boostxxx/user-config.jam contains
using mpi : /apps/cent7/intel/impi/2017.1.132/bin64/mpicxx ;
instead of just using mpi ;

cd cmdstan/make
git fetch
git checkout develop models
cd …
make build

I have now submitted the job with:
time mpiexec -n 100 ./foo sample…

On each node I had 20 foo instances. Is this as expected?
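
With 100 ranks and 20 instances per node, the job would span 5 nodes, and as far as I understand mpiexec starts one foo process per MPI rank. Here is a sketch for checking the placement from inside the job script. It assumes PBS/Torque exports PBS_NODEFILE with one hostname per allocated slot, and ranks_per_node is a hypothetical helper name.

```shell
# Count how many slots (and hence MPI ranks, if one rank per slot)
# each host contributes.
# $1: path to a node file with one hostname per allocated slot.
ranks_per_node() {
  sort "$1" | uniq -c | awk '{ print $2 ": " $1 }'
}

# Only runs inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
  ranks_per_node "$PBS_NODEFILE"
fi
```

If the output lists 5 hosts with 20 slots each, the 20 foo instances per node match the allocation.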