I have implemented a model using no within-chain parallelism, using map_rect(), and now using reduce_sum(), to profile the speedup vs. resource-consumption tradeoff of each approach. All three models run. When the map_rect() version runs with one chain, it uses 8 cores (800% in top). When the reduce_sum() version runs with one chain, it uses only 1 core (100% in top).
My make/local looks like this, and I have done make clean-all and make build since creating it:
CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread
STAN_THREADS=true
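For context, a single chain is launched roughly like this (CmdStan; the executable and file names here are placeholders), with the same environment for both the map_rect() and reduce_sum() runs:

```
export STAN_NUM_THREADS=8
./my_model sample data file=my_data.json output file=samples.csv
```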
The reduce_sum() demo (redCards) works, with top reporting CPU% > 100.
Is this just grainsize=1 deciding that within-chain parallelization isn’t worthwhile at my data scale? I increased the data size 10-fold and it still doesn’t spread the work across more than one core.
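For concreteness, the reduce_sum() structure I’m describing looks roughly like the placeholder below (a stripped-down linear-regression stand-in, not my actual model): the outcome array is the sliced argument, and the feature matrix, coefficients, and scale are shared.

```stan
functions {
  // Partial log-likelihood over a slice of the outcome array;
  // X, beta, and sigma are shared arguments.
  real partial_sum(array[] real y_slice, int start, int end,
                   matrix X, vector beta, real sigma) {
    return normal_lpdf(to_vector(y_slice) | X[start:end] * beta, sigma);
  }
}
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  array[N] real y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  int grainsize = 1;  // 1 lets the internal scheduler choose the partition sizes
  beta ~ std_normal();
  sigma ~ exponential(1);
  target += reduce_sum(partial_sum, y, grainsize, X, beta, sigma);
}
```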
I have also noticed that reduce_sum() seems to slice only its first argument. I have 3 arguments that would benefit from slicing, one of which is a massive feature matrix. Currently it is being passed as one of the shared (s1, s2, …) arguments to reduce_sum() instead of as the sliced argument, because if I pass it as matrix[,] then matrix multiplication with it fails (“Ill-typed arguments for *. … matrix[,], vector”).
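One variant I’ve considered (again just a sketch, reusing the data and parameters from the placeholder above, not code I currently run) would make the feature matrix itself the sliced argument by pre-converting it to an array of row_vectors:

```stan
functions {
  // Sketch: slice over rows of the feature matrix; y, beta, and sigma are shared.
  real partial_sum_rows(array[] row_vector x_slice, int start, int end,
                        array[] real y, vector beta, real sigma) {
    int n = end - start + 1;
    vector[n] mu;
    for (i in 1:n)
      mu[i] = x_slice[i] * beta;  // row_vector * vector -> real
    return normal_lpdf(y[start:end] | mu, sigma);
  }
}
transformed data {
  // Convert matrix[N, K] X into an array of rows so reduce_sum() can slice it.
  array[N] row_vector[K] x_rows;
  for (n in 1:N)
    x_rows[n] = X[n];
}
model {
  target += reduce_sum(partial_sum_rows, x_rows, 1, y, beta, sigma);
}
```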