I have implemented a model three ways: with no within-chain parallelism, with map_rect(), and now with reduce_sum(), to profile the speedup vs. resource-consumption tradeoff of each approach.
All three models run. When map_rect() runs with one chain, it uses 8 cores (800% in top); when reduce_sum() runs with one chain, it uses only 1 core (100% in top).
My make/local looks like this, and I have done make clean-all and make build since making it:

```
CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread
STAN_THREADS=true
```
The reduce_sum() demo (redCards) works, with top reporting CPU% > 100.
Is this just grainsize=1 deciding that my data scale doesn't warrant within-chain parallelization? I increased my data scale 10-fold and it still doesn't spread the work across cores.
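For concreteness, the call shape I mean is roughly the following sketch (function and variable names are placeholders, not my actual model); grainsize = 1 leaves slice sizing to the scheduler, while reduce_sum_static with a fixed grainsize forces chunks of a given size:

```stan
functions {
  // partial likelihood over a slice of the outcome array;
  // x and beta are passed through as shared arguments
  real partial_sum(int[] y_slice, int start, int end,
                   matrix x, vector beta) {
    return bernoulli_logit_lpmf(y_slice | x[start:end] * beta);
  }
}
model {
  // grainsize = 1: the TBB scheduler picks the slice sizes
  target += reduce_sum(partial_sum, y, 1, x, beta);
  // alternatively, force fixed-size chunks:
  // target += reduce_sum_static(partial_sum, y, N / 8, x, beta);
}
```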
I have also noticed that reduce_sum() seems to slice only its first argument. I have 3 arguments that would benefit from slicing, one of which is a massive feature matrix. Currently it is being passed as one of the shared arguments to reduce_sum() rather than as the sliced argument, because if I pass it as matrix[,] the matrix multiplication fails ("Ill-typed arguments for *. … matrix[,], vector").
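A workaround I am considering, sketched here with placeholder names (I have not confirmed this is the intended pattern): store the feature matrix as an array of row_vectors, which reduce_sum can slice, and do the multiplication row-wise inside the partial-sum function:

```stan
functions {
  // slice over the rows of the feature matrix, stored as
  // an array of row_vectors; y and beta stay shared
  real partial_sum(row_vector[] x_slice, int start, int end,
                   int[] y, vector beta) {
    real lp = 0;
    for (n in 1:size(x_slice))
      lp += bernoulli_logit_lpmf(y[start + n - 1] | x_slice[n] * beta);
    return lp;
  }
}
transformed data {
  row_vector[K] x_rows[N];
  for (n in 1:N)
    x_rows[n] = X[n];  // X is the original N x K feature matrix
}
model {
  target += reduce_sum(partial_sum, x_rows, 1, y, beta);
}
```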