I have implemented the same model three ways (no within-chain parallelism, map_rect(), and now reduce_sum()) to profile the speedup vs. resource consumption tradeoff of each approach.
All three models run. When the map_rect() version runs with one chain, it uses 8 cores (800% in top). When the reduce_sum() version runs with one chain, it uses only 1 core (100% in top).
My make/local looks like this, and I have done make clean-all and make build since editing it:
```make
CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread
STAN_THREADS=true
```
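For completeness, a single-chain run is launched like this (binary and file names are placeholders for my actual model and data; STAN_NUM_THREADS is set the same way for every run):

```bash
# same environment for the map_rect and reduce_sum binaries
STAN_NUM_THREADS=8 ./my_model sample data file=my_data.json output file=output.csv
```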
The reduce_sum() demo (redCards) works, and top reports CPU% > 100 while it runs.
Is this just grainsize=1 deciding that, at my data scale, within-chain parallelization isn't worth it? I increased the data size 10-fold and reduce_sum() still doesn't spread the work across cores.
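For reference, my call follows the standard reduce_sum() pattern. A minimal sketch of the shape (partial_sum, y, mu, and sigma are simplified stand-ins, not my actual model):

```stan
functions {
  // partial log-likelihood over the slice y[start:end]
  real partial_sum(array[] real y_slice, int start, int end,
                   vector mu, real sigma) {
    return normal_lupdf(y_slice | mu[start:end], sigma);
  }
}
data {
  int<lower=1> N;
  array[N] real y;
}
parameters {
  vector[N] mu;
  real<lower=0> sigma;
}
model {
  int grainsize = 1;  // let the scheduler choose slice sizes
  mu ~ std_normal();
  sigma ~ exponential(1);
  target += reduce_sum(partial_sum, y, grainsize, mu, sigma);
}
```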
I have also noticed that reduce_sum() only slices its first argument. I have three arguments that would benefit from slicing, one of which is a massive feature matrix. Currently it is passed as one of the shared (s1 … sN) arguments to reduce_sum() rather than as the sliced argument, because if I pass it as matrix[,] then matrix multiplication with it fails (“Ill-typed arguments for *. … matrix[,], vector”).
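A simplified sketch of the shape I am using now (X, beta, y, and sigma are stand-ins for my actual variables), with the failing sliced variant described in the comments:

```stan
functions {
  // Current shape: y is the sliced argument; the full feature matrix X
  // is passed as a shared argument, so every worker sees all of X.
  real partial_sum(array[] real y_slice, int start, int end,
                   matrix X, vector beta, real sigma) {
    return normal_lupdf(y_slice | X[start:end] * beta, sigma);
  }
  // What I wanted: make X itself the sliced argument. But the sliced
  // argument has to be an array type, and once X is declared as an
  // array its multiplication with beta is ill-typed, which is where
  // the error quoted above comes from.
}
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  array[N] real y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  beta ~ std_normal();
  sigma ~ exponential(1);
  target += reduce_sum(partial_sum, y, 1, X, beta, sigma);
}
```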