reduce_sum() using only one thread

I have implemented a model three ways: with no within-chain parallelism, with map_rect(), and now with reduce_sum(), to profile the speedup vs. resource consumption tradeoff of each approach.

All three models run. When map_rect() runs with one chain, it uses 8 cores (800% in top). When reduce_sum() runs with one chain, it uses only 1 core (100% in top).

My make/local looks like this, and I have run make clean-all and make build since editing it.

CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread
STAN_THREADS=true

The reduce_sum() demo (redCards) works, with top reporting CPU% > 100.

Is this just reduce_sum(), given my grainsize=1, deciding that my data is too small to be worth parallelizing within the chain? I increased my data scale 10-fold and it still runs on a single core.

I have also noticed that reduce_sum() seems to slice only its first argument. I have 3 arguments that would benefit from slicing, one of which is a massive feature matrix. Currently I pass it as one of the shared arguments (s1, s2, ...) to reduce_sum() rather than as the sliced argument, because if I pass it as matrix[,] then matrix multiplication with it fails (“Ill-typed arguments for *. … matrix[,], vector”).

If our examples work for you and scale over CPUs, then your setup should be fine… maybe post your model.

You can only slice the first argument. If you need to slice more than one thing, you have to pack them all into the first argument accordingly. However, note that arguments which are data do not really need to be sliced: data is passed around by reference, so there is no copying.
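To make that calling convention concrete, here is a rough Python model of it. It is an illustration only, not Stan's implementation: it runs serially and uses fixed-size chunks, whereas reduce_sum() schedules its slices dynamically. The names reduce_sum_sketch, partial_sum, X and beta are hypothetical. The point is that only the first argument is cut up; the shared arguments arrive whole, and the partial-sum function picks out its own rows using the start and end indices it is given.

```python
def reduce_sum_sketch(partial_sum, sliced, grainsize, *shared):
    """Serial stand-in for the reduce_sum() calling convention."""
    total = 0.0
    n = len(sliced)
    for lo in range(0, n, grainsize):
        hi = min(lo + grainsize, n)
        # Only the first argument is sliced; start/end are 1-based and inclusive.
        total += partial_sum(sliced[lo:hi], lo + 1, hi, *shared)
    return total


def partial_sum(y_slice, start, end, X, beta):
    """Hypothetical partial sum: squared error of a linear predictor."""
    rows = X[start - 1:end]  # the shared matrix is indexed here, not pre-sliced
    return sum(
        (y - sum(x * b for x, b in zip(row, beta))) ** 2
        for y, row in zip(y_slice, rows)
    )


y = [1.0, 2.0, 3.0, 4.0]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # whole feature matrix
beta = [1.0, 2.0]

whole = partial_sum(y, 1, len(y), X, beta)                # one big slice
chunked = reduce_sum_sketch(partial_sum, y, 2, X, beta)   # two slices of 2
```

Both whole and chunked come out the same, which is the property the partial-sum function has to satisfy. In a Stan model the analogous step would be to leave the feature matrix as a shared matrix argument and take X[start:end] inside the partial-sum function before multiplying, so it stays a plain matrix rather than a matrix[,].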

Never mind, 100% my fault. I was passing a different int instead of grainsize.
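For anyone who lands here with the same symptom, a small sketch of why a wrong grainsize can pin everything to one core. The thread does not say which value was actually passed, so this assumes it was at least as large as the number of terms being summed. It also assumes fixed-size chunks for simplicity; reduce_sum() chooses chunk sizes dynamically, but a range no larger than grainsize is not split further either way. The function name and the numbers are hypothetical.

```python
import math


def busy_threads(n_terms, grainsize, n_threads):
    """Rough upper bound on threads that can be busy at once."""
    # With fixed-size chunks there are ceil(n_terms / grainsize) partial sums.
    n_chunks = math.ceil(n_terms / max(1, grainsize))
    return min(n_chunks, n_threads)


n_terms = 10_000   # hypothetical number of terms in the sliced argument
n_threads = 8      # hypothetical threads per chain

for grainsize in (1, 1_250, 10_000, 50_000):
    print(grainsize, busy_threads(n_terms, grainsize, n_threads))
```

Under these assumptions, any grainsize at or above n_terms leaves a single chunk, so only one thread has work, which matches the 100% seen in top.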