Here’s another parallel model: base.stan (9.7 KB) base.data.R (682.7 KB)
Here’s the serial version: base_serial.stan (8.5 KB)
The serial version takes about 35 seconds on a single core.
The reduce_sum version takes about 45 seconds on a single core.
A 4-core run of the reduce_sum code takes about 25 seconds per chain (so 4 chains run one after another would take about 100 seconds).
A 4-chain run of the single-core code takes about 45 seconds.
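To make the comparison explicit, here is a rough back-of-the-envelope calculation using only the approximate timings quoted above. The variable names are just for illustration, and the "4 chains one after another" figure assumes each chain gets all 4 cores in turn.

```python
# Approximate wall-clock timings quoted above, in seconds.
serial_1core = 35.0        # serial model, one core
reduce_sum_1core = 45.0    # reduce_sum model, one core
reduce_sum_4core = 25.0    # reduce_sum model, one chain on 4 cores
four_chains_parallel = 45.0  # 4 single-core chains run side by side

chains = 4
cores = 4

# Cost of reduce_sum bookkeeping when there is no parallelism to gain.
overhead = reduce_sum_1core / serial_1core

# Within-chain speedup from 4 cores, and how much of the ideal 4x it reaches.
speedup = reduce_sum_1core / reduce_sum_4core
efficiency = speedup / cores

# Total time for 4 chains if each one uses all 4 cores in turn (assumption).
four_chains_sequential = chains * reduce_sum_4core

print(f"reduce_sum overhead on one core: {overhead:.2f}x")
print(f"4-core speedup: {speedup:.2f}x (efficiency {efficiency:.2f})")
print(f"4 chains, threads within chain: {four_chains_sequential:.0f} s")
print(f"4 chains, one core each:        {four_chains_parallel:.0f} s")
```

On these numbers, with 4 cores and 4 chains, running one chain per core finishes sooner than threading within each chain.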
grainsize and N_subset are defined in the data file. Right now they are grainsize = 100 and N_subset = 400, so make N_subset bigger before trying more threads.
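As a rough sketch of why the current settings limit how many threads can be kept busy: if grainsize is treated as the approximate size of each slice handed to a thread (an assumption here, since the scheduler is free to choose slice sizes around that value), then N_subset / grainsize gives a ballpark ceiling on the number of slices available. The helper name max_chunks is hypothetical.

```python
import math


def max_chunks(n_terms: int, grainsize: int) -> int:
    """Ballpark upper bound on the number of slices, assuming each
    slice holds about `grainsize` terms (a simplification)."""
    if n_terms < 0 or grainsize < 1:
        raise ValueError("n_terms must be >= 0 and grainsize >= 1")
    return math.ceil(n_terms / grainsize)


# Current settings from the data file.
print(max_chunks(400, 100))

# A hypothetical larger N_subset with the same grainsize.
print(max_chunks(1600, 100))
```

With only about 4 slices to hand out, going beyond 4 threads is unlikely to help until N_subset is increased.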