Yeah, that is kinda expected in this case. Since `reduce_sum` only slices the first argument, `theta` isn't getting sliced, and the entire (5000x3) matrix gets copied to each thread/process. This copy time can outweigh the performance benefit of splitting the loop between processes.
At least that’s how I understand things, @wds15 does that sound right?
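
For what it's worth, here is a rough sketch of what slicing over `theta` could look like, so that each partial sum only receives its own rows. I don't know the actual model, so the likelihood, the data names (`x`, `y`, `sigma`, `grainsize`) and the function name `partial_sum` are all made up for illustration. It also assumes `theta` can be declared as an array of 3-vectors rather than a matrix, since the sliced argument has to be an array.

```stan
functions {
  // Hypothetical partial sum: theta_slice holds only this chunk's rows.
  // start and end are positions in the full arrays, used to index x and y.
  real partial_sum(array[] vector theta_slice, int start, int end,
                   array[] vector x, vector y, real sigma) {
    real lp = 0;
    for (i in 1:size(theta_slice)) {
      int n = start + i - 1;  // position in the full data
      lp += normal_lpdf(y[n] | dot_product(x[n], theta_slice[i]), sigma);
    }
    return lp;
  }
}
data {
  int<lower=1> N;            // 5000 in the case discussed here
  array[N] vector[3] x;      // made-up predictors
  vector[N] y;               // made-up outcome
  int<lower=1> grainsize;
}
parameters {
  array[N] vector[3] theta;  // array of rows, so it can be the sliced argument
  real<lower=0> sigma;
}
model {
  for (n in 1:N) {
    theta[n] ~ std_normal();
  }
  sigma ~ exponential(1);
  // theta comes first, so it is the argument that gets sliced;
  // x, y and sigma are passed along as shared arguments.
  target += reduce_sum(partial_sum, theta, grainsize, x, y, sigma);
}
```

Whether this actually ends up faster would still depend on the grainsize and on how expensive each term is, so it's worth timing both versions.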