Help with multi-threading a simple ordinal probit model using reduce_sum

Yeah, that's kind of expected in this case. Since reduce_sum only slices its first argument, theta isn't sliced, and the entire 5000x3 matrix gets copied for each thread. That copy time can outweigh the performance benefit of splitting the loop across threads.

At least that's how I understand things. @wds15, does that sound right?
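
If that's what is happening, one thing to try is making theta the sliced (first) argument, so each partial sum only receives its own rows. Here's a rough sketch, not your actual model: I'm assuming theta can be stored as an array of N length-3 probability vectors and that the outcome y is plain data. The names `partial_sum`, `y`, and `grainsize` are just placeholders I made up.

```stan
functions {
  // Hypothetical partial sum: theta_slice holds only rows start..end of theta,
  // so the full 5000x3 structure is never passed in as a shared argument.
  real partial_sum(array[] vector theta_slice,
                   int start, int end,
                   array[] int y) {
    real lp = 0;
    for (i in 1:size(theta_slice)) {
      // y is shared data; index it with the global position start + i - 1
      lp += categorical_lpmf(y[start + i - 1] | theta_slice[i]);
    }
    return lp;
  }
}
data {
  int<lower=1> N;                       // e.g. 5000
  array[N] int<lower=1, upper=3> y;     // observed category (assumed coding 1..3)
  int<lower=1> grainsize;
}
parameters {
  // Assumption for illustration only: in the real model theta is probably
  // computed from cutpoints and a linear predictor instead.
  array[N] simplex[3] theta;
}
model {
  // theta comes first, so it is the argument that gets sliced
  target += reduce_sum(partial_sum, theta, grainsize, y);
}
```

Whether this actually helps depends on how theta is built in your model. If it's computed from a handful of parameters, it may be cheaper to slice y and compute the probabilities inside the partial sum instead.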
