Help speeding up Bernoulli Gaussian process model

Drats, those are just integrated graphics, so there goes that idea. I would still recommend trying the glm function without a GPU; the more efficient construction might outperform reduce_sum, given the copy costs that reduce_sum involves.
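In case it helps, here is a minimal sketch of the kind of glm construction I mean, assuming a plain logistic regression with a design matrix `X` (the names are placeholders, and a GP model would of course build its linear predictor differently):

```stan
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  // the fused glm likelihood handles the linear predictor and the
  // Bernoulli-logit likelihood in a single call
  y ~ bernoulli_logit_glm(X, alpha, beta);
}
```

The `_glm` form computes the whole likelihood and its gradient in one call, which is typically faster than composing `X * beta + alpha` and `bernoulli_logit` separately.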

If you have access to a local system with a discrete GPU, it would also be worth trying there, since GPU processing offers a greater level of parallelism than you can achieve through reduce_sum (more or less).

The lesson here is when PSSC asks ‘Are you sure you don’t need a GPU?’, you always take the GPU…

For the other arguments in a reduce_sum, would it be better to pass them by reference, as in C++?

Unfortunately that’s not possible. Parameters in Stan (var types in the C++ backend) store both a value and an adjoint, and these have to be accessed and updated as part of the automatic differentiation process. If they were accessed by multiple threads, a race condition would be introduced: the adjoint for a given var would differ depending on how many threads had finished accumulating their adjoints. To avoid this, each thread works with its own copy of the parameter, so the resulting adjoints are not affected by the other threads.
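As a rough illustration of where the copying happens, here is a minimal reduce_sum sketch (all names here are placeholders, not taken from your model): `y` is the sliced argument, while `X`, `alpha`, and `beta` are shared arguments. Since `X` is data it is not copied, but `alpha` and `beta` are parameters, so each thread gets its own copy.

```stan
functions {
  // y_slice is the sliced argument; X, alpha, beta are shared arguments.
  // X is data, so it is never copied; alpha and beta are parameters,
  // so each thread works on its own copy to keep adjoint accumulation
  // free of race conditions.
  real partial_log_lik(array[] int y_slice, int start, int end,
                       data matrix X, real alpha, vector beta) {
    return bernoulli_logit_lpmf(y_slice | X[start:end] * beta + alpha);
  }
}
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  int grainsize = 1;
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  target += reduce_sum(partial_log_lik, y, grainsize, X, alpha, beta);
}
```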

This is a bit of a rough explanation, @wds15 anything there that should be corrected?

What you write is correct for the shared parameters. These must be copied per thread (note that data is not copied). However, the sliced-over variables are not copied, since no two threads ever work on the same sliced variables. This is why it is a lot more efficient to put things into the sliced variable when they vary by the item you reduce over.
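To make that concrete, here is a simplified sketch (not the original GP model; `z`, `y`, and `partial_log_lik` are placeholder names) where per-observation latent values are passed as the sliced argument rather than as a shared argument, so they are never copied across threads:

```stan
functions {
  // z_slice holds only the per-observation latent values this thread
  // needs, so z is never duplicated across threads; y is data and is
  // not copied either.
  real partial_log_lik(array[] real z_slice, int start, int end,
                       data array[] int y) {
    return bernoulli_logit_lpmf(y[start:end] | to_vector(z_slice));
  }
}
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  array[N] real z;  // per-observation latent values (stand-in for a GP)
}
model {
  int grainsize = 1;
  z ~ std_normal();
  target += reduce_sum(partial_log_lik, z, grainsize, y);
}
```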

Also note that with 2.26.0 the overhead due to the shared arguments was drastically reduced. Before 2.26.0 we made a copy for each partial sum being formed, while now we copy the shared arguments only once per thread (and a given thread can work on multiple partial sums).

There is a slide on this in my StanCon 2020 contribution.
