One of the big problems with GPU speedups for us is that we are almost certainly going to need more than single precision. It's also hard with our architecture to preload and reuse data. So we're first looking at low-data-flow, high-compute operations like Cholesky decomposition (quadratic data, cubic time).
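To make the data/compute ratio concrete, here's a minimal plain-Python Cholesky sketch (not Stan's implementation, which is C++ on Eigen) that counts multiply-adds: it reads n² inputs but does on the order of n³/6 inner-loop multiplies, which is why the transfer cost to a GPU can be amortized.

```python
import math

def cholesky(a):
    """Return (L, flops) where L is lower-triangular with L L^T = a.

    Illustrative sketch only: n*n data in, ~n^3/6 multiply-adds,
    so compute grows a factor of n faster than data movement.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    flops = 0
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            flops += j  # one multiply-add per term of the inner sum
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L, flops
```

For a 5000 × 5000 covariance matrix, that's 25M doubles transferred versus roughly 2 × 10¹⁰ floating-point operations, which is the regime where a GPU kernel pays for itself.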
See also the branch with working code and the corresponding pull request from Steve Bronder. The pull request isn't ready to merge yet, but we're very keen to start GPU-enabling some of our big matrix operations:
We're attacking the parallelization of higher-level algorithms, like running multiple ODE solvers in parallel, using OpenMP for multi-threading and MPI for multi-process. Sebastian Weber has working examples of this that give you nearly embarrassingly parallel speedups. And these ODE problems are some of our harder problems.
Then, underneath the ODE solvers, we can start exploiting GPUs again, though these tend not to be great GPU candidates because they involve repeated application of relatively small matrix operations.
The trick in all of this is getting the automatic differentiation synced up. I think we can do this all asynchronously with our autodiff design and only have to block for required input values (in other words, very much the same kind of high-level design TensorFlow uses, as it's the obvious way to cut a differentiable function apart; you see this going back decades in the autodiff literature).
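A toy tape-based reverse-mode sketch (illustrative only, not Stan's C++ implementation) shows why this decomposition is natural: each node records only its parents and local partials, so subexpressions with disjoint inputs could be evaluated asynchronously, blocking only where one node needs another's value, and the reverse sweep then accumulates adjoints over the recorded graph.

```python
class Var:
    """Minimal reverse-mode autodiff node: records parents and the
    local partial derivative with respect to each parent."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_var, local_partial)
        self.adjoint = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def grad(output, inputs):
    """Reverse sweep: topologically order the graph, then push the
    output adjoint back through each node's local partials."""
    order, seen = [], set()
    def topo(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                topo(p)
            order.append(v)
    topo(output)
    output.adjoint = 1.0
    for v in reversed(order):
        for p, partial in v.parents:
            p.adjoint += partial * v.adjoint
    return [x.adjoint for x in inputs]
```

For example, with f(x, y) = x*y + x the sweep recovers df/dx = y + 1 and df/dy = x; nothing in the forward construction requires the subgraphs to be built synchronously, only that a node's input values exist before it is evaluated.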