This weekend I’d like to reach out to the email address at the end of that presentation and ask about the status of 2.0 moving out of beta.
I’m slowly reading through the OpenCL spec and code to see how they do this, or at least how they specify it should be done. Hand-waving a guess: if you can have one piece of virtual memory pointing to two different pieces of physical memory, then I think what they are describing is not impossible.
We are 100% on the same page; moving to 2.0 is not on our plate right now. We are only talking about investigating this for the future.
Here is a link to Dan’s comment in the input-checks thread about how Stan had issues with MatrixBase in the past. In the linked Google discussion, though, Ben made it sound like we could get around the coefficient-access issue Dan mentioned.
Intel has this “Visual Compute Accelerator,” which is an integrated CPU+GPU card with 3 Xeon processors and a P580 GPU. The GPU is not crazy powerful like a Volta, but since the CPU and GPU are integrated you don’t pay a transfer cost, which is rad.
The GPUs are not that powerful, but the last time we did a big profiling pass on the GPU code, I think Rok found that 70% of the time was spent doing memory transfers. If the CPU+GPU combo can dramatically reduce those transfer costs, then I think we would end up with a pretty decent speedup. Back-of-the-envelope math: if the GPU process took 100ms, and we reduce memory transfer costs by 95% while increasing the algorithm cost by 15%, then 70ms of transfer drops to 3.5ms and 30ms of compute grows to 34.5ms, so 100ms becomes 38ms, about a 2.6x speedup relative to the current version. The current GPU code is about 8x faster than the CPU code, so the GPU version with SVM would be roughly 21x faster than the CPU version for 5K matrices.
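That back-of-the-envelope math can be sketched as a toy model (the function and its defaults are just the made-up assumptions from above: 70% of time in transfers, a 95% transfer cut, a 15% compute overhead, and the current 8x GPU-over-CPU factor):

```python
def projected_speedup(total_ms=100.0, transfer_frac=0.70,
                      transfer_cut=0.95, compute_overhead=0.15):
    """Projected speedup over the current GPU code if transfer time
    shrinks by `transfer_cut` and compute time grows by `compute_overhead`."""
    transfer = total_ms * transfer_frac * (1.0 - transfer_cut)      # 70ms -> 3.5ms
    compute = total_ms * (1.0 - transfer_frac) * (1.0 + compute_overhead)  # 30ms -> 34.5ms
    return total_ms / (transfer + compute)                          # 100ms / 38ms

svm_vs_gpu = projected_speedup()   # ~2.6x over the current GPU code
svm_vs_cpu = svm_vs_gpu * 8.0      # current GPU code is ~8x over the CPU code
print(round(svm_vs_gpu, 1), round(svm_vs_cpu, 1))
```

Plugging in the numbers from the paragraph above gives about 2.6x over the current GPU code and about 21x over the CPU code.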
All those numbers are made up, though, so we would need to get our hands on one of these things to see what sort of speedups we would actually get.