That would mean adding a ViennaCL dependency, at least for the time being. How does the Stan library plus ViennaCL get built? I’m OK with expecting users to install OpenCL on their own. It’d be nice if we had a diagnostic they could use to check that install.
I see two immediate options that would make sense:
add a _gpu version of the function rather than branching in the regular function (or add a flag as a second argument that defaults to false)
put a runtime test right up front in the function that if the size is less than 100 x 100 to call the Eigen version
The second approach doesn’t pollute the language and puts the burden on the math lib, not Stan, so it seems a lot simpler.
Are you thinking that Stan + ViennaCL would get built differently from how it’s currently built? I’m not sure I see why it would need to change much, outside of adding the appropriate include directives, etc.
I was arguing against both of those options, above, in favor of a flag to the Stan compiler. But if we’re happy with a simple runtime check then that does seem easier, assuming we can also figure out if OpenCL is installed correctly. This will make it a bit harder for users to debug if they were expecting GPU support but not getting it.
When you say “Stan compiler,” do you mean the C++ compiler?
I think Bob’s first suggestion (adding a _gpu version) makes it the easiest on both the developer and the user.
The second suggestion is what I was originally thinking too, but I realized that size may depend on the actual hardware available. There’s probably a good cutoff somewhere, but right now, I think it’s fair to let the user choose.
I meant whatever stanc and the interfaces call to transpile a model.stan to a C++ file, but that flag would likely just insert a #define directive for the C++ compiler.
This seems reasonable, but leads to dirtying up the language with things that are supposed to be black-box from a language perspective. It also makes it a little less clean for us to figure out when models should be linked against OpenCL and ViennaCL sources included at compile time, though we could just default to that from now on.
The 2nd one solves that last issue by just always requiring them if OpenCL is available, which might increase compile times a bit(?). The penalty for using a GPU for small matrices is minimal given the pre-reqs for finding ourselves in that position - that someone has OpenCL installed correctly, that they have a GPU capable of double precision math, and that they’re using Cholesky decompose in the first place. The double-precision pre-req rules out most (all?) laptops right away, so someone in this position has already decided to move this computation to a more powerful server or desktop computer. Agreed that you could still end up with some users who experience a doubling in sampling time for problems where they’re using cholesky_decompose for small matrices.
I think an explicit flag gives users just enough choice here, leaves implementation details out of the language, and allows us to add headers and linked libraries only when requested to. Do you guys think this would be difficult to implement? I’m imagining it being fairly simple (though not as simple as the above two options assuming we just always include ViennaCL now).
If the case where the user has to do the cholesky of a smaller matrix and a larger matrix in the same stan program is an edge case of little importance, then I lean towards having one function. As @seantalts said, this does help maintain the ‘black-box’-ness of stan.
If you would like me to continue being the front(ish) man for this than I would be more than happy to! However, it does seem that at this point a lot of this comes down to design choices and how we compile / link things. This is a bit out of my realm, but something I need more experience in doing so I would be happy to do it.
So here’s the question, are there other parts of stan that the team would like to have a gpu implementation of? If yes, then it makes sense for me to continue with @seantalts holding my hand a bit. Then for the next thing I’ll need less hand holding, etc etc.
I suppose a lot of this also hinges on what Rok is doing with his implementation. @rok_cesnovar would you have time to chat sometime this week or next?
Also, apologies for being MIA for a bit and thank you @seantalts for cleaning up the GPU PR. Graduation went off without a hickup and the R in finance conference went well!
Regarding the language, @Bob_Carpenter has final say. My opinion is to have a _gpu function. I’m thinking about it this way:
I’m thinking about the language being coherent beyond just this feature. If you’ve been following @wds15’s work on MPI and parallelizing ODEs across patients, we’ve been discussing adding a parallel for construct. I don’t think it’s going to be automatic in that case to infer when and when we can’t use parallelism, so the user will have to specify it. I kind of see this as a similar burden on the user.
Our advanced users have been wanting more control, not more abstraction.
If it automagically compiles, we now have the burden to tell the user when the GPU is being used (at the time it runs), and when it doesn’t.
The build process is going to be easier (more below on that).
Going the other way changes the paradigm of how we generate code. Currently, the code generation just happens and there aren’t many flags to turn features on / off at the C++ level. This would be a shift. Not necessarily bad, just a shift.
If the overall goal is to have a single log probability evaluation run on a GPU without the user having to specify it, then I’d change my stance and I’d be happy if we just specified the operation in the language and the implementation is chosen based on compiler flags. If we don’t think that’s a realistic goal and we’re only going to have a few functions that can run on a GPU or MPI or whatever, then my opinion would be to have new implementations of those functions that are specific.
Now that I think about it, I don’t see a reason why we need to generate different code even if you wanted to switch. All we would have to do is define a C directive at the compilation step. For example, we could add to the CFLAGS something like -D STAN_GPU. That’s equivalent to #define STAN_GPU from within a header file (https://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/Preprocessor-Options.html). I don’t think it’ll be hard, but I think the build process should be handled out at the interface (CmdStan) level, not at math or Stan.
Here’s what happens with automatically selecting GPU / CPU version:
With a CPU only compiler. Program has cholesky_decompose().
Compiles with the CPU compiler. I think we really ought to tell the user when running that it’s only using the CPU and not the GPU even though it can use the GPU because some users will think it’s running on the GPU.
GPU enabled compiler. Program has cholesky_decompose().
Compiles with the GPU enabled compiler. If the link library isn’t there, it’ll fail. If it’s there, it’ll run. I think we should tell the user that it’s running with GPUs because there will be plenty of ways for the user to have messed this up.
If we had the _gpu() function, then we’re looking at these four conditions:
With a CPU only compiler. Program has cholesky_decompose().
Compiles with CPU compiler. No problems.
With a CPU only compiler. Program has cholesky_decompose_gpu().
Can’t compile the code. The user is told right away that they’re not getting the GPU.
With a GPU enabled compiler. Program has cholesky_decompose().
Compiles with the GPU enabled compiler with the linker arguments. No issues – it just runs the CPU version.
With a GPU enabled compiler. Program has cholesky_decompose_gpu().
Compiles with the GPU enabled compiler with the linker arguments. No issues – it runs the GPU version.
That’s easy enough with CmdStan, but I think it gets much trickier with RStan and PyStan. And failing early here is useful, I think.
If we want this to also get to the user (under how we’re distributing code now), we also have to think about size of distribution.
RStan and PyStan both trim the libraries. We’d want to make sure we can compile without the ViennaCL includes in the math library, if possible. That’s how we deal with Sundials. (Maybe we can rethink that too, but that’s for a different topic.)
Sure, we can chat whenever you have time. We can do it Friday or anytime next week. Keep in mind that I live in eastern time -6 :)
I am playing around with a few different implementations I have made, I will probably have time tomorrow to put everything together and post it here. It looks promising !
I had some issues with ViennaCL, but mostly due to our misunderstanding of their API. For instance the code
vcl_L = viennacl::trans(vcl_L) is problematic as their function is not meant for inplace transposing. This creates a race condition. It seems to execute well on low-end GPUs where the number of threads that actually execute in parallel is small, the problem escalated when I tried the GTX1070.
The same problem occurs on the matrix prod and with your fix of my custom kernel. I made it inplace because it is not a bottleneck. Your code is faster but it should not be inplace.
Will post results of my different attempts tomorrow.
This is a good point. This might be a bit niggly, but I think what people are specifying is that the operation is safe to perform in parallel; i.e. input is required from the user to know that the code does not have have data interdependencies between iterations. In an ideal world, I think the compiler would figure this out, but this is a very tricky problem and compilers need to be conservative in the face of uncertainty.
In the case of GPU functions, there in no correctness at stake - just some differences in performance.
Advanced users always want the abstraction to leak :P Our job is to navigate that tradeoff and pick something we think makes sense for most of our users.
I think a flag deals with 3, and for 5 I think we could do something like you mentioned below, adding -D STAN_GPU. A flag also makes the error handling of the API simpler, equivalent to the _gpu() case you write about. For whatever that’s worth. Agreed that failing early is pretty much always preferred.
@Bob_Carpenter, what do you think about this _gpu() thing vs some kind of -D STAN_GPU compiler flag? My perspective seems to lean relatively heavily towards simple language+complicated compiler compared with Dan and maybe with you as well, but it’s up to you.
Sorry, discourse somehow totally omitted your reply from ~20 days ago when I replied to Dan yesterday. It only appeared this morning, and then between my reply and Dan’s. Pretty weird. Anyway, I was mostly saying I’d be willing to take over just in case you had started work and found that to be time consuming (something that would be totally understandable!). I’m happy to be your point person and help guide you through the remaining issues, though as you can see some of the issues require some community discussion and decision-making.
I think we’d love to get GPU support integrated as much as we can while retaining existing performance and behavior for users who don’t have GPUs. Though it sounds like from a project management perspective it’s probably good to talk to Rok and see if their plans will totally eclipse this at some point within a short timeframe, and if you still want to put the effort into code that might get replaced at some point. I think certainly some of our users would find incremental advances in performance fairly valuable.
I liked the solution I posted to another thread on this somewhere, which involves
top-level compiler flag to turn on GPUs
branch inside each GPU-enabled function sending it to the regular version if it’s below the speedup size threshold (e.g., 100 or whatever for Cholesky, but we may want that a bit higher to default to not hammering the GPU for breakeven).
Then the language doesn’t change. I’m also OK with the _gpu solution for a second version which explicitly calls GPUs, but I lean toward putting the emphasis on our automatically figuring out what to do for users when we can. We can’t do that with ODE solvers, so we have the control params, but that’s only because we don’t know otherwise how to solve them and there may be very well different ODEs with different scales in the same model.
Awesome. I think we’re decided then. We’ll have a top-level compiler option at the interface level. This will conditionally bring in the GPU code. We’ll need to figure out how to package up the GPU code, but that should be fairly easy.
So it looks like the existing code will basically work (perhaps add back in the dynamic size() > 500 check) - Looks like the next thing to do is to wrap all of the GPU specific stuff (especially including ViennaCL) in #ifdef STAN_GPU or whatever, and then to figure out the right place to insert that in CmdStan and the interfaces. @Bob_Carpenter for CmdStan do you think we should just have people add the CFLAGS = $(CFLAGS) -D STAN_GPU to their make/local file? Anyone know how the interfaces would like to interface with this, maybe @bgoodri?
I’d be inclined to add a makefile variable. Something like STAN_GPU=true.
The reason I’m suggesting that is that once that variable is not empty, we now know that we need to add -D STAN_GPU to CFLAGS, but we’ll also need to add -isystem <gpu library> and we’ll have to add a to the link library as well. The build process is more involved than just adding the C preprocessor directive.
That seems fine for CmdStan, which is built mostly with makefiles, but I don’t know how RStan / PyStan would work with that? I suppose they also use makefiles under the hood somewhere, right? If that all works nicely then it does seem a bit better to do it that way in case adding the include files and link library incurs overhead.
@bgoodri Does RStan have a makefile somewhere that’s used to compile models? Trying to scope out Daniel’s idea for a makefile variable instead of -D STAN_GPU for RStan (to handle adding the clang flag that includes -l OpenCL and ViennaCL).