When I tried to compile my model it gave a bunch of errors like the following (even though OpenCL is included):
g++ -std=c++1y -pthread -Wno-sign-compare -I stan/lib/stan_math/lib/opencl_1.2.8 -O3 -I src -I stan/src -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.3 -I stan/lib/stan_math/lib/boost_1.69.0 -I stan/lib/stan_math/lib/sundials_4.1.0/include -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -DSTAN_OPENCL -DOPENCL_DEVICE_ID=0 -DOPENCL_PLATFORM_ID=0 -DCL_USE_DEPRECATED_OPENCL_1_2_APIS -D__CL_ENABLE_EXCEPTIONS -Wno-ignored-attributes -lOpenCL src/cmdstan/main.o stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_nvecserial.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_cvodes.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_idas.a /home/eval/lmockus/cmdstan-2.20.0.gpu/cliff/cliff.o -o /home/eval/lmockus/cmdstan-2.20.0.gpu/cliff/cliff
src/cmdstan/main.o: In function `cl::detail::getPlatformVersion(_cl_platform_id*)':
main.cpp:(.text+0x38): undefined reference to `clGetPlatformInfo'
main.cpp:(.text+0x63): undefined reference to `clGetPlatformInfo'
That means the linker can't find the OpenCL library to link against. Are you using Windows or Linux? On Linux this should work if you installed the driver normally. On Windows you need to set a flag.
Windows flag: LDFLAGS_OPENCL= -L"$(CUDA_PATH)\lib\x64" -lOpenCL
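On Windows that flag typically goes into make/local in the CmdStan directory alongside the OpenCL switch. A minimal sketch, assuming a CUDA SDK provides the OpenCL library (adjust the path for your vendor's SDK):

```makefile
# make/local — illustrative example; the library path depends on your SDK
STAN_OPENCL=true
LDFLAGS_OPENCL= -L"$(CUDA_PATH)\lib\x64" -lOpenCL
```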
I keep forgetting that we had a bug in 2.20 that we fixed a day or two after the release, but there was no hotfix release. The next release is coming in 9 days.
For the time being I would recommend cloning the latest develop branch (git clone --single-branch https://github.com/stan-dev/cmdstan.git --recursive). That one does require git, unfortunately.
Can you share anything about the model you are trying to speed up? Thanks.
The model is in cliff.stan (3.3 KB)
It is a time series model with a neural network instead of an AR(1) term. It runs very slowly but uses matrix multiplication, so I thought a GPU might speed it up. I am also thinking about adding threading in order to use all available cores.
There are quite a few matrix multiplications in here, so you should see some speedup, depending on the sizes. At the moment I think you would benefit more from threading, given my quick inspection of the model, unless the matrix multiplications are 200x200 times 200x200 or larger.
Actually they are 200x10 matrices. Refactoring into map_rect form is a bit more complicated - the model is big enough already. The design matrix (X.) is different for each year and the calculations have to be done year by year. Perhaps each shard should contain the data for one year? I am just thinking out loud. That means each shard would have an unequal number of data points. Hopefully doable… The problem I am encountering is "trace ran beyond…", which kills the sampler - I posted about it a few days ago - so when it is resolved I will proceed with multithreading.
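For the unequal shard sizes, the usual trick with map_rect is to pad every shard's data out to the length of the longest year and pass the true count through the integer array. A minimal sketch of a shard function, with hypothetical names (shard_ll, the theta layout, and the likelihood are illustrative, not taken from cliff.stan):

```stan
functions {
  // One shard = one year. x_r is padded to the maximum year length;
  // x_i[1] holds the actual number of observations for this year.
  vector shard_ll(vector phi, vector theta, real[] x_r, int[] x_i) {
    int n = x_i[1];
    real lp = 0;
    for (i in 1:n)
      lp += normal_lpdf(x_r[i] | theta[1], exp(theta[2]));
    return [lp]';
  }
}
```

The padding entries beyond n are simply never touched, so the shards stay rectangular for map_rect while each year keeps its own effective length.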