OpenCL: WORK_PER_THREAD = 0

@rok_cesnovar

Maybe you can shed some light on this. I’m not too familiar with the OpenCL codebase, and I am trying to fix some failing tests in the monorepo. I have the following failing test in /rev/mat/fun/cholesky_decompose_test.cpp: AgradRevMatrix, mat_cholesky_1st_deriv_large_gradients. Single stepping through the code, the problem arises inside tri_inverse.hpp, when calling opencl_kernels::inv_lower_tri_multiply.make_functor.get_opts(). WORK_PER_THREAD somehow has a value of 0, and when calculating block sizes later on causes a divide by zero error. Also, a 0-valued THREAD_BLOCK_SIZE entry is also created.

I have taken a screenshot of my debugger: https://imgur.com/a/UnQAFlU. In it, you will see that the options map has two duplicate definitions of “WORK_PER_THREAD”. I don’t even know how that is possible. Is this possibly a Google Test threading issue / have you ever seen it before?

Well, I do know why this happens. get_opts is getting by pointer to “WORK_PER_THREAD”, and you have two separate pointers stored in the map. (I am guessing you instead mean to store unique strings in the map, not unique pointers to a char array). However, I can’t trace where the multiple insertion is happening.

1 Like

Yeah, your guess is right, the intention was to store unique strings in the map. Will fix with something like this asap https://stackoverflow.com/questions/4157687/using-char-as-a-key-in-stdmap
Thanks for the find!

One insertion happens in the context https://github.com/stan-dev/math/blob/develop/stan/math/opencl/opencl_context.hpp and the happens when you call the functor for each kernel that needs it https://github.com/stan-dev/math/blob/develop/stan/math/opencl/kernels/inv_lower_tri_multiply.hpp

The idea is that you supply an optimal value to the functor, but if there are some device restrictions that dont allow that to work, we backfall to default values that work an all devices. What does stan::math::opencl_context.max_thread_block_size() return, even with this pointer bug WORK_PER_THREAD should not be zero?

What compiler are you using?

I put a printf here like

  int work_per_thread
      = opencl_kernels::inv_lower_tri_multiply.make_functor.get_opts().at(
          "WORK_PER_THREAD");
  printf("Beginning of Map\n\n");
  for (auto& it : opencl_kernels::inv_lower_tri_multiply.make_functor.get_opts()) {
      printf("%s: %d \n", it.first, it.second);
    }
  printf("End of Map\n\n");

and it’s printing

Beginning of Map

ENTIRE: 2 
UPPER_TO_LOWER: 0 
LOWER: 0 
LOWER_TO_UPPER: 1 
UPPER: 1 
THREAD_BLOCK_SIZE: 32 
WORK_PER_THREAD: 8 
LOCAL_SIZE_: 64 
REDUCTION_STEP_SIZE: 4 
End of Map

for

./runTests.py test/unit/math/opencl/tri_inverse_test.cpp

Does what you are seeing only happen in the cholesky rev test?

Nonetheless I agree with your analysis. It looks like when we initialize the kernel functor the pointer to the char is being compared and not the non-null elements in the char array.

I made a branch here that updates the map to use a string instead of a char*

Also yes thank you for the find!

@alashworth can you check out that branch to see if that fixes the bug?

stan::math::opencl_context.max_thread_block_size() is 1024.
This is Clang 8.0.1 x86_64-w64-windows-gnu.

Great, sounds like a fix is incoming. I actually haven’t seen if it appears for other tests besides cholesky-rev - I’m trying to finish up my monorepo conversion ASAP and put it up for comment this week. Need to make sure any bugs in the code belong to Stan and not my build configuration files, haha.

1 Like

Sorry, I can’t easily pull branches without rebasing my entire repository. I am building against https://github.com/alashworth/stan-monorepo.