openMP for probability distributions

As I mentioned this morning, I have a WIP branch

Most of the affected files are under stan/math/prim/scal/prob but there are a few under stan/math/prim/mat/prob . The first commit just adds

#ifdef _OPENMP
  #include <omp.h>
#endif

to a bunch of .hpp files, so essentially nothing happens (there is a compiler warning, which we could suppress) if you do not have

CXXFLAGS += -fopenmp`

in your make/local file, which causes _OPENMP to be defined using every modern compiler except Xcode’s clang and Solaris’ g++. You can enable it on a Mac with a genuine clang, such as
http://thecoatlessprofessor.com/programming/openmp-in-r-on-os-x/
To see if it works, do

./runTests.py test/unit/openMP_test.cpp

The remaining commits most put things like

#pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
  default(none) shared(something, else)

on the line before a for loop that does not early return. The if clause causes it to run serially when the condition fails. The default(none) thing tells it to not make any symbol available inside the loop, except what is specifically listed in shared(...).

Some pragmas also have things like

reduction(+ : logp)
reduction(* : cdf)

which tells openMP that it should make a local copy of logp for each thread and combine them when the loop is finished using the + operator, which seems to be working except in mix tests.

To reproduce the segfault, it is sufficient to do

./runTests.py test/unit/math/mix/scal/prob/normal_test.cpp

after eliminating the comment characters on lines 90 and 94 of stan/math/prim/scal/prob/normal_lpdf.hpp . The backtrace is

[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from ProbAgradDistributionsNormal
[ RUN      ] ProbAgradDistributionsNormal.derivatives
[       OK ] ProbAgradDistributionsNormal.derivatives (1 ms)
[ RUN      ] ProbAgradDistributionsNormal.FvarVar_1stDeriv

Program received signal SIGSEGV, Segmentation fault.
0x000000000040a4d8 in stan::math::(anonymous namespace)::add_vd_vari::add_vd_vari(stan::math::vari*, double) ()
(gdb) bt
#0  0x000000000040a4d8 in stan::math::(anonymous namespace)::add_vd_vari::add_vd_vari(stan::math::vari*, double) ()
#1  0x000000000042ada4 in stan::math::var::operator+=(double) ()
#2  0x000000000040a162 in .omp_outlined. ()
#3  0x000000000042aa05 in stan::return_type<stan::math::fvar<stan::math::var>, double, double, double, double, double>::type stan::math::normal_lpdf<false, stan::math::fvar<stan::math::var>, double, double>(stan::math::fvar<stan::math::var> const&, double const&, double const&) ()
#4  0x000000000042a789 in stan::return_type<stan::math::fvar<stan::math::var>, double, double, double, double, double>::type stan::math::normal_lpdf<stan::math::fvar<stan::math::var>, double, double>(stan::math::fvar<stan::math::var> const&, double const&, double const&) ()
#5  0x000000000040bac9 in stan::return_type<stan::math::fvar<stan::math::var>, double, double, double, double, double>::type stan::math::normal_log<stan::math::fvar<stan::math::var>, double, double>(stan::math::fvar<stan::math::var> const&, double const&, double const&) ()
#6  0x00000000004082a5 in ProbAgradDistributionsNormal_FvarVar_1stDeriv_Test::TestBody() ()
#7  0x0000000000444262 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#8  0x0000000000432e79 in testing::Test::Run() ()
#9  0x00000000004339b5 in testing::TestInfo::Run() ()
#10 0x0000000000433f14 in testing::TestCase::Run() ()
#11 0x000000000043946d in testing::internal::UnitTestImpl::RunAllTests() ()
#12 0x0000000000444abf in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#13 0x0000000000439190 in testing::UnitTest::Run() ()
#14 0x000000000042c398 in main ()

My understanding is that fvar-var case that logp should be a fvar<var> and it should use the + operator from stan/math/fwd/core/operator_addition.hpp, which it manages not to do even though that header is included (which can be seen by adding -H to CXXFLAGS in the make/local file).

Otherwise, things seem to be fine in the sense that all of test/unit passes with or without openMP enabled, but I can’t really tell because test/prob will segfault for the same reasons. But there are a couple of changes to be aware where there could be mistakes.

  1. Many of the _lccdf and _lcdf functions check for limiting values in two different loops and early return. Since openMP does not allow early returns (although continue is apparently fine), I often had to move the second check into the same loop as the first check (which is executed serially) so that the second loop could be run in parallel.
  2. In a few cases, we were looping twice to create two VectorBuilders even though the template parameters being passed to them were the same in both cases.
  3. The stan/math/prim/scal/prob/beta_binomial_lcdf.hpp file has one openMP pragma commented out because a test fails with it in due to something with F32 so maybe Stan Math is not totally thread-safe.

There are probably other cases where it is safe (no autodiff, no exceptions, no early return, etc.) to use openMP to parallelize things, but I figured it would be best to start with the probability distributions.

1 Like

This sort of error screams that the template instantiation is being done in the wrong order. That will cause the wrong template instantiation to be used because it’s instantiated before the one you want is instantiated. I want to either confirm that or rule it out.

Can you post the exact compiler invocation used? You should be able to do:
> make -n test/unit/math/mix/scal/prob/normal

I don’t know why and how you’re using the -H option. If I’m not mistaken, that controls the preprocessor and injecting headers… we really shouldn’t need to do that. And if you’re doing that, then it might be causing this error. But… let’s figure out if that’s right first.

Do any other tests work?

make -B test/unit/math/mix/scal/prob/normal_test will produce a binary (with clang) that segfaults. With g++ it actually won’t even compile,

goodrich@T540p:/opt/stan/lib/stan_math_2.7.0$ make -B test/unit/math/mix/scal/prob/normal_test
g++-7 -Wall -I . -isystem lib/eigen_3.3.3 -isystem lib/boost_1.64.0 -isystem lib/cvodes_2.9.0/include -std=c++1y -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -Wno-unused-function -Wno-uninitialized -fopenmp -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -Og -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -Og -DNO_FPRINTF_OUTPUT -pipe  -c -o test/unit/math/mix/scal/prob/normal_test.o test/unit/math/mix/scal/prob/normal_test.cpp
g++-7 -Wall -I . -isystem lib/eigen_3.3.3 -isystem lib/boost_1.64.0 -isystem lib/cvodes_2.9.0/include -std=c++1y -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -Wno-unused-function -Wno-uninitialized -fopenmp -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -Og -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -Og -DNO_FPRINTF_OUTPUT -pipe  -c -o lib/gtest_1.7.0/src/gtest-all.o lib/gtest_1.7.0/src/gtest-all.cc
In file included from ./stan/math/prim/scal/prob/normal_log.hpp:5:0,
                 from ./stan/math/prim/scal.hpp:335,
                 from ./stan/math/mix/scal.hpp:18,
                 from test/unit/math/mix/scal/prob/normal_test.cpp:1:
./stan/math/prim/scal/prob/normal_lpdf.hpp: In instantiation of ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = stan::math::var; T_loc = stan::math::var; T_scale = stan::math::var; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::var]’:
./stan/math/prim/scal/prob/normal_log.hpp:34:50:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_log(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = stan::math::var; T_loc = stan::math::var; T_scale = stan::math::var; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::var]’
test/unit/math/mix/scal/prob/normal_test.cpp:18:56:   required from here
./stan/math/prim/scal/prob/normal_lpdf.hpp:82:46: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (length(sigma) > 3 * omp_get_max_threads()) \
                                ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
                                ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In instantiation of ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = int; T_loc = int; T_scale = stan::math::fvar<double>; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<double>]’:
./stan/math/prim/scal/prob/normal_log.hpp:34:50:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_log(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = int; T_loc = int; T_scale = stan::math::fvar<double>; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<double>]’
test/unit/math/mix/scal/prob/normal_test.cpp:30:65:   required from here
./stan/math/prim/scal/prob/normal_lpdf.hpp:82:46: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (length(sigma) > 3 * omp_get_max_threads()) \
                                ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
                                ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In instantiation of ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = stan::math::fvar<stan::math::fvar<double> >; T_loc = int; T_scale = int; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::fvar<double> >]’:
./stan/math/prim/scal/prob/normal_lpdf.hpp:126:28:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with T_y = stan::math::fvar<stan::math::fvar<double> >; T_loc = int; T_scale = int; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::fvar<double> >]’
./stan/math/prim/scal/prob/normal_log.hpp:43:42:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_log(const T_y&, const T_loc&, const T_scale&) [with T_y = stan::math::fvar<stan::math::fvar<double> >; T_loc = int; T_scale = int; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::fvar<double> >]’
test/unit/math/mix/scal/prob/normal_test.cpp:35:3:   required from here
./stan/math/prim/scal/prob/normal_lpdf.hpp:82:46: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (length(sigma) > 3 * omp_get_max_threads()) \
                                ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
                                ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:11: error: user defined reduction not found for ‘logp’
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
           ^~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In instantiation of ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = stan::math::fvar<double>; T_loc = int; T_scale = int; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<double>]’:
./stan/math/prim/scal/prob/normal_lpdf.hpp:126:28:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with T_y = stan::math::fvar<double>; T_loc = int; T_scale = int; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<double>]’
./stan/math/prim/scal/prob/normal_log.hpp:43:42:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_log(const T_y&, const T_loc&, const T_scale&) [with T_y = stan::math::fvar<double>; T_loc = int; T_scale = int; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<double>]’
test/unit/math/mix/scal/prob/normal_test.cpp:36:3:   required from here
./stan/math/prim/scal/prob/normal_lpdf.hpp:82:46: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (length(sigma) > 3 * omp_get_max_threads()) \
                                ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
                                ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In instantiation of ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = stan::math::fvar<stan::math::var>; T_loc = double; T_scale = double; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::var>]’:
./stan/math/prim/scal/prob/normal_lpdf.hpp:126:28:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with T_y = stan::math::fvar<stan::math::var>; T_loc = double; T_scale = double; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::var>]’
./stan/math/prim/scal/prob/normal_log.hpp:43:42:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_log(const T_y&, const T_loc&, const T_scale&) [with T_y = stan::math::fvar<stan::math::var>; T_loc = double; T_scale = double; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::var>]’
test/unit/math/mix/scal/prob/normal_test.cpp:49:44:   required from here
./stan/math/prim/scal/prob/normal_lpdf.hpp:82:46: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (length(sigma) > 3 * omp_get_max_threads()) \
                                ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
                                ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:11: error: user defined reduction not found for ‘logp’
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
           ^~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In instantiation of ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = double; T_loc = stan::math::fvar<stan::math::var>; T_scale = double; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::var>]’:
./stan/math/prim/scal/prob/normal_lpdf.hpp:126:28:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with T_y = double; T_loc = stan::math::fvar<stan::math::var>; T_scale = double; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::var>]’
./stan/math/prim/scal/prob/normal_log.hpp:43:42:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_log(const T_y&, const T_loc&, const T_scale&) [with T_y = double; T_loc = stan::math::fvar<stan::math::var>; T_scale = double; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::var>]’
test/unit/math/mix/scal/prob/normal_test.cpp:65:44:   required from here
./stan/math/prim/scal/prob/normal_lpdf.hpp:82:46: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (length(sigma) > 3 * omp_get_max_threads()) \
                                ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
                                ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:11: error: user defined reduction not found for ‘logp’
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
           ^~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In instantiation of ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = double; T_loc = double; T_scale = stan::math::fvar<stan::math::var>; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::var>]’:
./stan/math/prim/scal/prob/normal_lpdf.hpp:126:28:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with T_y = double; T_loc = double; T_scale = stan::math::fvar<stan::math::var>; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::var>]’
./stan/math/prim/scal/prob/normal_log.hpp:43:42:   required from ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_log(const T_y&, const T_loc&, const T_scale&) [with T_y = double; T_loc = double; T_scale = stan::math::fvar<stan::math::var>; typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type = stan::math::fvar<stan::math::var>]’
test/unit/math/mix/scal/prob/normal_test.cpp:80:44:   required from here
./stan/math/prim/scal/prob/normal_lpdf.hpp:82:46: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (length(sigma) > 3 * omp_get_max_threads()) \
                                ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
                                ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:11: error: user defined reduction not found for ‘logp’
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
           ^~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In function ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = stan::math::fvar<stan::math::fvar<double> >; T_loc = int; T_scale = int]’:
./stan/math/prim/scal/prob/normal_lpdf.hpp:105:12: error: ‘logp’ not specified in enclosing ‘parallel’
       logp += NEG_LOG_SQRT_TWO_PI;
       ~~~~~^~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:11: error: enclosing ‘parallel’
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
           ^~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In function ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = stan::math::fvar<stan::math::var>; T_loc = double; T_scale = double]’:
./stan/math/prim/scal/prob/normal_lpdf.hpp:105:12: error: ‘logp’ not specified in enclosing ‘parallel’
       logp += NEG_LOG_SQRT_TWO_PI;
       ~~~~~^~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:11: error: enclosing ‘parallel’
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
           ^~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In function ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = double; T_loc = stan::math::fvar<stan::math::var>; T_scale = double]’:
./stan/math/prim/scal/prob/normal_lpdf.hpp:105:12: error: ‘logp’ not specified in enclosing ‘parallel’
       logp += NEG_LOG_SQRT_TWO_PI;
       ~~~~~^~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:11: error: enclosing ‘parallel’
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
           ^~~
./stan/math/prim/scal/prob/normal_lpdf.hpp: In function ‘typename stan::return_type<T_y, T_scale_succ, T_scale_fail>::type stan::math::normal_lpdf(const T_y&, const T_loc&, const T_scale&) [with bool propto = false; T_y = double; T_loc = double; T_scale = stan::math::fvar<stan::math::var>]’:
./stan/math/prim/scal/prob/normal_lpdf.hpp:105:12: error: ‘logp’ not specified in enclosing ‘parallel’
       logp += NEG_LOG_SQRT_TWO_PI;
       ~~~~~^~~~~~~~~~~~~~~~~~~~~~
./stan/math/prim/scal/prob/normal_lpdf.hpp:90:11: error: enclosing ‘parallel’
   #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
           ^~~
<builtin>: recipe for target 'test/unit/math/mix/scal/prob/normal_test.o' failed
make: *** [test/unit/math/mix/scal/prob/normal_test.o] Error 1

I just did -H to verify that the headers were being included; the behavior is the same with or without it. Also, test/unit/math/prim, test/unit/math/rev, and test/unit/math/fwd all pass, so if it is an issue with include ordering it is specific to mix and not triggered by what we are currently doing on the develop branch.

@bgoodri, the line numbers in the error message don’t line up with the branch on github. Mind synching and re-generating that compiler error?

Pushed. The only difference should be now the branch uncomments the pragmas that break normal_lpdf.hpp.

1 Like

Do you have it set up on the linux box? If so, mind if I ssh into that box and try to debug that way?

I think it should be fine. You may need to email me for the IP address because the DNS got broken.

You can also replicate it on Windows with g++.

Short story is that this won’t work for fvar because it’ll induce race conditions in nested reverse-mode autodiff. Longer story follows.

The trace:

indicates it’s crashing on a var += double call. That’ll happen internally when you call the fvar<var> += double somewhere. If these calls to fvar<var> += ... happen in parallel, they’ll introduce race conditions on memory in nested manipulations of var.

Those race conditions are the reason for reduction clauses:

The operations themselves on fvar<var> will call var operations. This all happens before you get to the logp reductions.

I think it’s best to just not bother testing on forward mode for this. That’s what we’re doing for GPUs and MPI — not worrying about forward mode.

I guess we can figure out how to not run the fwd tests, but I still don’t get why this should be a problem. If there is a local logp for each thread that is a fvar<var> and that thread does local operations on it, it should not be racing with another thread’s local logp. And the autodiff expression tree at the end shouldn’t care about what order commutative operations occurred in.

The problem is that they’re all hitting our shared memory allocator for the expression graph which isn’t thread safe. What we need to do to put things back together that are done in parallel in general is what Sebastian’s doing with MPI.

The nice thing about just hitting our distributions is that none of double, var, or fvar<double> or fvar<fvar<double>> hit the autodiff memory—they just do everything as double calculations then use operands and partials. The problem with fvar<var> is that the nested operations on var aren’t thread safe.

So, if I put

#ifndef STAN_MATH_FWD_CORE_HPP
  #pragma omp parallel for if (N > 3 * omp_get_max_threads()) \
    reduction(+ : logp) default(none) shared(...)
#endif

then, it does not segfault, but I think that precludes fvar<double> and fvar<fvar<double> >. How to I only preclude fvar<var>?

You want to preclude the MIX case from being instantiated, but there’s not a simple include.

If it’s called from the outside, you should find STAN_MATH_MIX_MAT_HPP getting included. The top-level includes for clients should be the ones paralleling the above. But it won’t prevent local fiddling.

@bgoodri have u checked out options for openMP which control nesting of parallel for loops. Looks to me as if you want to make OpenMP aware of nesting and as a start simply tell OpenMP to not run any nested stuff… but maybe this type of nesting is too involved for OpenMP to deal with it (so it could even be a compiler issue).

Have you benchmarked this branch on some problems?

I haven’t even gotten all the tests to pass yet, but I’m working on that now trying to avoid using openMP on fvar<var>.

Hi Ben,

As the branch shows an optimization for each different accumulating log density distribution, we are wondering if there is any possibility we can do OpenMP on the generic autodiff process. I thinking it might avoid some extra overhead and make it a little faster. Is it hard because we are using back autodiff for stan?

Thanks

Yes. As I understand it, that would require a big rewrite of the rev/ subdirectory. It is currently only safe to use openMP on doubles within Stan Math, but that includes cases where we pull the double value out of a var and calculate the partial derivatives analytically.

Yes, we don’t know how to do general parallelization of autodiff. We are using MPI for some limited functionality through a map function. It makes a huge difference. The implementation of map deals with putting the pieces back together in the autodiff graph.