The Math Library is not passing tests on Windows. This is the next thing I'm working on.

First off, most of Stan and Math will work on Windows. I’m going to guess Windows users aren’t pushing into the areas where it’s not building, but we should fix it.

Second, this is a call for help. I don’t know how to fix everything on Windows. If you have some time and expertise to help, please do.

Some resources:

Based on the last failure, it looks like there was a segfault somewhere.

If anyone wants to help, please respond here. @rok_cesnovar, thanks for the help so far!

I am offering my help. I propose we start by making a list of tests that are not building on Windows to identify all the issues. I am going to post it on the issue linked above.

Hi, what is not compiling and with what?

  • MSVC 2017
  • MinGW32
  • MinGW-w64
  • Clang (MSVC 2017 libs)
  • Clang (MinGW-w64 libs)
  • Clang (libc++)

Thanks, @rok_cesnovar. Please feel free to update the top-level issue to reflect that. (If you don’t have permissions, I can try to trace through to give you permissions).

@ahartikainen, we’re only testing with one compiler: RTools g++. I don’t know which one that is on your list. And… whoops. I think it’s all building now with @rok_cesnovar’s help; it’s just not passing tests. (I’m going to update the thread title to reflect that.)

https://cran.r-project.org/bin/windows/Rtools/

  • MinGW-w64

Thanks for the clarification.

I’m running all the tests now. I’ll report back with the first failure.

I ran all the tests locally and I can see that /prim, /mix, /fwd, /gpu, and /memory are all fine.

There are issues with two matrix_exp_multiply tests in /rev: scale_matrix_exp_multiply and matrix_exp_multiply.

In the latter, the tests fail (seg fault) for N > 1 here: https://github.com/stan-dev/math/blob/develop/test/unit/math/rev/mat/fun/matrix_exp_multiply_test.cpp#L62

The seg fault occurs here; if I comment out the first .grad() call, it seg faults in the next grad() call two lines below.

I have gotten a bit further, but no solution yet.

The segfault occurs in this call to Eigen::matrix_exp_computeUV::run.

That call is made here, in the matrix_exp function, which is in turn called here from the /rev matrix_exp_multiply function.

I am not really sure what this does. Maybe @yizhang (I think he is the main author of this code; I hope I am not wrong) can give us some insight or ideas on why this could fail on Windows.
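For anyone else wondering what that call does: as I understand Eigen's unsupported MatrixFunctions module, matrix_exp_computeUV::run picks a Padé approximant (its degree chosen from the matrix norm) and a number of squarings, and returns the odd-degree part U and the even-degree part V of that approximant. The caller then solves (V - U) X = (V + U) and squares X that many times. Below is a scalar sketch of the same idea, assuming a fixed degree of 3 and a hand-picked scaling threshold of 0.5; both are my simplifications, not Eigen's actual choices, and compute_uv_pade3 / exp_pade3 are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Scalar analogue of the degree-3 Pade approximant for exp(x).
// U collects the odd-degree terms and V the even-degree terms,
// so exp(x) ~= (V + U) / (V - U).
struct PadeUV {
  double U;  // 60 x + x^3
  double V;  // 120 + 12 x^2
};

PadeUV compute_uv_pade3(double x) {
  const double b0 = 120.0, b1 = 60.0, b2 = 12.0, b3 = 1.0;
  const double x2 = x * x;
  return {x * (b3 * x2 + b1), b2 * x2 + b0};
}

// Scaling and squaring: shrink x by 2^s so the approximant is accurate,
// evaluate it, then square s times, since exp(x) = exp(x / 2^s)^(2^s).
double exp_pade3(double x) {
  assert(std::isfinite(x));  // the halving loop would never end for inf
  int s = 0;
  double scaled = x;
  while (std::fabs(scaled) > 0.5) {
    scaled /= 2.0;
    ++s;
  }
  const PadeUV uv = compute_uv_pade3(scaled);
  // In the matrix case this division is a linear solve with V - U.
  assert(uv.V - uv.U != 0.0);
  double r = (uv.V + uv.U) / (uv.V - uv.U);
  for (int i = 0; i < s; ++i) r *= r;
  return r;
}
```

Comparing exp_pade3(1.0) with std::exp(1.0) is a quick sanity check; the fixed degree-3 approximant is much less accurate than what Eigen actually uses.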

It seems that matrix_exp_computeUV::run is having problems with stan::math::var Eigen matrices on Windows. See https://github.com/stan-dev/math/pull/1046

Let me know if you need help debugging the C++ for the errors you find. We need to get Windows and g++7 sorted out for some math functions.

We are currently waiting for the gtest pull request to go through.

The only other thing left is debugging why matrix_exp_computeUV::run seg faults in the test here.

The “trace”:

The seg fault occurs here

matrix_exp_pade is called here

matrix_exp is called here

If anyone has any ideas, that would be very helpful. I am unable to pinpoint the reason any further. All the other tests build and pass successfully.

I’m digging and seeing some craziness. My best guess at this point is that there’s something that’s using uninitialized memory and for some reason, the Windows run is picking up on it.

The craziness… I’m seeing different autodiff stack sizes on Mac and Windows. I don’t see it failing where you see it, @rok_cesnovar. I’m seeing it in the multiple grad calls with set_zero_adjoint() used to clear it out. I don’t think that’s the right pattern for this type of test, but I’ll have to dig further.

Anyway, the craziness:

  • On Windows. Stacksize with the var x double test: 2821.
  • On Mac, same test: 64.

This is just an update. I’ll need to dig further. @yizhang, mind helping out? Am I missing something?
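To make the multiple-grad pattern concrete, here is a toy tape. It is hypothetical code, not Stan's implementation; set_zero_all_adjoints here is only named after Stan's function of that name, and size() plays the role of the stack size mentioned above. Adjoints accumulate with +=, so a second gradient sweep over the same tape is only correct if every adjoint, including those on intermediate nodes, is zeroed first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One tape entry: value, adjoint, up to two parents, and the local
// partial derivatives d(this)/d(parent).
struct Node {
  double val;
  double adj;
  int p1, p2;  // parent indices, -1 if absent
  double d1, d2;
};

struct Tape {
  std::vector<Node> nodes;

  int make_var(double v) {
    nodes.push_back({v, 0.0, -1, -1, 0.0, 0.0});
    return static_cast<int>(nodes.size()) - 1;
  }
  int mul(int a, int b) {
    nodes.push_back({nodes[a].val * nodes[b].val, 0.0, a, b,
                     nodes[b].val, nodes[a].val});
    return static_cast<int>(nodes.size()) - 1;
  }
  int add(int a, int b) {
    nodes.push_back({nodes[a].val + nodes[b].val, 0.0, a, b, 1.0, 1.0});
    return static_cast<int>(nodes.size()) - 1;
  }
  // Reverse sweep: seed the output adjoint with 1 and walk the tape
  // backwards. Any stale adjoint left on an intermediate node by an
  // earlier sweep is propagated again.
  void grad(int out) {
    nodes[out].adj = 1.0;
    for (int i = out; i >= 0; --i) {
      const Node& n = nodes[i];
      if (n.p1 >= 0) nodes[n.p1].adj += n.adj * n.d1;
      if (n.p2 >= 0) nodes[n.p2].adj += n.adj * n.d2;
    }
  }
  void set_zero_all_adjoints() {
    for (Node& n : nodes) n.adj = 0.0;
  }
  std::size_t size() const { return nodes.size(); }
};
```

With x = 2 and y = 3, calling grad on f = x * y and then on g = x + y without zeroing leaves 7 in x's adjoint instead of 1, because f's leftover adjoint is pushed back into x and y a second time.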

Let me take a look.

Yikes. It’s hard to imagine what could go wrong on Windows with stack size. These stack-size tests have proven to be super useful, so thanks for including them in the first place!

Quick update. I’ve been banging my head on this on and off. I think there’s some unsafe memory handling in the matrix_exp() function. I’ve gotten seg faults on my Mac too. This is something we need to fix.

@Bob_Carpenter has suggested using the adjoint vector product to deal with the memory. I think that’s a great idea.
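I read the adjoint-vector-product suggestion as: write the reverse pass directly in terms of double values, so the node only keeps plain copies of what it needs instead of taping, or pointing into, intermediate matrices. Here is a minimal sketch for a simpler operation, y = A * b, whose reverse pass is adj_A += adj_y * b^T and adj_b += A^T * adj_y. The Mat and MatVecNode types are hypothetical; the adjoint of exp(A) * B needs the derivative of the matrix exponential and is more involved than this.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense row-major matrix of doubles (hypothetical helper for this sketch).
struct Mat {
  std::size_t rows, cols;
  std::vector<double> data;
  Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Forward pass for y = A * b, computed entirely in double.
std::vector<double> matvec(const Mat& A, const std::vector<double>& b) {
  assert(A.cols == b.size());
  std::vector<double> y(A.rows, 0.0);
  for (std::size_t i = 0; i < A.rows; ++i)
    for (std::size_t j = 0; j < A.cols; ++j) y[i] += A(i, j) * b[j];
  return y;
}

// The node owns copies of A's and b's values, so its reverse pass never
// depends on the lifetime of some intermediate object.
struct MatVecNode {
  Mat A;
  std::vector<double> b;

  // Vector-Jacobian product: accumulate adj_A += adj_y * b^T and
  // adj_b += A^T * adj_y.
  void chain(const std::vector<double>& adj_y, Mat& adj_A,
             std::vector<double>& adj_b) const {
    for (std::size_t i = 0; i < A.rows; ++i)
      for (std::size_t j = 0; j < A.cols; ++j) {
        adj_A(i, j) += adj_y[i] * b[j];
        adj_b[j] += A(i, j) * adj_y[i];
      }
  }
};
```

For A = [[1, 2], [3, 4]], b = (5, 6) and adj_y = (1, 10), this gives adj_A = [[5, 6], [50, 60]] and adj_b = (31, 42).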

@yizhang, are you around to help out with this?

@charlesm93 is the one who wrote the matrix_exp implementation, but anyone can fix it.

Did you isolate a specific place where you think there’s unsafe memory handling?

It’s a heisenbug, which is telling me that it’s in the memory handling somewhere. And I’ve been able to get it to fail sporadically on Mac too (multiple machines), so I’m convinced it’s a problem.
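One possible way memory goes wrong here (a guess, not a diagnosis): as far as I know, Stan's varis live in an arena (the stack_alloc whose destructor shows up at the end of the passing log), which is released all at once and never runs destructors. A node that keeps a pointer to memory owned by a temporary can read freed memory in chain(), and whether that crashes depends on the allocator, which would fit a bug that shows up differently on Windows and Mac. Below is a toy bump allocator sketching the safe pattern of copying the needed values into the arena; every name in it (Arena, alloc_array, recover, copy_to_arena) is hypothetical, and it is not Stan's stack_alloc.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <vector>

// Toy bump ("arena") allocator. Memory is carved out of large blocks and
// released all at once by recover(); no destructor ever runs for objects
// placed in the arena.
class Arena {
 public:
  explicit Arena(std::size_t block_bytes = 1 << 16) : block_bytes_(block_bytes) {}

  // Allocate n value-initialized objects of type T. Only trivially
  // destructible types are allowed, since their destructors never run.
  template <typename T>
  T* alloc_array(std::size_t n) {
    static_assert(std::is_trivially_destructible<T>::value,
                  "arena objects must not need a destructor");
    const std::size_t bytes = n * sizeof(T);
    std::size_t offset = round_up(used_, alignof(T));
    if (blocks_.empty() || offset + bytes > cap_) {
      new_block(bytes);
      offset = 0;
    }
    unsigned char* raw = blocks_.back().get() + offset;
    for (std::size_t i = 0; i < n; ++i)
      ::new (static_cast<void*>(raw + i * sizeof(T))) T();
    used_ = offset + bytes;
    return reinterpret_cast<T*>(raw);
  }

  // Release every block at once (the analogue of recovering the tape).
  void recover() {
    blocks_.clear();
    used_ = cap_ = 0;
  }

  std::size_t num_blocks() const { return blocks_.size(); }

 private:
  static std::size_t round_up(std::size_t x, std::size_t a) {
    return (x + a - 1) / a * a;
  }
  void new_block(std::size_t min_bytes) {
    cap_ = min_bytes > block_bytes_ ? min_bytes : block_bytes_;
    blocks_.emplace_back(new unsigned char[cap_]);
    used_ = 0;
  }

  std::size_t block_bytes_;
  std::vector<std::unique_ptr<unsigned char[]>> blocks_;
  std::size_t used_ = 0;
  std::size_t cap_ = 0;
};

// Copy the values a reverse pass will need into the arena, so the copy
// lives as long as the tape instead of as long as some temporary.
double* copy_to_arena(Arena& arena, const std::vector<double>& src) {
  double* dst = arena.alloc_array<double>(src.size());
  for (std::size_t i = 0; i < src.size(); ++i) dst[i] = src[i];
  return dst;
}
```

The copy stays valid after the source vector goes out of scope and after the arena grows a new block; it only becomes invalid at recover().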

I’ve been trying to hunt this down on a branch: feature/daniel-windows. It branches off the Windows-fix branch. I’m leaving that branch alone until I know what’s going on here.

I got lost in all the indirection. I don’t see any definition of matrix_exp() other than in stan/math/prim. Do you think the bug is in Eigen?

Sorry – my bad. I wasn’t careful.

The problem is in:
stan/math/rev/mat/fun/matrix_exp_multiply.hpp.

Shit… I hadn’t considered that it could be a bug in Eigen. I don’t think so, but maybe.

Anyway, if you want to help debug, this is the branch I’m on:
feature/daniel-windows

This is the test I’m running:
./runTests.py test/unit/math/rev/mat/fun/matrix_exp_multiply_test.cpp

I’m debugging by print. On that branch, here’s what it’ll look like when it’s running properly:

test/unit/math/rev/mat/fun/matrix_exp_multiply_test --gtest_output="xml:test/unit/math/rev/mat/fun/matrix_exp_multiply_test.xml"
Running main() from lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from MathMatrix
[ RUN      ] MathMatrix.matrix_exp_multiply_vd
stack size at start: 0
stack size after Av (5, 5): 200
Av =   -0.96871   0.398827   0.241306   0.741373   0.108926
  0.888077  -0.915624  -0.373344   0.255238   0.717304
-0.0899219  -0.898862  -0.800546  -0.222652  -0.271382
  0.683227   0.827031  -0.780702  -0.104228   0.885106
 -0.996585  -0.097802   0.739617   0.235266 -0.0247717
stack size after Avec: 200
stack size after B: 200
stack size: 200
+++++++++++++++++++++++++++++++
*************************** this is the right one
stack size after res_vd: 201
--------------------------- chain
-- before setting up adjexpAB(5, 5)
-- traversing variRefexpAB_.
--   0 / 25
--   1 / 25
--   2 / 25
--   3 / 25
--   4 / 25
--   5 / 25
--   6 / 25
--   7 / 25
--   8 / 25
--   9 / 25
--   10 / 25
--   11 / 25
--   12 / 25
--   13 / 25
--   14 / 25
--   15 / 25
--   16 / 25
--   17 / 25
--   18 / 25
--   19 / 25
--   20 / 25
--   21 / 25
--   22 / 25
--   23 / 25
--   24 / 25
-- setting up adjA
-- traversing variRefA_
--------------------------- end chain
[       OK ] MathMatrix.matrix_exp_multiply_vd (0 ms)
[----------] 1 test from MathMatrix (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.
stack_alloc destructor

Right now, on Windows, it crashes before chain() is called:

test\unit\math\rev\mat\fun\matrix_exp_multiply_test.exe --gtest_output="xml:test/unit/math/rev/mat/fun/matrix_exp_multiply_test.xml"
Running main() from lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from MathMatrix
[ RUN      ] MathMatrix.matrix_exp_multiply_vd
stack size at start: 0
stack size after Av (5, 5): 200
Av =  -0.599231    0.81756   -0.68865   -0.62035  -0.922178
  0.876461  -0.741508  0.0892666  -0.410077   -0.23545
 -0.109043  -0.184118   0.217994   0.652577   0.269936
 -0.903745   0.205054  -0.177526   0.547655   0.427168
 0.0142521   0.747185 -0.0967742   0.123997  -0.175085
stack size after Avec: 200
stack size after B: 200
stack size: 200
stack size after res_vd: 201
test\unit\math\rev\mat\fun\matrix_exp_multiply_test.exe --gtest_output="xml:test/unit/math/rev/mat/fun/matrix_exp_multiply_test.xml" failed
exit now (11/29/18 21:16:05 Eastern Standard Time)

Sometimes I can get it to crash on my Mac too, but at a different point: somewhere in the loop (around 10 / 25). It only happens sometimes.