Error not systematically caught by unit test

As mentioned at the last Stan meeting, this is the error that did not quite get caught by the unit test:

line 86 of algebra_solver.hpp:

theta_(i) = var(new vari(theta_d(i), false));

If the second argument is true instead of false, the unit test sometimes passes and sometimes fails, with no informative error message (the kind of failure that usually happens when the C++ code crashes, a typical example being a segmentation fault). The non-systematic nature of the error leads me to think there is something off with memory allocation.

Here’s the vari constructor:

      vari(double x, bool stacked):
        val_(x),
        adj_(0.0) {
        if (stacked)
          ChainableStack::var_stack_.push_back(this);
        else
          ChainableStack::var_nochain_stack_.push_back(this);
      }

The code works fine when stacked = false, so no action required, but I wanted to present an error a unit test may fail to pick up on.

Yes, that’s usually either a race condition or randomness in memory.

If you think there’s a problem in our underlying memory management, we need a unit test that fails. Ideally one that’s not dependent on an algebraic solver that isn’t even in the library yet.

With stacked == true, the chain() method gets called on the vari, whereas with nonchain it doesn’t. So I’d guess the problem you’re having is in whatever chain() method you wrote for the new vari.
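To spell out the difference, the reverse pass only ever walks var_stack_. Roughly, this is the loop inside grad() (a simplified sketch):

typedef std::vector<vari*>::reverse_iterator it_t;
// the reverse sweep visits varis on var_stack_ in reverse order of creation
it_t begin = ChainableStack::var_stack_.rbegin();
it_t end = ChainableStack::var_stack_.rend();
for (it_t it = begin; it < end; ++it)
  (*it)->chain();

A vari pushed onto var_nochain_stack_ never shows up in this loop, so its chain() is never invoked during the reverse sweep.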

If I follow you correctly, I would say the chain() method I wrote doesn’t act on the vari. Rather it acts on one of the arguments of chain(), namely parms. So I suppose it makes sense to have stacked == false. I’ll dig into the autodiff paper to check if my intuition is right.

I pasted the chain method below:

        inline void chain(const F1& f1,
                          const Eigen::Matrix<double, 
                            Eigen::Dynamic, 1>& x,
                          Eigen::Matrix<var, Eigen::Dynamic, 1>& parms,
                          const std::vector<double>& dat,
                          const std::vector<int>& dat_int,
                          const Eigen::Matrix<double,
                            Eigen::Dynamic, 1>& adjTheta) {
          // find roots of the equation
          Eigen::Matrix<double, Eigen::Dynamic, 1>
            parms_val = value(parms);
          hybrj_functor_solver<F1, double, double>
            functor(f1, x, parms_val, dat, dat_int, "theta");
          Eigen::HybridNonLinearSolver<hybrj_functor_solver<F1, double, double> >
            solver(functor);
          Eigen::Matrix<double, Eigen::Dynamic, 1> theta = x;
          solver.solve(theta);

          // compute Jacobian
          Eigen::MatrixXd Jf_x = functor.get_jacobian(theta);
          stan::math::hybrj_functor_solver<F1, double, double>
            functor_parm(f1, theta, parms_val, dat, dat_int, "parms");
          Eigen::MatrixXd Jf_p = functor_parm.get_jacobian(parms_val);
          Eigen::MatrixXd Jx_p = - inverse(Jf_x) * Jf_p;
          for (int i = 0; i < adjTheta.rows(); i++)
            for (int j = 0; j < parms.rows(); j++)
              parms(j).vi_->adj_ += adjTheta(i) * Jx_p(i, j);
        }

I’m talking about the method with signature

void chain();

sitting inside vari. It doesn’t have any arguments.

Hi Bob,
The void chain() method indeed acts on vari. Specifically, it acts on the element impl_, a pointer to an algebra_solver_vari_alloc object that contains the elements {f1_, x_, parms_, dat_, dat_int_, theta_}. Does this mean I should have stacked == true? As far as I can tell, that makes the unit test unstable (i.e. it sometimes passes and sometimes doesn’t).

Ok, maybe I did something wrong when I wrote the chain() method, though it strikes me as fairly straightforward:

virtual void chain() {
  Eigen::Matrix<double, Eigen::Dynamic, 1>
    adjTheta(impl_->theta_.rows());

  for (int i = 0; i < impl_->theta_.rows(); i++)
    adjTheta(i) = impl_->theta_(i).vi_->adj_;

  chain(impl_->f1_, impl_->x_, impl_->parms_, impl_->dat_,
        impl_->dat_int_, adjTheta);
}

I also wrote more unit tests, and got another unstable result. Calculating the Jacobian via the implicit function theorem requires the dimension of the function output (Z) and the unknowns (X, or x_) to be the same – but puts no constraints on the dimension of the auxiliary parameters (Y or parms_). Letting [X] mean the dimension of X, the test passes when [X] = [Y] = [Z] = 2. If I set [Y] = 3, the test is unstable. With [Y] = 4, no problem.
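To spell out the relation the chain method computes: writing the system as $f(x(y), y) = 0$ and differentiating with respect to $y$,

$$
J_{f,x}\, J_{x,y} + J_{f,y} = 0
\quad\Longrightarrow\quad
J_{x,y} = -\, J_{f,x}^{-1}\, J_{f,y},
$$

which is Jx_p = - inverse(Jf_x) * Jf_p in the code above. $J_{f,x}$ has to be square and invertible, hence [Z] = [X], while $J_{f,y}$ is [Z] x [Y] for any [Y].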

I pinpointed the line that breaks the code:

theta(k).grad(parms_vec, g);

Hunting down what happens under the hood got messy. Why doesn’t gtest let you know which line causes the error? I guess that only works for compilation errors…

I’m digging, but I figured someone might have an insight. This issue may also be related to the one previously discussed.

gtest will let you know on which line a test fails. But what you’re encountering is almost certainly a crash, not a test failure. So there’s nothing to catch. The error just crashes the whole process.

It’s almost certainly in that complex chain method you wrote. It’s probably leaking memory or accessing something out of scope.

I always try to track this kind of thing down by putting print statements in the chain method (std::cout << ... << std::endl; — the endl is important to flush). Then you can find where it gets to.
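For example, something like this inside the big chain method (a sketch, using the names from the code posted above; it needs #include &lt;iostream&gt;):

std::cout << "chain: solved theta = " << theta.transpose() << std::endl;
std::cout << "chain: Jf_x is " << Jf_x.rows() << " x " << Jf_x.cols()
          << std::endl;  // std::endl flushes, so the output shows up even before a crash

That tells you how far execution gets before the process dies.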

Also, try to isolate unit tests. Create a test that just calls that complicated chain() method if that’s where the problem is.

Also, you don’t ever ever ever want to do something like inverse(A) * B. Always code that as a matrix division A \ B. You can use our built-in mdivide_left.

It’s almost certainly in that complex chain method you wrote. It’s probably leaking memory or accessing something out of scope.

That sounds right. So far, I have two passing unit tests for the solver. I noticed the third test behaves (i.e. fails) differently depending on whether I run the two other tests beforehand. I’ll follow your advice and create more isolated tests focusing on chain().

Also, you don’t ever ever ever want to do something like inverse(A) * B. Always code that as a matrix division A \ B. You can use our built-in mdivide_left.

Agh, that’s true! We already discussed this for the matrix exponential… I should have known better!
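For the record, the fix for that line would look something like this (a sketch using Stan Math’s mdivide_left):

// solve Jf_x * Jx_p = -Jf_p directly instead of forming inverse(Jf_x)
Eigen::MatrixXd Jx_p = -stan::math::mdivide_left(Jf_x, Jf_p);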

I created unit tests for chain() and the function seems to work fine: https://github.com/stan-dev/math/blob/feature/dogleg/test/unit/math/rev/mat/functor/algebra_solver_chain_test.cpp

I also found the line that caused the code to break. It’s in the void grad(vari* vi) function, namely:

for (it_t it = begin; it < end; ++it) {
  (*it)->chain();
}

(it was tricky to find: algebra_solver calls jacobian which itself calls grad, so I had to mark the Jacobian to distinguish between different calls to grad).

I’m not sure how to inspect (*it). What does it contain? It is defined as std::vector<vari*>::reverse_iterator. Reverse iterators, I gather, are defined in boost and are not Stan specific. I can print elements of it. The output looks something like 1:1, which I think corresponds to a parameter’s value and its adjoint.

The problem is that memory leaks can seem to work in one test but then fail in another based on how the memory gets reclaimed.

reverse_iterator is in the std namespace—it’s just defined by std::vector as part of the standard C++ library. It’s the iterator that iterates over the elements in reverse order (highest index down to zero). What *it will contain is a vari* — a pointer to a variable implementation. And yes, 1:1 is the value and adjoint.

If I comment out the following line, the code no longer breaks (the unit test then fails, but that is expected; how exactly this relates to the line that breaks, I’m not clear on). The line is in the void chain(const F1& f1, …) method, under // compute Jacobian:

Eigen::MatrixXd Jf_p = functor_parm.get_jacobian(parms_val);

Ok. On the other hand, the following line works fine:

Eigen::MatrixXd Jf_x = functor.get_jacobian(theta);

The difference between the two structures is that functor keeps the auxiliary parameters (in practice, the model parameters) fixed, whereas functor_parm keeps the independent variables (the unknowns we want to solve for) fixed. This is set using the variable_ member, which then acts on the following operator:

  template <typename T>
  inline
  Eigen::Matrix<typename boost::math::tools::promote_args<T, T1, T2>::type,
                Eigen::Dynamic, 1>
  operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1> x) const {
    if (variable_ == "theta")
      return algebraEq3(x, parms_, dat_, dat_int_);
    else
      return algebraEq3(theta_, x, dat_, dat_int_);
  }

Could the use of the Jacobian method inside the chain method create a memory leak? I’m not 100% sure how this would happen, but I understand a custom arena gets created when we call the grad function – so maybe calling grad inside of grad has some undesired side effects? Note grad gets called inside of grad, because we use the Jacobian to compute adjoints.

@charlesm93, are you using the nested methods? As an example, see how we compute the Jacobian for ODEs, https://github.com/stan-dev/math/blob/develop/stan/math/rev/mat/functor/jacobian.hpp.

@betanalpha I’m using jacobian() for every evaluation of the Jacobian, which itself uses the nested methods. Unless I’m missing something, I don’t expect additional nesting to be necessary. I’ll look at the implementation for ODEs.

It may be that the implementation of jacobian isn’t completely nest-safe. Note the needed functions to start and end nesting, as well as the special nested function for clearing local adjoints.
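That is, the pattern looks roughly like this (a sketch; the relevant calls are start_nested(), set_zero_all_adjoints_nested(), and recover_memory_nested()):

try {
  stan::math::start_nested();            // open a nested stack for this Jacobian
  // build x_var from the double-valued inputs and evaluate fx_var = f(x_var);
  // then for each output i:
  //   stan::math::set_zero_all_adjoints_nested();  // clear only the nested adjoints
  //   grad(fx_var(i).vi_);                         // reverse sweep for row i of J
  //   copy x_var(j).adj() into J(i, j)
  stan::math::recover_memory_nested();   // free the varis allocated since start_nested()
} catch (const std::exception& e) {
  stan::math::recover_memory_nested();   // always unwind the nesting before rethrowing
  throw;
}

If varis created inside the nested region end up referenced from the outer stack, the outer reverse pass will touch memory that has already been recovered.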

@betanalpha Regarding the ODEs, I take it you’re talking about the Jacobians calculated in coupled_ode_system.hpp, lines 136 - 172? See https://github.com/stan-dev/math/blob/develop/stan/math/rev/arr/functor/coupled_ode_system.hpp.

Correct.

Not sure how significant this is, but coupled_ode_system, unlike Jacobian, uses reserve() and insert() to construct vector<var> objects; Jacobian only uses Eigen’s vectors. I tried to implement something similar to both methods, see lines 132 - 141 and 142 - 210 of https://github.com/stan-dev/math/blob/feature/dogleg/stan/math/rev/mat/functor/algebra_solver.hpp.

The right Jacobian gets computed but with both methods the bug persists.

Some further hints:

  • If I don’t run any of the methods to compute the Jacobian, and just write the Jacobian by hand, the code runs.
  • If the number of states (dim(x)) is greater than the number of parameters (dim(y)), the code runs.
  • And then the other major hint is the line that breaks (see post #10).

Still pondering how to relate the symptoms. In light of hint (2), I can imagine a hack where the dimension of x would artificially be incremented to match that of y, but this seems very risky.

I’d still suggest instrumenting the code with print statements to see if you can diagnose which function call is causing the problem. Until you know where to look, the code’s just too big to diagnose by inspection.

I don’t see anything immediately wrong with the Jacobian code. Just looking at the code, are you sure that J is getting initialized before df gets called?

If size(x) > size(y) really does fix the problem, I’d look for places where the x and y might be confused in the code or an allocation might be size(x) x size(x) rather than size(x) x size(y).

I’d still suggest instrumenting the code with print statements to see if you can diagnose which function call is causing the problem.

The code breaks on the second iteration of (*it)->chain();. This is made clear in algebra_solver_chain_bis_test.cpp, line 106.

The line which causes the error is inside Jacobian(), see jacobian_test.hpp, line 28:

Matrix<var, Dynamic, 1> fx_var = f(x_var);

I found this by commenting out sections of the code. I created a copy of jacobian(), called jacobian_test(), so as not to mess up the rest of the code. Note that f(x_var) ALWAYS gets properly evaluated, and so do the returned objects of jacobian().

are you sure that J is getting initialized before df gets called?

I had df() print J, and always got the expected result.

If size(x) > size(y) really does fix the problem, I’d look for places where the x and y might be confused in the code or an allocation might be size(x) x size(x) rather than size(x) x size(y).

I went through the code line by line and added assert statements to check the sizes of various objects: if the error is there, I haven’t found it.
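For concreteness, these are the kinds of checks I mean (using the names from the chain method above; they need #include &lt;cassert&gt;):

assert(Jf_x.rows() == Jf_x.cols());     // [Z] == [X]: Jf_x must be square
assert(Jf_p.rows() == Jf_x.rows());     // both Jacobians have [Z] rows
assert(Jf_p.cols() == parms.rows());    // [Y] columns for the parameters
assert(Jx_p.rows() == theta.rows() && Jx_p.cols() == parms.rows());  // Jx_p is [X] x [Y]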

Is there maybe a way to track memory allocation that we could implement inside the Stan code?