Measuring memory usage


#1

This is to test performance.

I know there are some unit tests that monitor memory usage on the stack in test/unit/memory/, but this doesn’t quite do what I have in mind.

Here’s a snippet of code from one of my experiments:

      start_nested();
      vector<var> theta = {theta_dbl[j][0], theta_dbl[j][1],
                           theta_dbl[j][2], theta_dbl[j][3]};
      Matrix<var, Dynamic, Dynamic> A_v(2, 2);
      A_v << theta[0], theta[1], theta[2], theta[3];
      
      Matrix<var, Dynamic, Dynamic>
        exp_A = matrix_exp_2x2(A_v);
      
      //compute Jacobian
      for (int k = 0; k < 4; k++) {
        if (k > 0) set_zero_all_adjoints_nested();
        grad(exp_A(row[k], col[k]).vi_);
        for (int i = 0; i < 4; i++) J(i, k) = theta[i].adj();
      }
      // cout << J << "\n \n";
      recover_memory_nested();

It is not too hard to compute run time – but what would I have to do to monitor memory usage? The goal is to compare different approaches and test which one is most memory efficient.


#2

If you’re doing everything in C++ then you may want to take a look at Valgrind, especially the Massif tool: http://valgrind.org/docs/manual/ms-manual.html.


#3

It looks like installing Valgrind on a recent version of macOS is tricky. I’m on High Sierra myself. See https://stackoverflow.com/questions/48714807/unable-to-build-and-install-valgrind-on-macos-high-sierra.


#4

Maybe try MacPorts? That is how I installed it without too much pain.


#5

MacPorts works. That said, with my current macOS version and version of Xcode, I had to install the development version of Valgrind.


#6

Ok… so, I got massif to work on a very simple “hello world” program. Now, let’s try it with something we’re actually interested in.

First, the C++ code:

#include <stan/math.hpp>
#include <iostream>
#include <random>
#include <vector>

int main(void) {

  using stan::math::var;
  using stan::math::matrix_exp_2x2;
  using stan::math::set_zero_all_adjoints_nested;
  using stan::math::recover_memory_nested;
  using stan::math::start_nested;

  using Eigen::Matrix;
  using Eigen::Dynamic;
  using Eigen::MatrixXd;
  using std::cout;
  using std::vector;

  std::default_random_engine generator;
  std::uniform_real_distribution<double> unif(1, 10);

  // Randomly generate entries of the matrix we will exponentiate.
  vector<vector<double> > theta_dbl(100);

  for (int i = 0; i < 1000; i++) {
    double a = unif(generator),
      b = unif(generator),
      c = unif(generator),
      d = unif(generator);
    theta_dbl[i] = {a, b, c, d};
  }

  MatrixXd J(4, 4);
  // Row and column indices of the four entries of the 2x2 result.
  vector<int> row = {0, 1, 0, 1};
  vector<int> col = {0, 0, 1, 1};

  int N_tests = 100;
  vector<double> results(N_tests);

  for (int n_test = 0; n_test < N_tests; n_test++) {

    for (int j = 0; j < 1000; j++) {
      start_nested();
      vector<var> theta = {theta_dbl[j][0], theta_dbl[j][1],
                           theta_dbl[j][2], theta_dbl[j][3]};
      Matrix<var, Dynamic, Dynamic> A_v(2, 2);
      A_v << theta[0], theta[1], theta[2], theta[3];

      Matrix<var, Dynamic, Dynamic>
        exp_A = matrix_exp_2x2(A_v);
        // exp_A = matrix_exp_2x2_standard(A_v);

      //compute Jacobian
      for (int k = 0; k < 4; k++) {
        if (k > 0) set_zero_all_adjoints_nested();
        grad(exp_A(row[k], col[k]).vi_);
        for (int i = 0; i < 4; i++) J(i, k) = theta[i].adj();
      }
      // cout << J << "\n \n";
      recover_memory_nested();
    }
  }

  return 0;
}

Then, compile and run massif from the terminal window:

clang++ -g -std=c++11 -I ../math/ -I lib/eigen_3.3.3/ -I lib/cvodes-3.1.0/include/ -I lib/boost_1.66.0/ -I lib/idas-2.1.0/include/ $1

valgrind --tool=massif ./a.out

This returns the following error message:

==98198== Process terminating with default action of signal 11 (SIGSEGV)
==98198==  Access not within mapped region at address 0x0
==98198==    at 0x1007A5FA9: _platform_memmove$VARIANT$Haswell (in /usr/lib/system/libsystem_platform.dylib)
==98198==    by 0x10003EC00: _ZNSt3__16vectorIdNS_9allocatorIdEEE6assignIPKdEENS_9enable_ifIXaasr21__is_forward_iteratorIT_EE5valuesr16is_constructibleIdNS_15iterator_traitsIS8_E9referenceEEE5valueEvE4typeES8_S8_ (algorithm:1722)
==98198==    by 0x1000015C2: main (vector:563)
==98198==  If you believe this happened as a result of a stack
==98198==  overflow in your program's main thread (unlikely but
==98198==  possible), you can try to increase the size of the
==98198==  main thread stack using the --main-stacksize= flag.
==98198==  The main thread stack size used in this run was 8388608.

Segmentation fault: 11  valgrind --tool=massif ./a.out

even though the executable itself runs fine.
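
One thing that may be worth checking before trusting the numbers: the trace ends in a std::vector assignment called from main, and theta_dbl is constructed with 100 entries while the loop that fills it runs to 1000. Writing past the end of a vector is undefined behavior, which could explain a program that appears to run fine on its own yet faults under Valgrind. Below is a minimal sketch of a fill that sizes the container from the loop bound. The names make_theta and n_draws are hypothetical and not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: builds n_draws parameter sets of four entries each.
// The outer vector is sized from n_draws and the loop is bounded by size(),
// so the fill cannot run past the end of the container.
std::vector<std::vector<double>> make_theta(std::size_t n_draws) {
  std::default_random_engine generator;
  std::uniform_real_distribution<double> unif(1, 10);

  std::vector<std::vector<double>> theta_dbl(n_draws);
  for (std::size_t i = 0; i < theta_dbl.size(); ++i) {
    double a = unif(generator);
    double b = unif(generator);
    double c = unif(generator);
    double d = unif(generator);
    // at() throws std::out_of_range on a bad index instead of silently
    // writing to memory the vector does not own.
    theta_dbl.at(i) = {a, b, c, d};
  }
  return theta_dbl;
}
```

Calling make_theta(1000) in place of the hand-written loop keeps the container size and the number of draws in one place.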

Looking at the output with:

ms_print massif.out.98198

we get a rather long and intricate printout. My instinct is to look at total (B), the bytes used on both the useful and the extra heap. I get 91,904, whereas with the hello world code I got around 80,000. I’m not sure how to interpret these results or whether to trust them.

There’s a lot of documentation to dig into, but I’m hoping someone with experience doing these tests can give me some pointers. Thanks!


#7

Another thing you may want to try if Valgrind is giving you trouble is the Instruments application that ships with Xcode, in particular the Allocations instrument.


#8

Memory is allocated in different ways. There’s a heap of memory allocated in C++, and we implement our arena within that heap. There is also stack-based memory, which is managed separately and holds the actual object member variables. For example, a std::vector has a pointer and size stored on the stack, while the memory for its elements is on the heap.
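
To make the std::vector example concrete, here is a small sketch in plain C++ (no Stan involved). The helper names handle_bytes, element_bytes and elements_outside_handle are made up for illustration. The size of the vector object itself does not depend on how many elements it holds (typically three pointers' worth on common 64-bit implementations), while the element storage is a separate allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Size of the vector object itself: the small handle that lives on the stack
// when the vector is a local variable. It does not grow with the elements.
std::size_t handle_bytes(const std::vector<double>& v) { return sizeof(v); }

// Bytes of element storage the vector has reserved, which is allocated
// separately from the handle.
std::size_t element_bytes(const std::vector<double>& v) {
  return v.capacity() * sizeof(double);
}

// True if the element storage lies outside the vector object itself.
bool elements_outside_handle(const std::vector<double>& v) {
  auto handle = reinterpret_cast<std::uintptr_t>(&v);
  auto data = reinterpret_cast<std::uintptr_t>(v.data());
  return data < handle || data >= handle + sizeof(v);
}
```

A heap profiler such as Massif reports the element storage by default; the handle only shows up if stack profiling is turned on.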

Instead of using the heap for long term memory, Stan uses its own memory arena. Almost all of the memory Stan itself uses is in that memory arena after the log density is calculated. Everything else in memory is in the noise.

Like many programs, Stan allocates memory in exponentially increasing blocks. It doesn’t allocate per expression. So you’ll get these big jumps where it goes from 1MB to 2MB to 4MB to 8MB, etc., which will really mess with your evals through something like Valgrind. Those tools will also have a hard time keeping track of our memory allocation and deallocation on the arena and will gripe about memory leaks, etc., when there in fact aren’t any.
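
To illustrate why a heap profiler sees jumps rather than a smooth curve, here is a toy arena that doubles its block size each time it runs out of room. This is only a sketch of the general idea, not Stan's actual arena; ToyArena and all of its members are hypothetical names, and alignment and freeing are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy arena, NOT Stan's implementation: hands out bytes from large blocks and
// asks the heap for a new block, twice the size of the previous one, only
// when the current block cannot fit the request.
class ToyArena {
 public:
  explicit ToyArena(std::size_t first_block_bytes)
      : next_block_bytes_(first_block_bytes == 0 ? 1 : first_block_bytes) {}

  // Returns a pointer to n bytes inside the arena (no alignment handling).
  char* alloc(std::size_t n) {
    if (blocks_.empty() || used_in_block_ + n > block_sizes_.back()) {
      // Make sure the new block is large enough for this request.
      while (next_block_bytes_ < n) next_block_bytes_ *= 2;
      blocks_.emplace_back(new char[next_block_bytes_]);
      block_sizes_.push_back(next_block_bytes_);
      reserved_ += next_block_bytes_;
      next_block_bytes_ *= 2;  // the block after this one is twice as big
      used_in_block_ = 0;
    }
    char* p = blocks_.back().get() + used_in_block_;
    used_in_block_ += n;
    requested_ += n;
    return p;
  }

  // Bytes the caller asked for: grows smoothly with each alloc().
  std::size_t bytes_requested() const { return requested_; }
  // Bytes taken from the heap: this is what a heap profiler observes.
  std::size_t bytes_reserved() const { return reserved_; }
  // Number of separate heap allocations made so far.
  std::size_t heap_allocations() const { return blocks_.size(); }

 private:
  std::vector<std::unique_ptr<char[]>> blocks_;
  std::vector<std::size_t> block_sizes_;
  std::size_t next_block_bytes_;
  std::size_t used_in_block_ = 0;
  std::size_t requested_ = 0;
  std::size_t reserved_ = 0;
};
```

With a 1024-byte first block and 1000 requests of 24 bytes each, this toy makes five heap allocations totalling 31,744 bytes for 24,000 bytes requested. Two approaches whose true usage differs by a few kilobytes can therefore look identical, or wildly different, depending on whether one of them happens to cross a block boundary.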