It sounds like for the short term the GP development is all about efficiently doing the basic GP stuff.

@drezap, if you want to try to figure out how much memory things are using in a more granular way than just checking memory usage of the process, you can check the number of varis being allocated on the autodiff stack.

There are two big types in the autodiff system, vars and varis. Vars get shoveled around everywhere, and they can take up lots of space, but varis are the things that matter. @Bob_Carpenter calls this the pointer-to-implementation design pattern: every var points to a vari (multiple vars can point at the same vari), and it’s the varis that hold the values and adjoints that matter for autodiff.
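To make the var/vari split concrete, here is a toy sketch. It is not the Stan Math implementation: `toy_vari`, `toy_var`, `toy_stack`, and `varis_after_copy` are made-up names for illustration, and the real classes do a lot more. It only shows that copying a var copies a pointer, while the value and adjoint live in one shared vari owned by the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for a vari: holds the value and the adjoint.
struct toy_vari {
  double val;
  double adj;
  explicit toy_vari(double v) : val(v), adj(0.0) {}
};

// Toy stand-in for the autodiff stack: owns every toy_vari that gets created.
struct toy_stack {
  std::vector<std::unique_ptr<toy_vari>> varis;

  toy_vari* make(double v) {
    varis.push_back(std::make_unique<toy_vari>(v));
    return varis.back().get();
  }
};

// Toy stand-in for a var: just a pointer, so it is cheap to copy.
struct toy_var {
  toy_vari* vi;
};

// Copy a toy_var and report how many toy_varis the stack holds afterwards.
inline std::size_t varis_after_copy() {
  toy_stack stack;
  toy_var a{stack.make(2.0)};
  toy_var b = a;       // copies the pointer, not the toy_vari
  b.vi->adj = 1.0;     // the change is visible through a as well
  assert(a.vi == b.vi);
  assert(a.vi->adj == 1.0);
  return stack.varis.size();
}
```

In this toy model, `varis_after_copy()` reports a single vari even though two vars exist, which is why counting varis says more about memory than counting vars.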

At the end of evaluating a log probability, the only vars that might actually be left sitting around are the lp var and the ones on your parameters. Everything in the middle gets tossed, and it all comes down to the varis that were created in the process.

There are three types of varis, and they’re stored here: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/autodiffstackstorage.hpp#L49

You can get information about the number of varis sitting around with code like:

```
// MatrixSinFunctor is assumed to be defined elsewhere (it isn't shown here).
Eigen::Matrix<stan::math::var, Eigen::Dynamic, Eigen::Dynamic> x(2, 2), y(2, 2);
x << 2.0, 1.0, 0.0, -1.0;  // creates one vari per element of x

// Stack sizes before the call
std::cout << "Chaining stack: " << stan::math::ChainableStack::instance().var_stack_.size() << std::endl;
std::cout << "Non-chaining stack: " << stan::math::ChainableStack::instance().var_nochain_stack_.size() << std::endl;

y = stan::math::adj_jac_apply<MatrixSinFunctor>(x);

// Stack sizes after the call
std::cout << "Chaining stack: " << stan::math::ChainableStack::instance().var_stack_.size() << std::endl;
std::cout << "Non-chaining stack: " << stan::math::ChainableStack::instance().var_nochain_stack_.size() << std::endl;
```

The output is:

```
Chaining stack: 4
Non-chaining stack: 0
Chaining stack: 5
Non-chaining stack: 4
```

This shows that each of the four high-level parameters we’re going to autodiff has a vari, and that adj_jac_apply itself creates five more: one that goes on the chaining stack and four that go on the non-chaining stack.

The difference between the chaining and non-chaining stacks mostly has to do with performance. Either way, the total number of chaining plus non-chaining varis should mostly determine the memory usage of your program when using basic autodiff (things implemented in prim only).

What I said is misleading if you consider functions with custom reverse mode autodiff. Those varis are also allowed to allocate memory in the memalloc_ variable here: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/autodiffstackstorage.hpp#L52

But if you wanted to understand memory usage in different kinds of GP kernels, my suggestion would be to write some code and just check how many varis there are before and after. That’s gonna tell you a lot. You can also watch how much stuff gets allocated in memalloc_, but the varis themselves get allocated in there too, so it’ll be a little convoluted.
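A minimal sketch of that before/after check is below. It doesn't use Stan Math: `toy_stacks`, `stack_delta`, `count_new_entries`, and `demo_delta` are hypothetical names, and the two vectors just stand in for var_stack_ and var_nochain_stack_. With the real library you'd read the sizes of those two members instead, as in the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the autodiff stack: two containers whose sizes
// we want to track (in place of var_stack_ and var_nochain_stack_).
struct toy_stacks {
  std::vector<int> chaining;
  std::vector<int> nochain;
};

// How many entries a piece of code added to each stack.
struct stack_delta {
  std::size_t chaining;
  std::size_t nochain;
};

// Record both sizes, run f, and return how much each stack grew.
template <typename F>
stack_delta count_new_entries(toy_stacks& s, F&& f) {
  const std::size_t chain_before = s.chaining.size();
  const std::size_t nochain_before = s.nochain.size();
  f();
  // This sketch assumes f only pushes entries and never clears the stacks.
  assert(s.chaining.size() >= chain_before);
  assert(s.nochain.size() >= nochain_before);
  return {s.chaining.size() - chain_before,
          s.nochain.size() - nochain_before};
}

// Mimic the numbers from the example: 4 entries already present, then an
// operation that adds 1 chaining and 4 non-chaining entries.
inline stack_delta demo_delta() {
  toy_stacks s;
  s.chaining.assign(4, 0);
  return count_new_entries(s, [&s] {
    s.chaining.push_back(0);
    s.nochain.insert(s.nochain.end(), 4, 0);
  });
}
```

Wrapping each kernel evaluation in something like `count_new_entries` gives a per-kernel delta instead of a running total, which should make it easier to compare kernels against each other.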

You should be able to account for all the varis that get created, but it might be a bit finicky. Probably worth the time though if you want to optimize the memory usage.

The stuff Aki is talking about here (Question about autodiff for potential GP covfun implementation) is workable, for sure, but there’s probably value in getting all the basic GPs you listed in your #1 above implemented in prim and rev first?