I wonder if we can avoid repeatedly creating a sample object when transition is called, as this involves copying the Eigen::VectorXd in the sample constructor. The sampler already has the updated sample in its z_.q component, and all it needs for adaptation is just the transition. Can we overwrite the input init_sample? @betanalpha Am I missing something here?
The code was written to be as general as possible – taking in one state and outputting another – independent of how those states are used. In particular it intentionally does not assume that the previous state will not be used later on, for example by in-memory summaries/diagnostics/etc.
Currently, once we write out a sampler state through the mcmc_writer and then pass it to the transition function, we never use that state again, so that memory could be reused (either by modifying the sample in place or maybe using an r-value pattern?), but then the code would be limited to that particular context.
Overall this is a pretty small memory hit, however, especially compared to the overall memory burden used by the transition function internally, as is being discussed in a few other threads.
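To make the trade-off between the general copy-based design and the two reuse options mentioned above concrete, here is a minimal sketch. It is not Stan's code: the Sample struct and the three transition_* functions are hypothetical names, std::vector<double> stands in for Eigen::VectorXd so the snippet compiles without Eigen, and the "update" is a placeholder for the real sampler step.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for a sampler state. std::vector<double> plays the
// role of Eigen::VectorXd here (an assumption made to avoid depending on Eigen).
struct Sample {
  std::vector<double> params;
  double log_prob;
};

// General version: one state in, a new state out. The caller's state stays
// valid, at the cost of one heap allocation and one copy per call.
Sample transition_copy(const Sample& init) {
  Sample next{init.params, init.log_prob};  // copies the parameter vector
  for (double& x : next.params) x += 1.0;   // placeholder for the real update
  next.log_prob -= 1.0;
  return next;
}

// R-value version: the caller gives up the old state, so its buffer is
// moved into the result instead of being copied.
Sample transition_move(Sample&& init) {
  Sample next{std::move(init.params), init.log_prob};  // takes over the buffer
  for (double& x : next.params) x += 1.0;
  next.log_prob -= 1.0;
  return next;
}

// In-place version: overwrites the caller's state, with no allocation at all.
void transition_in_place(Sample& state) {
  for (double& x : state.params) x += 1.0;
  state.log_prob -= 1.0;
}
```

Only the first version leaves the previous state usable afterwards, for example by in-memory summaries or diagnostics; the other two save the allocation but tie the code to callers that are done with the old state.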
Can you point me to them? Thanks.
It’s small in size, but could be big in terms of memory pressure (how many times we have to malloc) and non-locality (fetching from RAM [instead of cache on a cache miss] is over 100 times slower than arithmetic).
No disagreement on the potential for memory issues. I encountered no end of memory-related performance issues when experimenting with higher-order autodiff implementations back in the day with arena allocators modeled on those used by Stan. At the same time, when I added the additional termination checks in the last big PR to the Hamiltonian Monte Carlo implementation, which required three or four new state vectors for each active subtree, there was no appreciable effect on performance over a range of target distributions.
I’m not saying that memory can’t be an issue in certain cases, I’m just asking for empirical demonstrations that it’s actually becoming an issue in realistic problems.
That’s also what I’d want to see. But I don’t know how to do this. To demonstrate it, I’d probably try to build something faster by being more careful with memory reallocation and show it’s faster. I’m curious how you might show this without a better alternative—I don’t know much about diagnosing/tracing memory issues like this.
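One rough way to get that kind of evidence is to time the same update loop with and without a fresh allocation per iteration. The sketch below does only that. It is a toy harness, not a measurement of Stan: run_fresh, run_reuse and RunResult are hypothetical names, std::vector<double> again stands in for Eigen::VectorXd, and the increment is a placeholder for a real transition.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Wall-clock seconds for one run, plus a checksum that both variants must
// agree on and that keeps the loop from being optimized away.
struct RunResult {
  double seconds;
  double checksum;
};

// Allocates and copies a new state vector on every iteration (dim must be > 0).
RunResult run_fresh(std::size_t dim, std::size_t iters) {
  const auto start = std::chrono::steady_clock::now();
  std::vector<double> state(dim, 0.0);
  double checksum = 0.0;
  for (std::size_t i = 0; i < iters; ++i) {
    std::vector<double> next(state);  // fresh allocation plus copy
    for (double& x : next) x += 1.0;  // placeholder update
    checksum += next[0];
    state = std::move(next);          // old buffer is freed here
  }
  const auto stop = std::chrono::steady_clock::now();
  return {std::chrono::duration<double>(stop - start).count(), checksum};
}

// Reuses a single buffer for the whole run (dim must be > 0).
RunResult run_reuse(std::size_t dim, std::size_t iters) {
  const auto start = std::chrono::steady_clock::now();
  std::vector<double> state(dim, 0.0);
  double checksum = 0.0;
  for (std::size_t i = 0; i < iters; ++i) {
    for (double& x : state) x += 1.0;  // same update, done in place
    checksum += state[0];
  }
  const auto stop = std::chrono::steady_clock::now();
  return {std::chrono::duration<double>(stop - start).count(), checksum};
}
```

Comparing the seconds field of the two results over a range of dim values, from a handful of parameters up to sizes that no longer fit in cache, would show whether the per-iteration allocation is visible at all next to the work done in the loop. In a real sampler the gradient evaluations inside each transition would have to be part of that comparison.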
I’m going back to my time experimenting with higher-order autodiff frameworks based on Stan’s autodiff framework. The performance differences were drastic once the memory burden became too large for the caches. Since then I’ve always kept an eye out on scaling with system size, and I’ve never been able to see a significant effect in the sampler code.
Dedicated tools like valgrind and their ilk definitely provide a more complete picture of what’s going on.