Hi everyone :-)
I’m working on an algorithm for finding better approximations to the mass matrix that still works in moderately high-dimensional models. You can find an implementation for pymc3 here.
It seems to be quite similar in spirit to https://arxiv.org/pdf/1905.11916.pdf, but should be more scalable. It is based on the outer product of gradients instead of the Hessian (although the basic idea should also work with the Hessian).
We write the mass matrix in the form C = D(I + V\Sigma V^T - VV^T)D for diagonal \Sigma and D, and V \in R^{n \times k} with orthonormal columns. The inner matrix has the columns of V as eigenvectors with the eigenvalues in \Sigma, and all other eigenvalues are 1. Given V, D and \Sigma we can compute the matrix-vector products C^{-1}v and C^{1/2}v that we need for HMC without ever storing the full n-by-n matrix, just by transforming the eigenvalues \Sigma and the diagonal D (e.g. \Sigma \to \Sigma^{-1} and D \to D^{-1} for C^{-1}).
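To make that concrete, here is a minimal numpy sketch of those two products (the helper name `make_matvecs` is mine, not covadapt's API; `d` is assumed to hold the diagonal of D, `sigma` the modified eigenvalues, and `V` the orthonormal eigenvectors):

```python
import numpy as np

def make_matvecs(d, V, sigma):
    """Matrix-vector products for C = D (I + V Sigma V^T - V V^T) D
    without forming the full n-by-n matrix."""

    def inv_matvec(v):
        # C^{-1} v = D^{-1} (I + V (Sigma^{-1} - I) V^T) D^{-1} v
        w = v / d
        w = w + V @ ((1.0 / sigma - 1.0) * (V.T @ w))
        return w / d

    def sqrt_matvec(v):
        # L v with L = D (I + V (Sigma^{1/2} - I) V^T), so that
        # L L^T = C, i.e. a square-root factor of C
        w = v + V @ ((np.sqrt(sigma) - 1.0) * (V.T @ v))
        return d * w

    return inv_matvec, sqrt_matvec
```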
I use the standard deviation of the samples in an adaptation window to estimate D. We can then use any eigenvalue/eigenvector estimates we can think of as V and \Sigma.
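In code that could look like the following (again just a sketch; `window_samples` with shape `(n_draws, n_params)` is a stand-in for the draws from the adaptation window, and `V` and `sigma` come from whatever estimate we pick):

```python
# diagonal estimate D from the per-parameter std of the window draws
d = window_samples.std(axis=0)
inv_matvec, sqrt_matvec = make_matvecs(d, V, sigma)
```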
I am considering two sources for those: a regularized sample correlation and a regularized outer product of gradients. The sample correlation gives us the small eigenvalues, the outer product of gradients the large ones.
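For the gradient side, a sketch of how the large eigenpairs can be extracted iteratively without ever forming the n-by-n matrix, using scipy's matrix-free eigsh (`grads` is assumed to hold one gradient per row; the regularization from the next paragraph is left out here):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def grad_outer_eigs(grads, k=4):
    """Leading eigenpairs of the outer-product-of-gradients matrix
    grads.T @ grads / m, computed matrix-free."""
    m, n = grads.shape

    def matvec(v):
        # (G^T G / m) v, using only O(m * n) storage
        return grads.T @ (grads @ v) / m

    op = LinearOperator((n, n), matvec=matvec)
    eigvals, eigvecs = eigsh(op, k=k, which="LM")
    return eigvals, eigvecs
```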
We can use a Ledoit-Wolf estimate for both, but since we do not need the whole matrix, only a few eigenvalues, we can use an iterative algorithm. We do need all entries of the covariance matrix to find the Ledoit-Wolf regularization parameter, but we can compute those in batches, so that we never have to store the full matrix at any time. I use the implementation in sklearn.covariance for that.
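A sketch of how the block-wise shrinkage estimate combines with the matrix-free eigensolver (function name is mine; for the sample correlation one would standardize the draws by d first and ask for the small eigenvalues instead, e.g. with which="SA"):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh
from sklearn.covariance import ledoit_wolf_shrinkage

def shrunk_cov_eigs(samples, k=4, which="LM"):
    """Eigenpairs of the Ledoit-Wolf-regularized sample covariance
    (1 - s) * S + s * mu * I, without storing the n-by-n matrix S."""
    m, n = samples.shape
    X = samples - samples.mean(axis=0)
    # shrinkage intensity s, computed block-wise by scikit-learn
    s = ledoit_wolf_shrinkage(X, assume_centered=True, block_size=1000)
    mu = np.sum(X**2) / (m * n)  # trace(S) / n, the shrinkage target scale

    def matvec(v):
        # apply the shrunk covariance using only the samples
        return (1 - s) * (X.T @ (X @ v)) / m + s * mu * v

    op = LinearOperator((n, n), matvec=matvec)
    eigvals, eigvecs = eigsh(op, k=k, which=which)
    return eigvals, eigvecs
```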
I then combine the two sets of eigenvector/eigenvalue estimates as \exp(V_1 \log(\Sigma_1) V_1^T + V_2 \log(\Sigma_2) V_2^T) and again compute the eigenvalues/eigenvectors iteratively. (Just as I was writing this I noticed that I could also transform the gradients with the eigenvalue estimates from the samples first, and then find the remaining eigenvalues. That should be cleaner.)
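A sketch of that combination step with the same matrix-free approach (names are mine; V_1, s_1 would come from the sample correlation and V_2, s_2 from the gradients):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def combine_eigs(n, V1, s1, V2, s2, k=4):
    """Eigenpairs of exp(V1 log(S1) V1^T + V2 log(S2) V2^T),
    found from the log-matrix without forming it densely."""
    log_s1, log_s2 = np.log(s1), np.log(s2)

    def matvec(v):
        # apply the sum of the two low-rank log-matrices
        return V1 @ (log_s1 * (V1.T @ v)) + V2 @ (log_s2 * (V2.T @ v))

    op = LinearOperator((n, n), matvec=matvec)
    # log-eigenvalues far from 0 are the combined eigenvalues far from 1
    log_vals, vecs = eigsh(op, k=k, which="LM")
    return vecs, np.exp(log_vals)
```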
I compute new eigenvectors every 200 or so samples during tuning.
See here for an example model with ~2000 parameters that does not sample properly at all with a diagonal approximation, but seems pretty fine with covadapt.