I’ve been thinking that we could do continuous adaptation
of a diagonal mass matrix at not much more cost than
we pay now (one divide per parameter per iteration).
We can use a decaying version of Welford’s algorithm to
keep an on-the-fly (co-)variance estimate. We could
even set the decay function up to be a smooth version of
what we have now — putting most of the weight on the
last half of the transitions. I’m pretty sure we could
set it up so that we get a proper regularization term,
or the regularization could come in only at initialization
in the form of pseudo-data, such as 20 observations with
unit variance, i.e. a starting estimate of diag(1).
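
To make the idea concrete, here is a rough sketch of a decaying Welford accumulator for the diagonal case. It is only an illustration, not what any sampler currently does: the class name, the constant `decay` factor, and the `prior_count` / `prior_var` arguments are all made up for this example. The regularization is treated as pseudo-data by starting the accumulator at weight 20 with unit variance. A smooth schedule that favors the later transitions would replace the constant `decay` with a function of the iteration number.

```python
class DecayingDiagVariance:
    """Decaying Welford accumulator for per-parameter variances (sketch)."""

    def __init__(self, n_params, decay=0.98, prior_count=20.0, prior_var=1.0):
        # decay in (0, 1]: every update down-weights all earlier draws by this factor.
        # decay == 1.0 and prior_count == 0.0 recovers plain Welford.
        self.decay = decay
        # Regularization as pseudo-data (assumption): behave as if we had
        # already seen prior_count draws with mean 0 and variance prior_var.
        self.weight = prior_count
        self.mean = [0.0] * n_params
        self.m2 = [prior_count * prior_var] * n_params

    def update(self, draw):
        # Shrink the old total weight, then add the new draw with weight 1.
        self.weight = self.decay * self.weight + 1.0
        for i, x in enumerate(draw):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.weight
            # Weighted sum of squared deviations, decayed like the weight.
            self.m2[i] = self.decay * self.m2[i] + delta * (x - self.mean[i])

    def variance(self):
        # Weighted (population-style) variance estimate per parameter.
        return [m / self.weight for m in self.m2]


est = DecayingDiagVariance(n_params=2)
for draw in ([0.3, -1.2], [0.1, 0.8], [-0.4, 2.1]):
    est.update(draw)
print(est.variance())
```

Because the pseudo-observations are decayed along with the real draws, their influence fades on its own as adaptation proceeds, so no separate shrinkage step is needed in this version.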
The problem with a dense mass matrix is the inversion of
the estimated covariance matrix — that’s O(N^3) where N
is the number of parameters. For a diagonal matrix it’s
O(N), with a single divide and assign operation per parameter.
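
As a rough illustration of the cost gap, the sketch below puts the diagonal update next to a textbook Cholesky factorization, which is one common first step for inverting a dense covariance estimate. The function names are hypothetical, and it assumes the convention implied above, that the mass matrix is the inverse of the estimated covariance.

```python
import math


def diag_mass_from_variance(var):
    # Diagonal case: one divide and one assignment per parameter, O(N).
    return [1.0 / v for v in var]


def cholesky_lower(cov):
    # Dense case: textbook Cholesky factorization of a symmetric
    # positive-definite matrix. The inner sum inside the double loop
    # is what makes the total work O(N^3).
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


print(diag_mass_from_variance([4.0, 0.25]))
print(cholesky_lower([[4.0, 2.0], [2.0, 3.0]]))
```

A real implementation would call a linear algebra library for the dense case, but the asymptotic cost is the same.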
If stepsize adaptation is relatively fast, we could then
update the mass matrix more often.
I’d think continuous updating like this would be more stable
than adapting in blocks.