Hi everyone :-)
I’m working on an algorithm for finding better approximations to the mass matrix, that still works in moderately high dimensional models. You can find an implementation for pymc3 here.
It seems to be quite similar in spirit to https://arxiv.org/pdf/1905.11916.pdf, but should be more scalable. It is based on the outer product of gradients instead of the hessian (although the basic idea should also work with that).
We write the mass matrix in the form C = D(1 + V\Sigma V^T - VV^T)D for diagonal \Sigma and D, and V\in R^{n, k} orthonormal. The inner matrix has the eigenvalue/eigenvector pairs V and \Sigma, and all other eigenvalues are 1. Given V, D and \Sigma we can compute the matrix vector products C^{-1}v and C^{1/2}v that we need for HMC without storing the full n by n matrix, just by modifying the eigenvalues \Sigma and the diagonal D.
I use the std of samples in an adaptation window to estimate D. We can now use any eigenvalue/eigenvector estimates we can think of as V and \Sigma.
I am considering two sources for those: A regularized sample correlation and regularized outer product of gradients. The sample correlation gives us small eigenvalues, the outer product of gradients large ones.
We can use a ledoit-wolf estimate for both, but since we do not need the whole matrix, but only large eigenvalues, we can use an iterative algorithm. We do need to compute all entries of the covariance matrix to find the ledoit-wolf regularization parameter, but we can compute those in batches, so that we never need to store the full matrix at any time. I use the implementation in sklearn.decomposition for that. I then combine the two sets of eigenvectors as \exp(V_1\log(\Sigma_1)V_1^T + V_2\log(\Sigma_2)V_2^T) and again compute eigenvalues/vectors iteratively. (Just when I was writing that I noticed that I could also just transform the gradients with the eigvals of the samples, and then find remaining eigenvalues. That should be cleaner.)
I compute new eigenvectors every 200 or so samples during tuning.

See here for an example model with ~2000 parameters that does not sample properly at all with a diagonal approximation, but seems pretty fine with covadapt.

3 Likes

Cool, thanks for posting!

Yeah, the full matrix is a problem for hierarchical models where the log density and gradients can be cheaper to evaluate than the matrix-vector multiplies.

Yeah I agree that this makes sense.

Do you want to code this up in Stan? I’m happy to point you to where the code goes.

I have been slow about getting the adaptation stuff in, but there was still the issue of large hierarchical models which I hadn’t addressed, so I’m glad you’re working on this.

One of the things in that paper you linked is a little heuristic for picking between a metrics. I still want to get that in, but if this way of getting a dense metric leapfrogs what we were doing with Hessians, I have no problems with that :D.

First thing I’d like to see is a direct comparison to what’s in Stan, the Hessian stuff Aki and I did, and your new stuff. We’ll still probably have to do some work separate from what was in the paper to convince ourselves it’s super robust and all.

The code from that paper is all in a repo on Github along with the example models. Are you interested in adding this? If so I’ll link yah (there should be directions at the very end on getting it but I can give you better directions on how things work).

Is this the right code?

Writing that up in stan sounds like fun.
I only have this week until I go traveling for some weeks, and I don’t know if I’ll have enough time to get this done before. But I guess we will see. :-)

Before getting this into any of stan or pymc I’d want to get some answers to those questions (in addition to the performance comparison to the hessian/dense/diag implementations):

• For mostly normal posteriors, how does the estimate for the eigenvectors deteriorate when the number of parameters increases? Just looking at the eigenvalue estimates from different chains it seems that there is a lot of noise when the problems get big. I guess there is also a lot of room for improvement here: The covariance usually has a block-like structure, and the inverse seems to be more sparse. Understanding those structures better would probably also tell us more about how we should regularize and represent those matrices in higher dimensions. (eg this or the introduction of this)
• How does this deal with posteriors that are not even close to normal? Are there problems that work fine with the sample-based diag mass matrix but not with the gradients? And if both do not work, maybe they fail in different ways: if there is a funnel in the posterior, we currently just get small step-sizes and divergences. But with gradients that grow without bounds when you go further down a funnel, how do the eigenvalues of that inner product behave? And for the hessian, some of the eigenvalues could be negative. What would we do with those? I like the idea of automatically choosing a metric, but it would be a shame if that means that some problems suddenly sample much worse than before. The automatic choice depends on the assumption that the posterior is normal after all, right?
• What diagonal approximation should we use? Right now I just use the samples for that, and completely ignore the gradients. But why not use only the gradients instead? Or the geometric mean of the two? I played a bit with this, and it seems that the “diag of sample covariance” workes much better, but I’m not really sure why. Maybe the structure of the matrices is important here again.

I see you wrote a custom lanczos implementation. I have to admit I don’t know that much about this, but as far as I remember there are quite a few numerical issues to deal with there. Are you sure about that implementation? In the pymc code I used arpack though scipy. Would it be feasible to depend on arpack or something similar? I guess that might make distributing stan more difficult, so maybe it is worth looking for a template C++ lib or go with a custom implementation as you did. I’d want to do a fair bit or reading on that topic if we go for a custom implementation.

I agree that block-like structures could be useful, but it seems we can already benefit a lot from a selection criterion and low rank matrices, so I would implement those first.

We have several not even close to normal examples including funnels in our paper. Works better or equally well as the sample-based. Again the selection criterion is important.

Section 3.1 in our paper.

The plan is to use Lanczos from KINSOL which is going to be used also to provide Newton solver for Stan.

I agree that block-like structures could be useful, but it seems we can already benefit a lot from a selection criterion and low rank matrices, so I would implement those first.

I didn’t mean to suggest that we should wait with the eigenvalue based adaptation on that. :-)

We have several not even close to normal examples including funnels in our paper. Works better or equally well as the sample-based. Again the selection criterion is important.

Those examples look reasonable to me, yes. I didn’t go through all of those with the gradient-based approach yet. I only just read the code of the selection criterion, and that there is a train/test split. That sounds like it might help a lot with non-normal posteriors.

That is only about the selection of adaptation, right? But in the code looks like you just take the abs of the eigenvalues. I don’t fully understand what effect that has, but it is similar to the softabs metric I guess.

The plan is to use Lanczos from KINSOL which is going to be used also to provide Newton solver for Stan.

The sundials KINSOL? I didn’t see eigenvalue functions, unless you count the newton solver.

I do not know. What kind of model are you seeing the large variance in eigenvalue estimates? I guess it makes sense somewhat.

I spent maybe a day messing with using the sparsity pattern from the Hessian. Things didn’t work for me. Not to say they wouldn’t work for someone else – that’s just the extent to which I experimented with that. So go on! Explore!

Could always use more experimentation. From what I remember multi-chain Rhat and Neff estimates are effective at identifying situations where things failed. Those won’t necessarily detect the funnel, but the linear transform isn’t going to fix a funnel so hopefully you’d still get divergences.

When things don’t have second derivatives it can mess up pure Hessian-based things (the Prophet model we tested in the paper doesn’t always have all its second derivatives). But if it’s based on the gradients? I dunno what happens give the second derivatives actually aren’t there at some points.

A linear transform isn’t going to fix the funnel, so we expect this to fail (or the eigenvalues to blow up to answer your question). Hopefully the failure is not more dramatic than the sample-based adaptation. It’s possible that watching the value of the highest eigenvalue could be useful for detecting funnel situations.

Assuming positive eigenvalues are good things here (I get confused about the signs), then I ended up only working with the largest few positive eigenvalues. If eigenvalue estimates were to go negative in the few we’re using, we should probably just bail on the adaptation.

Yeah, and we should identify those problems. If this makes something break that wasn’t broken before, that’s bad, and we should understand the problem and try to fix it.

I think the automatic choice depends more on the transformed system being kinda normal (which I guess since we’re doing linear transformations they’re the same)? I’d expect it to blow up in similar ways for dense and diagonal if a linear transformation isn’t appropriate at all. The automatic selection has a few different things mixed in. Not sure how to phrase the assumptions.

You can put curvature stuff in there. It works. The diagonal mass matrix converges really quickly though.

I’m sure it’s a questionable implementation :D. For instance, I’m computing the eigenvalues of the internal tri-diagonal with Eigen’s full eigenvalue solver: https://github.com/bbbales2/stan/blob/7d99729942ee6360a1d04ab8bd7a4a1ebe5cd949/src/stan/mcmc/covar_experimental_adaptation.hpp#L136

In my understanding (mostly just pulling stuff from netlib – which includes the arpack docs), the difficult part of the Lanczos algorithm is making sure your iteratively updated vector (v_j in the Wikipedia notation – https://en.wikipedia.org/wiki/Lanczos_algorithm#The_algorithm) stays orthogonal to the matrix V.

This is difficult for the regular sparse matrix folks cause their orthogonalization is expensive compared to their matrix vectors. I’m just assuming our matrix vector multiplies are somewhat more expensive than the ones arpack people deal with (since we’re getting them through autodiff). I just reorthogonalize every step (inefficiently, too it appears): https://github.com/bbbales2/stan/blob/7d99729942ee6360a1d04ab8bd7a4a1ebe5cd949/src/stan/mcmc/covar_experimental_adaptation.hpp#L124

I’m also not super concerned. If things don’t converge in a fixed number of steps, we should be ready to just switch to using the sample based adaptation anyway for robustness.

Or that hahaha.

You forgot to normalize by the number of answers. Or was that a sneaky attempt to make your post more important than Aki’s? :D

What kind of model are you seeing the large variance in eigenvalue estimates? I guess it makes sense somewhat.

The large variance was in the example with the RNAseq data I posted above. But a standard normal can be trouble already. :-)

I’ve been thinking that it might make sense to change the selection criterion a bit, so that it works better in high dimensions. The problem I see with the one in the paper (just the sqrt(λ_max / λ_min) for simplicity) is that it involves estimating extreme eigenvalues. That is hard in high dimensions though: Let’s say we have a n-dim std normal.
If we just used the empirical correlation C instead of a ledoit-wolf estimate, with m samples and n dimension we get \sum \lambda_i = tr(C) = n, but n - m of those are 0, so \lambda_\text{max} \geq \tfrac{n}{m}, even though the true largest eigenvalue is 1.
This gets somewhat better with the ledoit-wolf estimate, for n=10,000 and m=200 I only get \lambda_{max} \approx 1.6 (compared to ~60 for the empirical).

We could avoid that trouble if we change a bit what the selection criterion is testing. Right now it tries to answer: “Is the system well-conditioned after our transformation?”. This is hard. We could instead just ask “For this particular transformation, is the large variance that we estimated along those directions really there?”. So for a single eigenvector v and test covariance C it could just check if v^TCv \approx \lambda.

So instead of trying to figure out if the transformation fixed all problems (which we don’t know), we could check if it holds what it promised, namely a large (small) variance along a particular direction.

Small illustration with some code: https://github.com/aseyboldt/covadapt/blob/master/examples/metric-selection-criterion.ipynb

I spent maybe a day messing with using the sparsity pattern from the Hessian. Things didn’t work for me. Not to say they wouldn’t work for someone else – that’s just the extent to which I experimented with that. So go on! Explore!

I tried for two days with the same result…

1 Like

Well it was intended the other way, two parts Aki one part me haha.

I had a look at the Ledoit-wolf thing. I don’t really understand how it works or the advantages though honestly.

Is it allowing us to estimate those large eigenvalues more quickly? Based on my experiments I am hesitant of any regularization. I ended up liking the sample covariance matrices a lot.

I’m not in on this in this situation. The advantage of the full mass matrix is that it can rotate and scale things so the sampler works better. If we don’t allow that rotation, then it seems like we’re getting rid of one of our major advantages?

Hahaha, it’s so tempting.

I think these are the easiest things to estimate to do what we want to do though. I could be wrong though. Maybe explain this Ledoit-Wolf thing more and I’ll change my tune.

1 Like

This seems to be the source of the confusion. For n>m they really are not easy to estimate (n dimension, m number of samples). That is what I was trying to show with the \lambda_{max} \geq \tfrac{n}{m} thing. Without regularization of some kind the eigenvector and eigenvalue estimates completely break down if n/m grows. Just take some samples from N(0, diag(2, 1, 1...)), so that the true eigenvalues are 2 and 1, and the largest eigenvector is e_1. The estimates from the sample covariance look completely different:

from scipy import linalg
import numpy as np
import matplotlib.pyplot as plt

n = 1_000
m = 100

diag = np.ones(n)
diag[0] = 2
samples = np.random.randn(m, n)
samples[0, :] *= np.sqrt(diag)

vals, vecs = linalg.eigh(np.cov(samples.T, bias=True))

#_, vals_svd, vecs = linalg.svd(samples.T, full_matrices=False)
#vals_svd = vals_svd ** 2 / m

plt.plot(vals[::-1], label='estimate')
#plt.plot(vals_svd, label='estimate')
plt.plot(diag, label='true values')
plt.xlabel('rank of eigenvalue')
plt.ylabel('eigenvalue')
plt.legend();


And the estimated eigenvector is mostly orthogonal to the true eigenvector:

true_eigvec = np.zeros(n)
true_eigvec[0] = 1
vecs[:, -1] @ true_eigvec
0.024289693893986514


Some form of regularization is essential.
I didn’t really go for ledoit-wolf because I necessarily think it’s the best, but because I got that working without too much trouble. Other options I can think of would be (there is quite some literature about each of those):

• Sparse eigenvectors. Turn the eigenvector estimation from \max_{|x|_2 = 1} x^TCx into something like \max_{|x|_2 = 1} x^TCx - \gamma |x|_0.
• Add a l1 regularization term on the off-diagonal entries of C^-1, and then compute the resulting eigenvalues of C. I’m not sure if we can do this one without storing the whole matrix though.

I don’t want to change the transformation itself, only the way of selection the metric.

1 Like

Oh okay, I repeated your calc and get the same results.

Yeah, I think you’re right. I guess I was thinking in the context m >= n. It’s definitely not true that the full covariance is useful with just a few samples, and it’s only the diagonal that becomes useful really quick (and that’s cause it’s basically super regularized like what you’re talking about).

You’re not saying that this implies we should change the adaptation itself (the L in eq. 13 of the paper), but we should regularize our estimate of \lambda_{max}(L^{-1} \Sigma L^{-T}) because we’re probably getting something that’s super wonky?

That makes sense. And I guess it makes sense to regularize towards the identity there too.

The bit about fixing the direction though – I don’t agree with that part still. Wouldn’t that require some sorta model/data specific knowledge about which direction to check?

1 Like

I don’t want to change the transformation itself, only the way of selection the metric.

Sorry, that was imprecise…

As I see it we have two different steps to consider here:

1. Get estimates for eigenvectors of the largest eigenvalues using training samples.
2. Make a decision if those eigenvectors are worth using, based on the test samples.

I think both need to be changed from the version of the paper for large n.
About the first: Using the largest eigenvalues from the empirical covariance really won’t work here I think. In covadapt I use the largest eigenvalues of the ledoit-wolf estimate, after rescaling with the diagonal. I think there are ways to improve that, but for now I’m reasonably happy with this.
About the second: Again, using the largest eigenvalue of the empirical will break, so we need an alternative. We could just transform the test samples using the eigenvector/value pair, and use some regularized estimate for the largest eigenvalue. The point of the notebook I posted above was to show that this won’t work particularly well either, just because this is the same hard problem we were trying to figure out in the first step. And I think we can use a shortcut here, because we already dealt with the part that requires the regularization: figuring out what directions are interesting. So we could just use the directions different estimators picked, and check how much variance there is along those directions in the test data.

So something like the this:

# After rescaling with diagonal
train_samples, test_samples = split_samples(samples)

# Step 1, get some eigenvector estimates. ie find interesting directions
eigvecs = []
eigvals = []
estimators = [
ledoit_wolf_samples,
empirical_samples,
hessian,
]
for func in estimators:
eigvecs.extend(vecs)
eigvals.extend(vals)

# Find eigenvalues in span(eigvecs) of the empirical covariance of test_samples.
# We are in a tiny subspace now, so $m >> n$ and we can happily use the
# empirical covariance.
eigvecs, eigvals = eigvalsh_restricted(eigvecs, test_samples)

# Step 2: Filter out eigenvalues that do not hold up
# in the grads, or that are too close to 1, so not worth it.
filtered_vecs = []
filtered_vals = []
for vec, val in zip(eigvecs, eigvals):
if np.abs(np.log(val)) < cutoff:
continue

# Maybe this second test is to strict or just unnecessary?
# sample_var * vec should be ~ chi2(df=n_test)
lower, upper = stats.ch2(df=len(samples_test), scale=vec).interval(0.95)
continue

filtered_vecs.append(vec)
filtered_vals.append(val)
return Metric(filtered_vecs, filtered_vals)


This is still similar to the original criterion in some sense: Just instead of trying to figure out the max and min eigenvalues of the transformed test samples in the whole space, we try to figure out the max/min eigenvalues of the transformed test samples in the subspace defined by the eigenvectors. So exp(cutoff) tells us what λ_max / λ_min we are still willing to accept.

So you’re saying that the way I’m doing the test/train split my eigenvalue estimates for the test part are probably just always gonna be kinda bad. Which I believe.

And then one way we might fix this problem is by using eigenvectors from the train set (of which a lot more data goes in) as directions in which to compute variances of the test set?

As a side note since it’s being brought up, I guess I didn’t really put much thought in the way the test/train split happens here. There are probably other options or ways to do it?

And just to be clear in the implementation as is \lambda_{max} is coming from Hessians or gradients? And we’re talking about estimating 1 / \lambda_{min} here by looking at the largest eigenvalue of something like a sample covariance?

Also apologies, but what does the @ sign mean in Python? It’s been a while since I Python’d. Thanks for your patience :D.

Also I just got up pull reqs. for the automatic switching code. Considering you’re pointing out flaws with what I’m doing, if you want to get involved in testing/developing this I’m all for it:

The cmdstan pull is here: https://github.com/stan-dev/cmdstan/pull/729

The stan pull is here: https://github.com/stan-dev/stan/pull/2815

@ is matrix multiplication.
I always looked at the eigenvalues as in the covariance. So large eigenvalues of the hessian or the grads would be inverted before being considered as small eigenvalues during testing. I guess your convention is the other way round?

I think for large n the estimates for the max eigenvalues will always be somewhat bad, no matter how you do the test/train split. There just are just too many directions to choose from. That is why I’d like to lower the expectations for large n a bit, and just check that in some subspace, that we constructed as best as we could to not be orthogonal to the maximum and minimum eigenspaces, we can get significant improvements. For small n that should turn out to be pretty much (or if we are careful even exactly) the same thing, but for large n I’d hope that it works way better. It should prevent that even real improvements get rejected just because the signal is drowning in the noise.

Yeah I’m thinking of the eigenvalues in the ratio as being the condition number of the Hessian. So the large one there comes from curvature calculations. The small curvatures aren’t reliable, so I replace the small curvature eigenvalue with the inverse of the biggest covariance eigenvalue.

And I’m interpreting what you said before as maybe with this Ledoit-Wolf estimate you’d trust the eigenvectors from the train split?

I don’t trust them as an estimate of what the largest eigenvalue of the posterior covariance is. So I wouldn’t say something like: “We found the largest and smallest eigenvalues, therefore the step-size will increase by the factor of whatever.”. But I think they should be good enough to use, especially, if there really is large (small) variance in this direction in the test dataset. We just can’t accurately predict how much improvement there will be.

So if we tested with the test data that along the direction v there is a variance of 50, then I’d be happy to transform it to get rid of that variance, even if there might be some other direction that has a variance of 60. Especially, if we have reason to assume that those two directions are not orthogonal, so that rescaling the first will also help with the second.

I agree.

Alright lemme write it out and tell me what you think. So we have covariances S_{train} and S_{test}.

The steps would be to get regularized estimates of the eigenvalue/eigenvectors of S_{train}:

V E V^T \approx S_{train}

And then we take a subset of those eigenvectors – maybe the ones corresponding to the top 4 or 5 eigenvalues (call these \hat{V}), project our test covariance in those directions, and then compute the eigenvalues of that.

So like since:

S_{test} = \hat{X}_{test} \hat{X}_{test} / (N - 1)

where \hat{X}_{test} is the centered data (and there are N samples). then that looks like:

\lambda_{max}(L^{-1} \hat{V} \hat{V}^T S_{test} \hat{V}^T \hat{V} L^{-T})

And then I think if you group the terms correctly you can take advantage of the the fact that we have more parameters than samples here to be efficient?

I didn’t really put much thought into how the test/train splits here work. So that should probably be brought into question with this. The goal of the split was to make it so that things based on the sample covariance weren’t always labeled the best or whatever.

That sounds about right, this isn’t quite right I think however. Where does the L come from?

I’ll start with standardized samples X here, so we take care of the diagonal approximation seperately. First, we get some eigenvector estimates V from wherever (hessian, empirical eigenvalues, empirical eigenvalues of grads, ledoit-wolf estimates…).
They won’t be orthogonal in general, so we need to do a QR or SVD decomposition to take care of that. The result is an orthonormal \hat V \in R^{n, k}, where k \ll n. We project our testing samples onto the subspace defined by \hat V. So \hat X_\text{test} = \hat{V}^T X_\text{test}. We find eigenvectors \hat U and eigenvalues \Sigma of \hat X^T \hat X/n, and transform back to the original space: U = \hat V \hat U. After filtering out eigenvalues close to 1 (this is pretty much the selection criterion, if all eigenvalues are close to 1, we just end up with the diagonal approximation) we get our final mass matrix estimate: D^{1/2}(I_n + U\Sigma U^T - UU^T)D^{1/2}. If U has shape (n, 0), this is just the diagonal matrix.

I wrote up a testing implementation for simulated data:

I pushed this new adaptation train-test split to covadapt. Let’s see how it does on a few models. :-)
I’ll be happy to work on a stan implementation and help with the PR, but next week I’ll be mostly away from the internet, so it’ll have to wait a bit.

1 Like