Estimating good mass matrices

We’re trying to get our heads round how to estimate a good mass matrix for an SMC variant of Stan. We are therefore trying to understand how Stan currently identifies the mass matrix to use.

It appears that Stan currently estimates the covariance of the samples (modulo restricting that matrix to be diagonal or dense, etc.) and uses this estimated covariance as the inverse mass matrix. That seems sensible enough.
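For concreteness, here is a minimal NumPy sketch of that kind of estimate for the diagonal case. The shrinkage weights below are an assumption loosely mimicking the regularization Stan applies during adaptation, not a quote of its source:

```python
import numpy as np

def diag_inv_mass_matrix(draws):
    """Estimate a diagonal inverse mass matrix from (warmup) draws.

    draws: (S, D) array of posterior draws.
    The shrinkage toward a small constant mimics the regularization
    Stan applies to its variance estimate; treat the exact weights
    here as an assumption, not a quote of Stan's source.
    """
    S, _ = draws.shape
    var = np.var(draws, axis=0, ddof=1)
    w = S / (S + 5.0)
    return w * var + (1.0 - w) * 1e-3  # diagonal of the inverse metric

rng = np.random.default_rng(0)
draws = rng.normal(scale=[1.0, 3.0], size=(2000, 2))
inv_mass = diag_inv_mass_matrix(draws)
# inv_mass roughly recovers the per-dimension variances (1, 9)
```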

However, we had wondered whether it would be better to use the evaluations of the gradient of the log posterior at the sample points (which have already been calculated) to estimate the Fisher information matrix (as the expectation of the gradient multiplied by the gradient transpose). Is there a reason why that would be a (good or) bad idea? I sense that there may be an issue that we haven’t spotted and I’d like to be aware of the quagmire before we walk into it!
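The proposed estimator is just the average outer product of the stored gradients. A toy sketch, using a zero-mean Gaussian where the answer is known in closed form (my illustrative choice):

```python
import numpy as np

def fisher_from_grads(grads):
    """Estimate the Fisher information as the average outer product
    of log-posterior gradients evaluated at posterior draws.

    grads: (S, D) array of grad log p evaluated at the S draws.
    """
    return grads.T @ grads / grads.shape[0]  # (D, D)

# Toy check: for a zero-mean Gaussian with covariance Sigma,
# grad log p(x) = -Sigma^{-1} x, so E[g g^T] = Sigma^{-1}.
rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
x = rng.multivariate_normal(np.zeros(2), Sigma, size=50_000)
grads = -x @ Sigma_inv        # Sigma_inv is symmetric
F = fisher_from_grads(grads)
# F approximates Sigma_inv; its inverse is the covariance that Stan
# would otherwise estimate directly from the draws.
```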



It’s a good idea. See @bbbales2’s experiments in Selecting the Metric in Hamiltonian Monte Carlo.
There has also been further discussion in Discourse threads.
It’s just slow to test it thoroughly and add the necessary pieces to Stan.



Equation (11) in that arXiv paper appears to me to consider a (sample-based) approximation to an expectation of a numerical approximation to the second derivative. That’s two approximations.

I think we can get rid of one of them by exploiting the identity that the Fisher information is (both the negative expected Hessian of the log posterior and) the expected outer product of the gradient with itself (ie we don’t need the numerical approximation of the second derivative). If that’s not a silly idea, we can run some tests and see how these approaches compare.
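That identity is easy to check numerically for a normalized 1-D density. A small quadrature sketch (the toy non-Gaussian density is my choice for illustration):

```python
import numpy as np

# Numerical check of the identity for a normalized 1-D density:
#   E_p[(d/dx log p)^2] = -E_p[d^2/dx^2 log p]
# Here p(x) is proportional to exp(-x^4 / 4), a toy non-Gaussian choice.
x = np.linspace(-6.0, 6.0, 200_001)
dx = x[1] - x[0]
w = np.exp(-x**4 / 4.0)
p = w / (np.sum(w) * dx)           # normalize on the grid
g = -x**3                          # d/dx log p
h = -3.0 * x**2                    # d^2/dx^2 log p
E_g2 = np.sum(p * g**2) * dx       # expected squared gradient
E_negH = -np.sum(p * h) * dx       # negative expected second derivative
# E_g2 and E_negH agree: both equal the Fisher information of p
```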


Yeah, don’t see why you couldn’t.

My conclusion from playing with curvatures and samples was that, in the long term, if you can estimate the posterior covariance using MCMC samples, do that.

If you don’t have that, the few highest-curvature eigenvalue/eigenvector pairs were useful.

It was not always useful to use all the eigenvalue/eigenvector information. The thing that convinced me of this was plots like Figure 1. If you take a draw and plot the log density in the direction of the eigenvectors that correspond to the big curvatures, stuff looks quadratic. If you do it elsewhere, you see all sorts of funky stuff.
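To make the "use only the top curvature directions" idea concrete, here is a hedged NumPy sketch. The function name and the identity fallback for the remaining directions are my assumptions, not the exact scheme from the experiments or pull request:

```python
import numpy as np

def lowrank_metric_inv(hess, k=1):
    """Build an inverse metric from only the k largest-curvature
    eigenpairs of a (negative) Hessian, leaving the remaining
    directions at identity scale. Illustrative sketch only.
    """
    # eigendecompose the curvature matrix (assumed symmetric PSD here)
    vals, vecs = np.linalg.eigh(hess)
    order = np.argsort(vals)[::-1]       # largest curvature first
    vals, vecs = vals[order], vecs[:, order]
    inv_metric = np.eye(hess.shape[0])
    for i in range(k):
        v = vecs[:, [i]]
        # replace the unit scale along v with 1 / curvature
        inv_metric += (1.0 / vals[i] - 1.0) * (v @ v.T)
    return inv_metric

H = np.diag([100.0, 1.0, 1.0])    # one stiff direction
Minv = lowrank_metric_inv(H, k=1)
# Minv shrinks the stiff direction to 0.01 and leaves the rest at 1
```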

I did not have much luck playing with expectations of Hessians. I forget if that is because they just weren’t better than anything else, or what. In retrospect, I think I path-forked away from this idea too early in the project to actually say for sure if it was a bad idea or not.

Anyway I like the idea of using the gradient information you’re computing along the way to try to do Hessian stuff.

Like Aki said, development on this has been stalled for a while. There’s a Stan out there with everything somewhere, but I also had a pull request open with just the metric-switching thing, which I liked a lot:

Thinking about it again makes me want to move it again, but also that pull request might be mildly useful in terms of rolling your own thing.


This can probably be explained by how much information the gradients provide. Let’s say we have S target-function evaluations. We can approximate gradients with finite differences, so each gradient corresponds to having D (the number of dimensions) extra target evaluations very close to the original S draws. How much information do these finite-difference evaluations have compared to having (D+1)*S draws spread around? Yes, if you can get the gradients for free you get some free information, but in high dimensions it’s likely that the gradients (via the finite-difference analogy) don’t have much useful information.
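The analogy can be made concrete: a forward-difference gradient costs D + 1 target evaluations, all clustered within eps of the same point. A small sketch (function and names are illustrative):

```python
import numpy as np

def fd_gradient(f, x, eps=1e-6):
    """Forward-difference gradient: D + 1 evaluations of f, all
    within eps of x -- the analogy in the text above.
    """
    fx = f(x)
    g = np.empty_like(x)
    for d in range(x.size):
        xe = x.copy()
        xe[d] += eps
        g[d] = (f(xe) - fx) / eps
    return g

f = lambda x: -0.5 * np.sum(x**2)   # standard normal log density
g = fd_gradient(f, np.array([1.0, -2.0, 0.5]))
# g is close to the exact gradient -x
```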

What you say makes sense, but then I wonder: doesn’t the performance of NUTS/HMC tell us that gradients do have lots of useful information in them? Can you reconcile those points of view?

I think of curvature as short-range information and samples as long-range information, and the curvature stuff only does well as long as a normal approximation does well, which is often not that well.

That control variates stuff you were looking at is probably a better way to think about how to plug together gradients and values though.

@bbbales2: I think the thing I struggle to have intuition about is that the function we are exploring is a pdf. That means that the expectation of the gradient of the log density is zero (assuming that the parameter inhabits an unconstrained space). That’s the cunning trick that the control variates (which we must get back to integrating into Stan) exploit, and my sense (which seems to differ from @avehtari’s intuition) is that this gives the gradients some “global awareness”, which seems important to me. I think that’s just a hunch, though.
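A small sketch of the zero-expected-score property and the control-variate trick it enables. The example density and the coefficient fit are my illustrative choices, not the scheme from the control-variates work being referenced:

```python
import numpy as np

# On an unconstrained space, E_p[grad log p(theta)] = 0. A control
# variate exploits this: subtract a fitted multiple of the score from
# the quantity being averaged, without changing its expectation.
rng = np.random.default_rng(2)
theta = rng.normal(size=100_000)    # draws from p = N(0, 1)
score = -theta                      # grad log p(theta)

f = theta + theta**2                # estimate E[f] = 1
c = np.cov(f, score)[0, 1] / np.var(score, ddof=1)
f_cv = f - c * score                # score has known mean 0
# f_cv has the same expectation as f but lower variance
```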

In any case, we are taking a look at this, running some comparisons and trying to generate some results that will help us understand what works well when. Watch this space!
