Riemannian HMC Q

A while ago I hooked up some custom solvers via C++ into Stan using the cmdStan user headers and stored_gradient_varis.

Can I do the same sort of thing for the Riemannian manifold HMC? Basically if I provide an interface to my function that looks like:

fvar<fvar> function(fvar<fvar> params, …)

Can I just hook that up with the RMHMC stuff? Is the RMHMC available through the interfaces?

I guess that question is really just a symptom though. My actual problem is that the number of HMC steps NUTS wants me to take is really large (very often it hits the 1023 default cap) without finding a U-turn (each sample can easily take a couple minutes). I’ve looked a little at the traceplots, and it looks to me like NUTS is correct and it really needs all these steps to generate samples. It’s my understanding that the RMHMC generates way more effective samples per MCMC sample and I want to take advantage of that.

My models are kinda small (I’m happy to post one – they’re really simple). The data is just a handful of numbers and the models will likely always have somewhere between 5-25 parameters. Evaluating my likelihood is kindof slow though (hundreds of milliseconds).

I think I found analytical 3rd partial derivatives for my problem (eigenvalues of symmetric matrix), so I thought I could give this a go if it’s just a matter of me coding up partials and testing them.

Please post one. It’s always good having an example. Maybe there’s something a little simpler than trying to get RHMC working.

Awesome. This might be tough to implement. Not sure.

Here is a really simple model. I’ll just blabber a little. Read what you want ignore what you don’t want.

The idea (people been doing this since at least the 70s) is measuring the natural resonance modes of a lump of metal and trying to estimate the elastic constants.

I get really tight confidence intervals for N as low as 20. What makes me nervous about the inference we’re doing here is that y is really just a list of resonance modes from a single sample of metal. So it’s like we’re doing statistics with a single sample. But! The results we have are very consistent with the reference mechanical data we have (literature) and the measurements we make (Xray diffraction stuff), so I like it (and it works a loooot better than trying to do an optimization).

Now that I type this – I have an unresolved suspicion that if I didn’t model the crystal-sample misorientation (q) is the thing that’s causing this to be hard to estimate. That’s the unit_vector parameter. There’s a ton of symmetry in the system, so technically the posterior on that should be super-multimodal, but usually it isn’t too radical (maybe two modes max?). I need to test if this causes the sampling difficulties instead of just talking about it…

The Xtal-sample misorientation parameter (q) is the misorientation (as a quaternion) between the sample reference frame and the Xtal lattice. This is important for the resonance modes. Think of rocksalt, where your atoms are arranged in a cubic structure. The misorientation parameter means that cubic grid of atoms (xtal reference frame) doesn’t need to line up with the actual cubes of salt (the sample reference frame).

One of the big issues with this is I really don’t want to turn down the precision of the resonance mode calculations (lower precision calculations is one way I can do my inferences faster). I know this can make weird biases if the model is off (I’ve seen it). I think we got really lucky with our data/model in that it actually works really well, but I gotta keep P high (order of Rayleigh-Ritz approximation to eigenvalue problem) or I might accidentally be biasing my results (and Stan won’t be able to help me there).

// These are some externally defined functions that do the RUS forward model.
//   That is, they take in elastic constants and produce resonance modes
functions {
  vector mech_init(int P, real X, real Y, real Z, real density);
  matrix mech_rotate(matrix C, vector q);
  vector mech_rus(int P, int N, vector lookup, matrix C);
}

// Input data
data {
  int<lower = 1> P; // Order of polynomials for Rayleigh-Ritz approx
  int<lower = 1> L; // This is a function of P :(
  int<lower = 1> N; // Number of resonance modes

  // Sample known quantities
  real<lower = 0.0> X;
  real<lower = 0.0> Y;
  real<lower = 0.0> Z;

  real<lower = 0.0> density;

  // Resonance modes
  vector[N] y;
}

transformed data {
  vector[L * L * 3 * 3 + L * L + L * L * 3 * 3 * 21 + L * L * 3 * 3] lookup;

  lookup = mech_init(P, X, Y, Z, density);
}

// Parameters to estimate
parameters {
  real<lower = 0.0> c11;
  real<lower = 0.0> a;
  real<lower = 0.0> c44;
  real<lower = 0.0> sigma;
  unit_vector[4] q; // rotation between sample & xtal axes
}

// Build a 6x6 stiffness matrix and rotate it
transformed parameters {
  real c12;
  matrix[6, 6] C;
  
  c12 = -(c44 * 2.0 / a - c11);

  for (i in 1:6)
    for (j in 1:6)
      C[i, j] = 0.0;
        
  C[1, 1] = c11;
  C[2, 2] = c11;
  C[3, 3] = c11;
  C[4, 4] = c44;
  C[5, 5] = c44;
  C[6, 6] = c44;
  C[1, 2] = c12;
  C[1, 3] = c12;
  C[2, 3] = c12;
  C[3, 2] = c12;
  C[2, 1] = c12;
  C[3, 1] = c12;

  C = mech_rotate(C, q);
}

model {
  // Specify a prior on noise level. Units are khz, we expect ~100-300hz in a good fit
  sigma ~ normal(0, 2.0);

  // Specify soft priors on the elastic constants
  c11 ~ normal(1.0, 2.0);
  a ~ normal(1.0, 2.0);
  c44 ~ normal(1.0, 2.0);

  // Resonance modes are normally distributed around what you'd expect from an RUS calculation
  y ~ normal(mech_rus(P, N, lookup, C), sigma);
}

A shiny new quarter that your model isn’t identified in a subtle direction, so that the resonances look identified but the posterior itself is pathological and the dynamic HMC trajectories are forced to grow too large. RHMC may help a bit in this case, but any improvements shouldn’t be drastic.

A possibly related issue is that, because the rotation group has a different topology than the reals, any model with a completely free rotation will be difficult to fit in Stan directly (regardless of how the rotation is parameterized).

Shiny new quarter

Lol, Is there a wagering mod for Discourse?

identified in a subtle direction

Yeah, that’s a reasonable guess. I’ll go start some inversions today with simulated data so we can have some concrete samples to talk about (instead of me just going by what I think I remember). My previous attempts to visualize these rotations failed, but I do remember at least one weirdness with them.

regardless of how the rotation is parameterized

Well that’s depressing, haha. I used this stuff before and liked it: https://arxiv.org/abs/1301.6064 . You think there’s more hope in integrators like that?

I never worried too much about the orientations of this stuff when I was doing that because the traceplots for the orientations always looked super mixy. It was always the elastic parameters that struggled to move around. That isn’t to say there wasn’t a problem though.

As a follow up on this, I switched from using quaternions to this thing called cubochoric coordinates (http://dx.doi.org/10.1088/0965-0393/22/7/075013) and now the sampling seems to be behaving well. The treedepth in the model now bounces between like 5-6. Solutions check out with data.

The cubochoric thing amounts to an area preserving map from a cube of a certain side length to RO(3). There’s a neat little transform in that paper that is like a higher dimensional sorta Lambert equal area projection to go from spheres in R^n to surfaces of spheres in R^n+1 (so you don’t have to worry about the Jacobian adjustment). Marc de Graef (one of the authors on the paper who I got this from) is a material sciencer at CMU who’s been working through this orientation stuff a lot lately.

There’s still the topological issues, but I think cause of the symmetries of the system I’m working with there’ll always be solutions away from the boundaries. I honestly didn’t work through the differentiability of everything before I just wrote up the conversions I needed (rather, copied them from a colleague) and let Stan handle the rest.

Re: the identifiability, when I investigated what the chains were doing, there were sometimes big correlations between pairs of quaternions, and sometimes there were divergences clustered in suspicious places (like at the end of a thin ridge of correlated parameters). But it wasn’t super consistent run to run. I’m assuming the problem was just unidentifiability that comes from taking v = x / ||x|| where x ~ normal(0, 1) (leaving the length kindof unidentified)? I thought maybe changing the standard deviation parameter on x might affect this in some way but when I tried making it bigger or smaller nothing really changed.

tldr; I guess I’m not exactly sure why quaternions did not work and the cubochoric ones do, but the second behaves much better for my use case.

Anything like x / || x || is asking for trouble. Again it’s a manifestation of the tricky topology of rotation space and the difficulty in trying to embed it into real numbers.

1 Like

What’s || x || here? If it’s the length of a vector, then you can just declare x as unit_vector[K].

unit_vector being a bit of a hack that secretly introduces an additional variable and a “Jacobian” that’s really an auxiliary density to ensure that the non-identified direction is somewhat contained.

It’s all the same root issue – measures on spaces of different topologies can’t be mapped into each other exactly. In order to embed something you have to add additional variables and additional probabilistic structure to ensure that the embedded measure can be normalized.

The key point is that it gives us a uniform distribution by default over unit vectors.

I am not sure I understand what you mean in your second paragraph.

I wonder if you could give some idea how what you’ve said relates to the approach Diaconis et al. lay out in their paper “Sampling from a Manifold.” (It seems like a paper you might be familiar with; if not, no worries.)

They give a number of examples and a general procedure for sampling from a probability distribution on an m-dimensional embedded submanifold of R^n (n > m, of course) by sampling from an appropriate distribution on an unconstrained subset of R^m. In some of these cases, e.g. the torus, the topology of the submanifold is different from that of the unconstrained subset of R^m. They don’t seem to add variables, since the number of unconstrained variables, m, is the same as the dimensionality of the submanifold.

Diaconis et al consider the torus only for methods that generate independent samples, namely rejection sampling. The problem with a surface that is topologically distinct from R^{N} is that it can no longer be covered by a global coordinate system, so there will always be boundaries outside of which any given coordinate system will no longer be valid. For example, in a torus there will be the circle at theta = 0/theta = 2pi and phi = 0/phi = 2pi and for the sphere it will be at a pole. For independent sampling methods these defects can be ignored because they are measure zero. Markov chain Monte Carlo, however, cannot ignore them because the Markov chains have to pass through these defects in some principled fashion.

Note that Diaconis et al do discuss a Gibbs sampler on a scaled simplex, but that it isomorphic to R^{N} which is why we have a simplex type in Stan.

Long story short, Stan is currently written to assume distributions specified with densities over some version of R^{N}. Because we can’t map a topologically distinct space into R^{N} isomorphically we cannot implement distributions over those spaces without some kind of hack, such as adding auxiliary variables with an implicit prior that identifies the larger model as we do for the unit vector.

1 Like

Is the issue that the proposals wouldn’t be generated in a way that’s natural or optimal when you consider the geometry of the original space? Or are there more theoretical problems where the chain wouldn’t converge to the desired limiting distribution if, for example, Diaconis et al. had used MH instead of rejection sampling for the torus example.

Formally Hamiltonian Monte Carlo will work on any manifold, but the particular implementation in Stan assumes something isomorphic to R^{N}. We could generalize this for certain manifolds, but it it would be a significant implementation challenge to keep track of everything. There can be similar issues with other MCMC transitions, although the details are specific to each case.