New Transform for Orthonormal Matrices in Stan


#1

Hi All,

I’ve been working on a way to get orthonormal matrix parameters in Stan for models like probabilistic PCA and @bgoodri and @Bob_Carpenter told me to post it here to see if there’s any interest and spark discussion. Here is the paper and the Stan code.

Overview
As some you may know there are custom HMC methods by Brubaker as well as Byrne and Girolami that allows you to have models with orthonormal matrix parameters. I was using a custom implementation of Byrne’s method, but I wanted a way to do it in Stan so I wouldn’t have to write my own likelihoods and I can use Stan’s NUTS implementation. Not to mention we wanted to also use Stan’s VI.

To accomplish this I came up with a transform that describes any orthonormal matrix in terms of a set of rotations, and hence parameterizes them in an unconstrained space. I then figured out how to do the associated change of volume measure for this re-parameterization and we coded it up in Stan.

The first test was comparing Stan draws from a uniform posterior over the Stiefel Manifold to uniform draws via the normal QR trick. After that we build PPCA and some more complicated models including hierarchical PPCA.

Issue with reflections and charts
I want to point out there is a reflection non-identifiability when you’re inferring a basis for PCA. We handled this by limiting the angle of rotation to 180 degrees instead of the full 360 for some the angles. This also is part of the reason it’s not such a big deal in practice that the circle-like topology is being transferred to an interval. This is described in the paper. I also want to point out that if you posterior is on the edge of one of rotation angle boundary you will have a multi-modal posterior, but you can just re-orient the chart in that case to go from [-pi/2,pi/2) instead of [0,pi), but yes this has to be done manually as of now.

Scaling with matrix size
To compute the change of measure for an n x p orthonormal matrix you have to calculate the determinant of a matrix that has np-p(p+1) rows and columns, so be wary of this. We started using VI instead of HMC on our bigger problems.


#2

This method does not fully parameterize the space of orthogonal matrices. It instead defines a chart over a Steifel manifold that will necessarily always have a slice of parameters that it cannot describe (i.e. the chart is not isomorphic to the entire space). All of the issues with reflections and reorientations are consequences of this.

It’s the same story that occasionally pops up about circular parameters – if you try to parameterize them as [0, 2pi] or [-pi, pi] then you run into consistent issues at the boundaries. If the posterior is sufficiently concentrated then it is possible to orient the coordinates so that the posterior doesn’t have appreciable mass across the boundary, but it’s tricky to verify when the posterior is tight enough and even then you have to know where it’s concentrating. This typically results in an awkward, iterative process.

That same process is technically possible for orthogonal (equivalently orthonormal) matrices as implied by the paper but it becomes extraordinarily harder as you now have to understand how the posterior concentrates in a multidimensional space.

Models over the space of orthogonal matrices are becoming more and more relevant, but the only way that they can principly be implemented in Stan (and all of these identification issues avoided) is to seriously change the algorithmic backend to enable the geodesics as described by Simon and Mark’s paper.

Everyone is welcome to experiment with this code, but please be very careful as failures many not be immediately obvious. Again, see how subtle one-dimensional circular parameters are to fit using similar techniques already.

</methodological pedant>


#3

Yes, you’re totally right about the slice that can’t be described with a single chart because of well… topology. A lot of the times I think you don’t want to necessarily represent the whole Stiefel Manifold though, lest you have the reflection issues, which in my experiments show up in Byrne and Girolami’s method as well.

As an example if you’re doing, PPCA on the 2D data plotted below and you want to find the first principal component, both the red vector and blue vector are equally valid models, and I would say one of them is a superfluous reflection. This will give rise to a multi-modal posterior with little mass in-between in the modes that not even Byrne and Girolami’s method can solve.

In our parameterization it was straight-forward to just limit the angle of rotation to just go from [-pi/2,pi/2). I’m not sure how one would achieve a similar effect using Byrne and Girolami.

One could imagine though, that at this point you might have data that points straight up in which case, there’ll be posterior mass on both the vector pointing up and down, and you’ll have the multi-modal thing going on. In that case I suggest rotating the chart, like you’ve mentioned to [0,pi). That was a good point though I didn’t realize that changing the chart might actually be more subtle though because really what we care about is where the mass in the joint posterior is not the marginal mass.

Models over the space of orthogonal matrices are becoming more and more relevant, but the only way that they can principly be implemented in Stan (and all of these identification issues avoided) is to seriously change the algorithmic backend to enable the geodesics as described by Simon and Mark’s paper.

I agree that having to fuss around with the charts is not principled or possible even practical, but I don’t think the Embedded Manifold Geodesics is necessarily the only way. For example, I know the people who solve ODEs on manifolds often do a chart switching scheme where when they get close to the edge of the current chart the code automatically switched to another chart on the backend.

If we’re talking about the software engineering of actually getting this in to Stan, the question is would that be harder or using a different embedded integrator on the constrained parameters.

I also noticed when I used the Embedded Geodesics method, the matrix exponential operation could at times have trouble numerically, and the HMC trajectory would get off the manifold or “slide off the tracks” so to speak. That was another reason we sought a re-parameterization because it keeps us on the manifold no matter what, even for larger HMC step sizes.

The last thing I’ll mention is that if you want to do hierarchical models over orthonormal matrices the first prior that comes to mind is something like the Matrix Langevin distribution. The problem with that though is that it has a PDF that’s intractable to compute. Alan Edelman (numerical analyst at MIT) had a paper on speeding up the computation involved there, but it’s still pretty difficult.

The nice thing about our transform is that we can achieve the hierarchical prior effect by simply placing something like a multivariate Gaussian over the transformed parameters.

plot


#4

This is not due to the Stiefel manifold but rather how expectations are calculated on non-Euclidean manifolds. In particular you can’t just use the naive Monte Carlo estimators because of the chart switching – there is extensive literature on estimating moments on spheres that is directly applicable.

This sounds like an error in the implementation. If the analytic geodesics, which are available for Euclidean manifolds (lines), spheres (global rotation), torii (component rotations), and Stiefel manifolds (matrix rotations), are being implemented correctly then their application will never cause you to leave the manifold (modulo floating point errors) no matter how large of a step you take.

To be explicit I am referring to Byrne and Girolami and not Brubaker et al which can run into numerical issues in staying on the manifold.


#5

This is not due to the Stiefel manifold but rather how expectations are calculated on non-Euclidean manifolds. In particular you can’t just use the naive Monte Carlo estimators because of the chart switching – there is extensive literature on estimating moments on spheres that is directly applicable.

Hm, I think I get what you’re saying about not being able to compute expectations depending on what coordinates you use but could you elaborate on the other part? If you’re using the special case of the Stiefel manifold where n=2 and p=1, i.e. a 2D vector constrained to be of length 1, there are two such vectors on the Stiefel manifold with equal likelihood for the PPCA plot I posted. This leads to a multi-modal posterior even in the Byrne method, at least in our experiments. All I’m saying is that with this transform you can limit the angle of rotation to eliminate that superfluous, reflected mode.

As far as expectation over coordinates, I agree that expectations over (x,y) coordinates are not interpretable, which is another benefit of the angle rotation. In Stan we’d get an expectation of approximately pi/2 for the toy problem I posted, and a confidence interval on that would actually make sense .

This sounds like an error in the implementation. If the analytic geodesics, which are available for Euclidean manifolds (lines), spheres (global rotation), torii (component rotations), and Stiefel manifolds (matrix rotations), are being implemented correctly then their application will never cause you to leave the manifold (modulo floating point errors) no matter how large of a step you take.

The Byrne paper provides the closed-form solutions of the Stiefel geodesics in the embedded (original matrix) coordinates, but they are in terms of the matrix exponential. Under the hood, matrix exponential code is usually either a Pade approximation or maybe a Runge-Kutta solver like RK4 (see Cleve Moler’s famous paper). So even though these are “closed-form solutions”, in a sense, we still have to numerically integrate an ODE, and that’s going to prone numerical error just like any other ODE problem. Our implementation could be faulty, but a couple people in my lab implemented it, including myself, and we found that for large step-sizes the coordinates left the manifold when we used the matrix exponential implementations in R and Python.


#6

I’m interested in PPCA with Stan so thanks for posting this. I have hit my head against the wall enough to take betanalpha’s warnings seriously, even if I don’t grok them.


#7

This has to do with the model being non-identified, not anything particular about the geometry or topology of the Stiefel manifold. Knowing that you have this symmetry one would naturally be motivated to look at an appropriate quotient space which would end up giving you some kind of projective geometry, maybe? But that geometry would be much more complicated than a preselected chart because it wouldn’t constraint the orientation of the quotient.

If there were circumstances where you knew the orientation then you could get away with a subset of the Stiefel manifold that would then be isomorphic to one of these charts (and eventually R^{N}). Same way that the space of simplifies is isomorphic to an orthant of a hypersphere which we can cover with a single chart.

But you have to know the orientation a priori for all of this to work. My concern is that in most cases you don’t, so the only way for the methods proposed in the paper to work would be an empirical Bayes approach where you use the data once to estimate the orientation and then again to fit a model using that fixed orientation. That’s can be a dangerous game, especially in high-dimensional spaces.

But any decent matrix exponential solver, regardless of whether you go with an ODE based solution or the more typical Pade approximants, should have tolerances and err out with informative messages if the tolerances aren’t met. Then you can either retune the algorithm or go to a smaller integration time.

All of that said the resulting geodesic should decompose into a sequence of Givens rotations that can be performed analytically within the current chart (with the reentering of charts as needed to contain the entire step).


#8

Ah ok yes. I understand what you meant now. Yes this is not a problem with the manifold itself, but the model (this case PPCA). Sorry, I probably phrased my first post badly.

Yes, I’m still with you, and that I think is one of the advantages of working in these transformed coordinates. It makes it easier (IMO) to reason about what that quotient space that you need is, then allows you only consider that quotient space in your Stan program by simply using the constraints in Stan e.g. theta real<lower=-pi/2, upper=pi/2>. In contrast, I don’t know how straight-forward it would be to express the desired quotient space in the embedded coordinates that the Byrne method uses. Even if you were able to do that how would you modify the algorithm to make sure we don’t leave the quotient space?

I’m a little lost on what you mean by projective geometry and the orientation of the quotient space here. Going back to the toy example I posted, do you mean the orientation of the vector that explains the data?

I guess it gets a little confusing because there’s the step size from the HMC that gets multiplied by a matrix then passed in to the matrix exp function. Then there’s the “step size” that the internals of the matrix exponential solver uses. In my experiments when the former is big you run in to problems, but maybe that could be resolved by lowering the latter?

That’s an interesting point. I wonder if there would be numerical advantages to using Givens rotations to move along the geodesics in Byrne’s algorithm instead of just computing the solutions in the embedded coordinates using the matrix exponential. But again, what if you need that quotient space for your model?

By the way thanks for the stimulating discussion! This has been way more useful/informative than the reviews I got when I submitted this to a machine learning conference.


#9

Your proposed coordinates fix the orientation of the coordinate system. The relevant quotient space would indeed be isomorphic to these coordinates if the orientation was fixed. But in general the orientation is not fixed so you’d have to augment the coordinates with at least one orientation vector which will bring the topological problems back in.

Projective spaces match opposite points to each other, for example points on opposite sides of a globe, forming the quotient space.

I am referring to the orientation of the coordinates. Think where exactly you place the slide of ill-defined values.

Yes, that was my point. Larger integration times, i.e. the t in exp(t A) which ends up being the HMC step size, may indeed stress the numerical solver but that should be remediable by tuning the parameters of the matrix exponential solver itself.

If Givens rotations are indeed analytic solutions to the matrix exponential (to be clear, all of the analytic geodesics in the other spaces are also solutions to matrix exponential equations) then would be preferred given their significant speedup.

The quotient space is another issue as it’s an entirely different geometry and topology.


#10

So just to use the toy example because it helps me to reason about this better when we’re not speaking in person. Are we talking about how the half-circle is embedded in R^2, i.e. whether we use the support [-pi/2,pi/2) versus using the support [0,pi)?

Say we use the former embedding, then we do HMC and find out there is appreciable mass at both boundaries, so decide to use the latter embedding. What are some ways this can lead to a misleading analysis? When I think of empirical Bayes I think of getting a point estimate then basing your prior on that, which will certainly affect where posterior mass concentrates in a way we don’t want (see e.g. the Rat Tumor example in chapter 5 of BDA3). As far as I can tell here (and I could be wrong), the way we choose to embed the half circle in R^2 does not affect the way posterior mass concentrates (tight or narrow, or what have you). It only affects the coordinates with which we choose to describe where the posterior concentrates, and the only reason we’d want to do this is because HMC has trouble jumping between those masses at the boundaries because there is usually so little mass in between them.

In other words, could the embedding of the half circle in R^2 that you use lead to two different ways that you interpret your posterior that are contradictory to each other? If we fit a PPCA in the [-pi/2,pi/2) coordinates and most of the mass concentrates at the boundaries, then that tells us the first principal component points up (or down which we know are equivalent). If we changed the embedding to the [0,pi) coordinates we’d see the mass concentrate near pi, which also tells us the first principal component points up. These are not contradictory. Even if the two versions of the posterior did lead us to contradictory inferences, we’d know to just use the second embedding, right?

I just want to better be able to see how this leads to negative issues in practice. Again, thank you for you patience!


#11

Fundamentally the Givens rotations do not parameterize a Stiefel manifold. They can be considered conditional coordinates only conditioned on an orientation. Hence if you consider the joint space of rotation + orientation as the complete picture then working with a fixed orientation informed by the data through the previous fit would be akin to empirical Bayes.

The particular parameterization of the Stiefel manifold would indeed not affect the posterior inferences, but the Givens rotations leave a line defect in the parameterization which changes the space and consequently the model. Now technically this difference should have zero measure under any complete model but in theory one could come up with posteriors that put nonzero mass on the defect, a la a spike and slab prior.

In practice, however, we wouldn’t consider those models as they are bad regardless of the implementation. Which means that the biggest concerns are practical – unless there is vanishing mass in the neighborhood around the defect then the space of Givens rotations will have significant amounts of mass near the boundaries (at extreme values if mapping all the way to R^{N}). As I said before, if your posterior strongly concentrates in some small neighborhood of the Stiefel manifold then the space of Givens rotations would be sufficiently well behaved for computational algorithms, at least provided that you could pick out the correct orientation. If the posterior does not concentrate sufficiently then using the Given rotations will stress the algorithms to the point where their accuracy guarantees are suspect.


#12

Can you elaborate on this part, i.e. how significant mass near the boundary stresses in the algorithms and results manifests in poor accuracy? Can this lead to numerical issues and/or incorrect posteriors? Does it have to do with the sampling the angles on the logistic scale and most of the mass being near the boundary?


#13

You start with a compact space then slide it to introduce a border. Mass concentrating near borders causes all kinds of problems, for example needing an algorithm with increasingly small length scale to resolve the structure of the mass near the boundary or increasingly long sojourns if you try to map that boundary to infinity.


#14

You may be able to come at PCA problems from another direction. This paper
https://arxiv.org/abs/math/0606320
shows that all KxK orthogonal matrices can be represented as
R = E * (I - S) * (I + S)^{-1}
where S is antisymmetric and E is diagonal with each diagonal entry being -1 or +1. That is smooth with respect to the strictly lower triangular elements of S but discrete with respect to the diagonal elements of E. For small K, you could marginalize over the 2^K possibilities for E but that would be tedious and slow for large K. Or maybe you could fix some of them.

In a SVD, the two orthogonal matrices are being multiplied by a diagonal matrix of singular values, which are traditionally defined to be positive. But if you allow each singular value to be either positive or negative, then that is tantamount to multiplying a positive singular value by +1 or -1. So, if you let

matrix[K,K] S_u;
matrix[K,K] S_v;
int pos = 1;
for (j in 1:K) {
  S_u[i,i] = 0;
  S_v[i,i] = 0;
  for (i in (j+1):K) {
    S_u[i,j] = u[pos];
    S_u[j,i] = -u[pos];
    S_v[i,j] = v[pos];
    S_v[j,i] = -v[pos];
    pos += 1;
  }
}
Ut = inverse(I + S_u) * (I - S_u);
V  = (I - S_v) * inverse(I + S_v);
X =  Ut * Sigma * V;

and make the diagonal elements of Sigma be an ordered[K] sigma instead of positive_ordered[K] sigma;, then you may be able to make progress from there with unrestricted u and v vectors of size K choose 2.


#15

Thanks for the reply Ben!

I’ve seen the Cayley Transform before, but I haven’t seen the version with the E out front that has discrete elements. I have to read this more carefully, but I’m wondering if fixing some them would be equivalent to reducing the Stiefel Manifold to the quotient space that Michael and I were talking about earlier. I think getting that quotient space is important so that we don’t have non-identifiable posteriors in PPCA and related problems.


#16

@arya Happen to get an example working for this? Just encountered a data set where I need to do PPCA too.


#17

Yes, here is a link to my Github with our PPCA implementation in Stan. I haven’t been good about keeping the code up to date, so it may not be the fastest/cleanest version (we’re still trying to get the manuscript accepted somewhere). In the mean time let me know if you have questions.


#18

I am not really a geometry guru, so the finer details of the following paper could maybe use a cursory glance from someone a bit more proficient from this community, but this looks like an approach that might actually be tractable in Stan:

It uses the Cayley transform on densities over skew-symmetric matrices to get a family of densities on SO(p) interpolating between the uniform density and a tangent-space Gaussian, with the restriction that they do not sample elements with eigenvalue pairs of -1 (which they argue is a neglible set anyway). They have a simulation/sampling algorithm, results on marginals and moments, and explicit normalization.

I guess the salient point is whether the Jacobian correction for their transformations is easily available in Stan?


#19

Thanks for sharing the link. After skimming it, I think this might actually be viable but I have to read papers like this 10 times to grasp all the details.


#20

Upon further review, I would say definitely maybe. If, based on the paper, you just run

data {
  int<lower=3> p;
  real<lower=0> kappa;
}
transformed data {
  real log_pi = log(pi());
  matrix[p,p] I = diag_matrix(rep_vector(1, p));
}
parameters {
  vector[choose(p, 2)] s;
}
transformed parameters {
  matrix[p, p] S; // antisymmetric
  {
    int pos = 1;
    for (j in 1:(p - 1)) {
      S[j,j] = 0;
      for (i in (j + 1):p) {
        S[i,j] = s[pos];
        S[j,i] = -s[pos];
        pos += 1;
      }
    }
    S[p,p] = 0;
  }
}
model {
  matrix[p,p] IS = S;
  for (i in 1:(p - 1)) {
    real half_i = 0.5 * i;
    target += -half_i * log_pi + lgamma(kappa + i) - lgamma(kappa + half_i);
    IS[i,i] = 1;
  }
  IS[p,p] = 1;  
  target += -(kappa + p - 1) * log_determinant(IS);
}
generated quantities {
  matrix[p,p] P = 2 * inverse(I - S) - I; // rotation matrix
}

then it seems to be fine if kappa is sufficiently far away from zero. If kappa = 0, then the BFMI is low and you have massive funnels. On the other hand, if you have data then you may be able to get away with kappa = 0, which is what would be assumed if we introduced rotation_matrix[p] P; as a type in the Stan language.