Parameterizing an orthonormal matrix (the Stiefel manifold)

I’ve been digging into ways to parameterize an orthonormal matrix. Here’s a simple approach based on the QR decomposition, in which the variable \beta is distributed uniformly over K \times N orthonormal matrices (the N columns are pairwise orthogonal and of unit length, i.e., \beta^\top \cdot \beta = \textrm{I}).

The approach is straightforward. Generate a K \times N matrix of independent standard normal variates and QR-factor it; the resulting Q matrix is your uniformly distributed orthonormal matrix.

ortho-qr.stan

data {
  int<lower=1> K;                        // dim of orthogonal vecs
  int<lower=1, upper=K> N;               // # orthogonal vecs
}
parameters {
  matrix[K, N] alpha;
}
transformed parameters {
  matrix[K, N] beta = qr_thin_Q(alpha);  // uniform orthonormal
}
model {
  to_vector(alpha) ~ normal(0, 1);
}
generated quantities {
  matrix[N, N] I = beta' * beta;         // identity
}

The generated quantities block is just a check on orthonormality, which we can see passes by running the model in CmdStanPy:

import cmdstanpy as csp
model = csp.CmdStanModel(stan_file = 'ortho-qr.stan')
fit = model.sample(data = {'N': 3, 'K': 5})
fit.summary()

The output verifies the orthonormality:

                Mean      MCSE        StdDev            5%       50%           95%     N_Eff   N_Eff/s     R_hat
lp__       -7.530620  0.072056  2.790200e+00 -1.272850e+01 -7.157700 -3.645840e+00   1499.44   6021.85  1.001490
alpha[1,1]  0.001198  0.011320  9.979420e-01 -1.625390e+00 -0.022932  1.667450e+00   7771.42  31210.50  0.999248
alpha[1,2] -0.013111  0.010082  1.022400e+00 -1.696020e+00 -0.022433  1.638130e+00  10284.20  41302.00  0.999890
alpha[1,3]  0.008322  0.010008  9.599100e-01 -1.543770e+00  0.019685  1.580010e+00   9200.09  36948.10  0.999080
alpha[2,1]  0.006395  0.011304  9.983870e-01 -1.629180e+00  0.002607  1.669930e+00   7800.70  31328.10  0.999640
alpha[2,2] -0.000308  0.011192  1.042710e+00 -1.737230e+00 -0.010869  1.695580e+00   8679.69  34858.20  0.999508
alpha[2,3]  0.000019  0.010190  1.007820e+00 -1.663040e+00  0.010653  1.630600e+00   9782.29  39286.30  0.999385
alpha[3,1] -0.007449  0.010033  1.002010e+00 -1.656830e+00 -0.029778  1.664730e+00   9974.33  40057.60  0.999275
alpha[3,2]  0.008401  0.010177  9.883600e-01 -1.615680e+00  0.003338  1.640160e+00   9431.65  37878.10  0.999213
alpha[3,3] -0.005065  0.009752  9.922030e-01 -1.642460e+00  0.003125  1.620050e+00  10351.10  41570.50  0.999351
alpha[4,1] -0.004184  0.010482  9.996200e-01 -1.645750e+00 -0.007468  1.662770e+00   9095.02  36526.20  0.999319
alpha[4,2]  0.000769  0.009982  1.000090e+00 -1.690680e+00 -0.003663  1.641520e+00  10037.90  40313.00  0.999094
alpha[4,3]  0.000399  0.010426  9.956700e-01 -1.655840e+00 -0.006336  1.657750e+00   9120.13  36627.00  0.999352
alpha[5,1] -0.007562  0.010794  1.025450e+00 -1.699320e+00 -0.007088  1.698940e+00   9025.03  36245.10  0.999235
alpha[5,2]  0.006172  0.010159  1.007470e+00 -1.659660e+00  0.012777  1.659950e+00   9834.58  39496.30  0.999584
alpha[5,3]  0.003730  0.010721  9.896660e-01 -1.637200e+00  0.002523  1.617940e+00   8521.89  34224.40  0.999434
beta[1,1]  -0.003341  0.005048  4.434700e-01 -7.390780e-01 -0.011343  7.318480e-01   7718.13  30996.50  0.999193
beta[1,2]  -0.006083  0.004530  4.551550e-01 -7.330580e-01 -0.015444  7.420130e-01  10093.40  40535.90  1.000340
beta[1,3]   0.001120  0.005151  4.476690e-01 -7.311930e-01  0.000670  7.415760e-01   7553.58  30335.70  0.999357
beta[2,1]   0.006547  0.004872  4.449230e-01 -7.086050e-01  0.001480  7.245850e-01   8338.67  33488.60  0.999627
beta[2,2]   0.004741  0.005034  4.523190e-01 -7.313870e-01  0.013392  7.362360e-01   8073.45  32423.50  0.999309
beta[2,3]   0.000025  0.005196  4.499990e-01 -7.325200e-01  0.000652  7.365650e-01   7501.13  30125.00  0.999172
beta[3,1]  -0.001154  0.004481  4.483140e-01 -7.332880e-01 -0.015290  7.465450e-01  10009.50  40198.80  0.999285
beta[3,2]   0.004010  0.004769  4.413500e-01 -7.192240e-01  0.003180  7.160750e-01   8566.30  34402.80  0.999191
beta[3,3]   0.004710  0.005236  4.452560e-01 -7.361150e-01  0.014167  7.171020e-01   7231.05  29040.40  0.999591
beta[4,1]  -0.003018  0.004727  4.440650e-01 -7.174590e-01 -0.003262  7.354320e-01   8825.57  35444.10  0.999422
beta[4,2]   0.000092  0.004798  4.421800e-01 -7.366790e-01 -0.002647  7.219420e-01   8493.95  34112.30  0.999272
beta[4,3]   0.001037  0.005345  4.522030e-01 -7.334000e-01  0.003401  7.324580e-01   7156.91  28742.60  0.999687
beta[5,1]  -0.003491  0.004682  4.553810e-01 -7.452820e-01 -0.002727  7.430380e-01   9459.52  37990.00  0.999317
beta[5,2]  -0.002081  0.004678  4.450820e-01 -7.287510e-01  0.003087  7.175080e-01   9052.62  36355.90  0.999960
beta[5,3]  -0.000743  0.005355  4.411090e-01 -7.207330e-01 -0.002116  7.162440e-01   6785.59  27251.40  0.999485
I[1,1]      1.000000       NaN  6.000000e-16  1.000000e+00  1.000000  1.000000e+00       NaN       NaN       NaN
I[1,2]     -0.000000  0.000000  0.000000e+00 -1.000000e-16  0.000000  1.000000e-16   3865.72  15525.00  0.999975
I[1,3]      0.000000  0.000000  0.000000e+00 -1.000000e-16  0.000000  1.000000e-16   3832.17  15390.20  1.000350
I[2,1]     -0.000000  0.000000  0.000000e+00 -1.000000e-16  0.000000  1.000000e-16   3865.72  15525.00  0.999975
I[2,2]      1.000000       NaN  6.000000e-16  1.000000e+00  1.000000  1.000000e+00       NaN       NaN       NaN
I[2,3]     -0.000000  0.000000  1.000000e-16 -1.000000e-16  0.000000  1.000000e-16   3792.03  15229.00  1.000110
I[3,1]      0.000000  0.000000  0.000000e+00 -1.000000e-16  0.000000  1.000000e-16   3832.17  15390.20  1.000350
I[3,2]     -0.000000  0.000000  1.000000e-16 -1.000000e-16  0.000000  1.000000e-16   3792.03  15229.00  1.000110
I[3,3]      1.000000       NaN  6.000000e-16  1.000000e+00  1.000000  1.000000e+00       NaN       NaN       NaN

I’m just getting started on this, so I’d be happy to get suggestions on alternative or better ways to generate uniform distributions over the Stiefel manifold (the set of orthonormal matrices). The QR is supposedly more stable than just applying Gram-Schmidt directly. There’s also the option to use Givens rotations as @Arya et al. did.

Arya A. Pourzanjani, Richard M. Jiang, Brian Mitchell, Paul J. Atzberger, and Linda R. Petzold. 2020. Bayesian Inference over the Stiefel Manifold via the Givens Representation. Bayesian Analysis.


Hi Bob, this indeed works because Stan’s QR uses the sign convention that the diagonal of R is positive.
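
For context, the QR factorization is only unique up to the signs of the columns of Q (and the corresponding rows of R), so a fixed sign convention is what makes the map from alpha to beta well defined. Here is a minimal NumPy sketch of enforcing the positive-diagonal convention by hand (just an illustration, not Stan’s internal code):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

Q, R = np.linalg.qr(A)            # NumPy makes no promise about the signs of diag(R)
s = np.sign(np.diag(R))           # flip columns of Q and rows of R so that diag(R) > 0
Q, R = Q * s, R * s[:, None]

assert np.allclose(Q @ R, A)      # still a valid QR factorization
assert np.all(np.diag(R) > 0)     # now matches the positive-diagonal convention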

K \times N semi-orthogonal matrices live on the (K N - N(N+1)/2)-dimensional Stiefel manifold, so the QR parameterization uses N(N+1)/2 extra parameters. I have an alternative parameterization that uses only N extra parameters but retains the nice numerical properties (the N parameters can also be eliminated, but I haven’t explored that geometry yet). I plan to publish it eventually, so I’ll send it to you privately within the next few days. That said, if your N is generally low, the difference in performance will be minimal.


Thanks. What is the performance of this approach?

I don’t know if your question was directed towards me, but the Householder parameterization in Rotation Invariant Householder Parameterization for Bayesian PCA is essentially the same approach (though a numerically stable implementation requires LAPACK-like subroutines). It has a unit Jacobian determinant, so geometrically, sampling from the invariant distribution of a semi-orthogonal matrix is no different from sampling an IID standard normal distribution. But it does have the usual singularity at 0 that occurs when sampling a vector and normalizing it, and when N=K, the smallest such vector has only a single element, so the problems with divergences for low-dimensional unit vectors raised in A better unit vector could be quite bad. There’s a relatively straightforward modification in this case that improves the geometry.
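
To make the construction concrete, here’s a rough NumPy sketch of the general idea (an illustration only, not the paper’s exact sign conventions): one unconstrained vector per column, of decreasing length, is normalized and used as a Householder reflector, and the reflectors are multiplied into the first N columns of the identity. The normalization step is where the singularity at 0 comes from.

import numpy as np

def semi_orthogonal_from_reflectors(vs, K):
    """Map unconstrained vectors vs (len(vs[i]) == K - i) to a K x N matrix
    with orthonormal columns by applying one Householder reflection per level
    to the first N columns of the identity. Illustration only."""
    N = len(vs)
    Q = np.eye(K)[:, :N].copy()
    for i in range(N - 1, -1, -1):
        u = vs[i] / np.linalg.norm(vs[i])          # singular at vs[i] = 0
        H = np.eye(K - i) - 2.0 * np.outer(u, u)   # dense reflector acting on rows i..K-1
        Q[i:, :] = H @ Q[i:, :]
    return Q

rng = np.random.default_rng(2)
K, N = 5, 3
Q = semi_orthogonal_from_reflectors([rng.standard_normal(K - i) for i in range(N)], K)
print(np.round(Q.T @ Q, 12))                       # identity, up to floating point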

In terms of number of operations, generating a dense semi-orthogonal matrix from this representation would require, I think, \mathcal{O}(K^2 N) operations, which is the same complexity as the QR decomposition itself, but with a lower constant. In principle one never needs to form the dense matrix, though, since LAPACK-like routines can efficiently operate on the factorized representation, which can be generated from the unconstrained parameterization in \mathcal{O}(K N) operations.

To my knowledge no empirical comparison of the performance of this method with other parameterizations has been done. I’m working on this.

Thanks. I guess I was asking Bob, but your answer was very helpful regardless.

The numerical precision and efficiency will depend on how well QR is implemented (e.g., with or without Householder transforms). I’m not entirely sure what’s going on under the hood with Stan’s implementation here. This was really intended more as a proof of concept.

If we restrict attention to square orthonormal matrices, we generate uniformly over the orthogonal group

\textrm{O}(N) = \left\{ x \in \mathbb{R}^{N \times N} \,\big|\, x^{\top} \cdot x = \textrm{I}_N \right\}.

There are two components to this group, the rotations and the reflections. If x \in \textrm{O}(N) is such that \det x = 1, then x is a rotation; if \det x = -1, then x is a reflection. The special orthogonal group arises if we restrict the determinant to 1,

\textrm{SO}(N) = \left\{ x \in \mathbb{R}^{N \times N} \,\big|\, x^{\top} \cdot x = \textrm{I}_N \ \textrm{and} \ \det x = 1 \right\}.

We can generate elements of \textrm{SO}(N) using the following Stan program. It relies on the trick that swapping two columns of an orthonormal matrix flips the sign of its determinant.

data {
  int<lower=2> N;               // dim and # orthogonal vecs
}
parameters {
  matrix[N, N] alpha;
}
transformed parameters {
  matrix[N, N] beta = qr_thin_Q(alpha);  // uniform orthonormal
  if (determinant(beta) < 0) {           // ensure rotation
    vector[N] temp = beta[ , 1];
    beta[ , 1] = beta[ , 2];
    beta[ , 2] = temp;
  }    
}
model {
  to_vector(alpha) ~ normal(0, 1);
}
generated quantities {
  real d_alpha = determinant(alpha);  // can be -1 or 1
  real d_beta = determinant(beta);  // should be 1
  matrix[N, N] I_beta = beta' * beta;  // should be identity
}

We check the determinant in the transformed parameters block and swap the first and second columns if the determinant is negative.

Let’s see if it works for everyone’s favorite group, the 3D rotations, aka \textrm{SO}(3).

              Mean   MCSE        StdDev            5%     50%           95%   N_Eff  N_Eff/s  R_hat
beta[1,1]   -0.004  0.009  6.000000e-01 -9.000000e-01  0.0030  9.000000e-01  3724.0  16119.0    1.0
beta[1,2]    0.020  0.010  6.000000e-01 -9.000000e-01  0.0200  9.000000e-01  3403.0  14733.0    1.0
beta[1,3]   -0.007  0.008  6.000000e-01 -9.000000e-01 -0.0090  9.000000e-01  4993.0  21616.0    1.0
beta[2,1]    0.003  0.009  6.000000e-01 -9.000000e-01  0.0080  9.000000e-01  3818.0  16530.0    1.0
beta[2,2]   -0.005  0.009  6.000000e-01 -9.000000e-01 -0.0200  9.000000e-01  3654.0  15817.0    1.0
beta[2,3]    0.008  0.008  6.000000e-01 -9.000000e-01  0.0090  9.000000e-01  5187.0  22453.0    1.0
beta[3,1]    0.010  0.010  6.000000e-01 -9.000000e-01  0.0400  9.000000e-01  3416.0  14790.0    1.0
beta[3,2]   -0.007  0.010  6.000000e-01 -9.000000e-01 -0.0100  9.000000e-01  3541.0  15329.0    1.0
beta[3,3]   -0.002  0.008  6.000000e-01 -9.000000e-01  0.0050  9.000000e-01  5496.0  23793.0    1.0
d_alpha     -0.000  0.000  2.500000e+00 -4.000000e+00 -0.0000  3.800000e+00  6094.2  26381.8    1.0
d_beta       1.000    NaN  0.000000e+00  1.000000e+00  1.0000  1.000000e+00     NaN      NaN    NaN
I_beta[1,1]  1.000    NaN  7.000000e-16  1.000000e+00  1.0000  1.000000e+00     NaN      NaN    NaN
I_beta[1,2]  0.000  0.000  1.000000e-16 -2.000000e-16  0.0000  2.000000e-16  3976.0  17210.0    1.0
I_beta[1,3]  0.000  0.000  1.000000e-16 -2.000000e-16  0.0000  2.000000e-16  3726.0  16130.0    1.0
I_beta[2,1]  0.000  0.000  1.000000e-16 -2.000000e-16  0.0000  2.000000e-16  3976.0  17210.0    1.0
I_beta[2,2]  1.000    NaN  7.000000e-16  1.000000e+00  1.0000  1.000000e+00     NaN      NaN    NaN
I_beta[2,3] -0.000  0.000  1.000000e-16 -2.000000e-16  0.0000  2.000000e-16  3996.0  17298.0    1.0
I_beta[3,1]  0.000  0.000  1.000000e-16 -2.000000e-16  0.0000  2.000000e-16  3726.0  16130.0    1.0
I_beta[3,2] -0.000  0.000  1.000000e-16 -2.000000e-16  0.0000  2.000000e-16  3996.0  17298.0    1.0
I_beta[3,3]  1.000    NaN  7.000000e-16  1.000000e+00  1.0000  1.000000e+00     NaN      NaN    NaN

This is 4000 sampling iterations across 4 chains, and it appears to work in the sense that it’s sampling roughly independently.

I worry about the discontinuity of swapping columns and would be curious whether the beta would be useful for inference in the sense of concentrating around a single mode in the unconstrained space where we sample. For example, we might use a parameterization like this for the change-in-bearing update in a time series of the movements of a flying, swimming, or burrowing animal, or to model the rotational axis of a spinning basketball, or to represent the yaw, pitch, and roll of an airplane.


Yes, I suspect swapping will indeed cause a discontinuity. In the Householder parameterization, you would sample from SO(K) by fixing a single parameter based on the previous values (but it would always be the same value except at an existing discontinuity). The result is equivalent to swapping the sign of the final column of the matrix if the determinant is negative. This should be smooth in this parameterization, but it might cause a discontinuity if sampling the full K \times K matrix, as it would correspond to the final column of your dense matrix alpha having a sign swap.
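
As a quick NumPy check of the equivalence mentioned above (flipping the sign of a single column of an orthogonal matrix flips the sign of its determinant, so conditionally negating the final column lands in SO(K) without swapping columns):

import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
print(np.linalg.det(Q))   # +1 or -1
Q[:, -1] *= -1.0          # negate the final column
print(np.linalg.det(Q))   # sign is flipped, so negate only when det is -1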

And in case anyone does not want to dig through the paper, the code for the Householder parameterization is available here: GitHub - rsnirwan/HouseholderBPCA: Rotation Invariant Householder Parameterization for Bayesian PCA

That implementation works by explicitly constructing N Householder matrices of size K \times K and then multiplying them together. This is fine for demonstrating good sampling efficiency in terms of ESS/nsteps, but efficient implementations of Householder transformations never construct a Householder matrix directly and instead compute the product much more efficiently from the reflector parameters. For small K and N the linked approach is probably fine, but when K >> N, QR-decomposing a dense K \times N matrix as Bob does in the OP will probably be much faster in terms of ESS/s (assuming Stan’s implementation of QR does not explicitly form Householder matrices).
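
To spell out the implicit-application point, here’s a small NumPy comparison (an illustration, not Stan’s or LAPACK’s code): forming the dense Householder matrix costs \mathcal{O}(K^2) storage and \mathcal{O}(K^2 M) per product, while the rank-1 update form costs \mathcal{O}(K M) and never forms a K \times K matrix.

import numpy as np

def reflect_dense(v, X):
    # explicitly form H = I - 2 v v^T / (v^T v), then multiply: O(K^2 M)
    H = np.eye(v.size) - 2.0 * np.outer(v, v) / (v @ v)
    return H @ X

def reflect_implicit(v, X):
    # the same reflection as a rank-1 update: O(K M), no K x K matrix formed
    return X - np.outer(v, (2.0 / (v @ v)) * (v @ X))

rng = np.random.default_rng(4)
v, X = rng.standard_normal(6), rng.standard_normal((6, 2))
assert np.allclose(reflect_dense(v, X), reflect_implicit(v, X))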

Here’s a Julia implementation of the efficient approach for Turing.jl that works with implicit reflectors (LAPACK’s generalization of Householder transformations): Demo of a bijective transform to a QR factorization using elementary reflectors · GitHub. It represents the Q factor in terms of LAPACK’s factorized representation (Julia has a native object that represents this), but it’s straightforward to construct a dense Q matrix from this (using LAPACK’s DORGQR). I don’t think it would be complicated to port the code to C++ for Stan, but more experiments comparing it to other transforms are necessary first. I don’t know if the Stan modeling language supports the in-place matmuls needed to keep allocations low, but even if not, it will be faster than constructing Householder matrices explicitly.


I’ve posted at GitHub - sethaxen/stan_semiorthogonal_transforms a Stan implementation of a transform using reflectors that avoids directly constructing reflector matrices, along with some example models that use different transforms:

  • full.stan: QR decomposing a full matrix as done in the OP (Stan uses Eigen’s HouseholderQR approach)
  • reflectors.stan: constructing an orthogonal matrix from reflectors
  • reflectors_factors.stan: constructing an orthogonal matrix from reflectors but representing it in terms of the reflector factors
  • householder.stan: constructing Householder reflectors as dense matrices that are multiplied together (adapted from https://github.com/rsnirwan/HouseholderBPCA/blob/master/py_stan_code/ppca_house.stan; not included in the repository due to lack of a license)

And some benchmark results drawing 1000 draws each from 4 chains:
N=K=3 (100 runs)

Model                Time (s)       ESS      ESS/s   ESS/nleapfrog
full                  0.08162    6292.1    36552.7        0.247655
householder           0.10652   4206.91    18469.3        0.184814
reflectors            0.13401   4192.07    14696.3        0.184204
reflectors_factors    0.07353   4192.07    26617.6        0.184204

N=K=10 (100 runs)

Model                Time (s)       ESS      ESS/s   ESS/nleapfrog
full                  0.99254   5045.11    2450.11        0.141223
householder           1.72863   4048.65     1171.0         0.14366
reflectors            0.79541   4411.44    2462.89        0.156555
reflectors_factors    0.36824   4411.44    4926.62        0.156555

N=K=20 (10 runs)

Model                Time (s)       ESS      ESS/s   ESS/nleapfrog
full                   8.1581   8160.53    464.858        0.136009
householder           21.6175   4089.65    88.6498       0.0712475
reflectors              5.132   4789.26    403.603       0.0831286
reflectors_factors     1.4957   4789.26    1145.05       0.0831286

N=K=30 (10 runs)

Model                Time (s)       ESS      ESS/s   ESS/nleapfrog
full                  24.6639   6456.95    133.978        0.107616
householder           109.209   4109.39    19.5075       0.0684898
reflectors             11.889    5203.4    204.009       0.0867233
reflectors_factors     2.4767    5203.4    780.709       0.0867233

N=K=40 (10 runs)

Model                Time (s)       ESS      ESS/s   ESS/nleapfrog
full                  79.1754   5322.27    38.1727       0.0868071
householder           232.552   3993.55    9.03856       0.0665591
reflectors            29.5462    4686.4    79.7552       0.0781067
reflectors_factors     4.4033    4686.4    396.463       0.0781067

N=K=50 (10 runs)

Model                Time (s)       ESS      ESS/s   ESS/nleapfrog
full                  590.552   5035.86    5.14641       0.0646216
householder           500.663   3919.16    4.34798       0.0653193
reflectors            78.2099   4107.16    28.9358       0.0684527
reflectors_factors     6.6903   4107.16    235.338       0.0684527

Note that we’re not doing anything with these orthogonal matrices. The latent geometry in every case is an IID normal, so the speed differences come entirely from the cost of computing the linear algebra on whatever number type is passed to it. This is reflected in the ESS/nleapfrog values, which don’t vary much across the models. ESS is computed as the mean ESS over all entries of the orthogonal matrix.

The main observations when K=N are that using a product of dense reflector matrices (householder) is the worst approach for all but the smallest orthogonal matrices. Next comes QR-decomposing a full matrix. As expected, this has the best geometry, so it generally has the best ESS/nleapfrog post-warmup. (EDIT: I don’t think this argument makes sense, since the geometry is IID normal for them all.) Using the implicit action of reflectors to build a dense matrix (reflectors) is much faster, since we both reduce the number of parameters and avoid the initial QR decomposition, while still avoiding dense matrix products.

Finally, reflectors_factors imagines what would happen if Stan added a native semiorthogonal_matrix[K, N] data type that stored the reflector factors and for which certain operations, such as multiplication, determinant, and linear solves, were implemented in terms of those factors. Efficient algorithms for each of these exist and are implemented in Eigen and LAPACK. This is really the only viable option with these Householder-inspired transforms if one needs to efficiently infer a large orthogonal matrix.
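
As a rough sketch of what such a type could look like (hypothetical names, illustrated in Python rather than Stan or C++): store only the reflector vectors and implement products against them, materializing the dense matrix only when it’s explicitly requested.

import numpy as np

class SemiOrthogonalFactors:
    """Hypothetical sketch: a K x N semi-orthogonal matrix stored only as its
    reflector vectors, with products computed directly from the factors."""

    def __init__(self, vs):
        self.vs = vs              # vs[i] has length K - i
        self.K = len(vs[0])
        self.N = len(vs)

    def matmul(self, X):
        """Compute Q @ X for an N x M matrix X without ever forming Q."""
        Y = np.vstack([X, np.zeros((self.K - self.N, X.shape[1]))])
        for i in range(self.N - 1, -1, -1):
            v = self.vs[i]
            Y[i:, :] -= np.outer(v, (2.0 / (v @ v)) * (v @ Y[i:, :]))
        return Y

    def dense(self):
        """Materialize the dense K x N matrix only when it's actually needed."""
        return self.matmul(np.eye(self.N))

This is the same idea behind LAPACK’s *ormqr routines, which apply Q or Q^T from exactly this kind of factorized representation.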

For K >> N, the results will of course be different; in the limiting case of a unit vector, all of the approaches converge to the same thing.


Thanks for the detailed comparison, Seth!

Additionally, the built-in QR has hand-written derivatives. I imagine that if the reflectors_factors method were added to Stan math with hand-written derivatives, we’d see additional speedups.

Ah, where is that defined in stan/math?

My bad, I assumed it was in, but it looks like it’s not. It would be in the rev folder. There is svd at https://github.com/stan-dev/math/blob/develop/stan/math/rev/fun/svd.hpp.

Ah, that makes sense. SVD is a little different since, IIRC, the standard algorithms produce the full U and V matrices directly, not in a factorized form. So it’s a lot easier to write the AD rule, and that rule is independent of the actual SVD algorithm. For a QR that avoids dense matrices it would be trickier. It would be easy to hand-write a rule for applying and building reflectors if necessary, but I suspect that efficiently differentiating the algorithms that use the reflectors would just amount to writing out by hand what the AD would otherwise do automatically.

For what it’s worth, it’s straightforward to write down AD rules for QR in terms of dense matrices: [2009.10071] QR and LQ Decomposition Matrix Backpropagation Algorithms for Square, Wide, and Deep -- Real or Complex -- Matrices and Their Software Implementation