Efficient orthogonal matrix parameterization

I’m trying to parameterize an orthogonal matrix via a series of Givens rotations, as follows:

data{
  int<lower=2> TT;    //U is a TT * TT matrix
}
transformed data{
  int dim_theta_i2toQ=(TT-1)*(TT-2)/2;   //dimensionality of theta_i2toQ: theta[i,j] for i=2,...,TT-1; j=i+1,...,TT
}
parameters {
  vector<lower=-pi(),upper=pi()>[TT-1] theta_i1;             //theta[1,j]; j=2,...,TT
  vector<lower=0,upper=pi()>[dim_theta_i2toQ] theta_i2toQ;   //theta[i,j]; i=2,...,TT-1; j=i+1,...,TT
}   //There is a bijection between theta=[theta_i1,theta_i2toQ] (theta_i1 is in [-pi,pi] and theta_i2toQ is in [0,pi]) and the TT*TT orthogonal matrix U

transformed parameters {
  matrix[TT,TT] U_transpose=diag_matrix(rep_vector(1,TT));

  {
    int index=1;
    real cos1;
    real sin1;
    vector[TT] Ui_tmp;
    vector[TT] Uj_tmp;
  
    //apply the Givens rotation in the (i,j) plane for each angle
    for(i in 1:(TT-1)){
      if(i==1){
        for(j in (i+1):TT){
          cos1=cos(theta_i1[index]);
          sin1=sin(theta_i1[index]);
          Ui_tmp=cos1*U_transpose[:,i]-sin1*U_transpose[:,j];
          Uj_tmp=sin1*U_transpose[:,i]+cos1*U_transpose[:,j];
          U_transpose[:,i]=Ui_tmp;
          U_transpose[:,j]=Uj_tmp;
          index+=1;
        }
        index=1;   //reset: rows i>=2 index into theta_i2toQ
      }else{
        for(j in (i+1):TT){
          cos1=cos(theta_i2toQ[index]);
          sin1=sin(theta_i2toQ[index]);
          Ui_tmp=cos1*U_transpose[:,i]-sin1*U_transpose[:,j];
          Uj_tmp=sin1*U_transpose[:,i]+cos1*U_transpose[:,j];
          U_transpose[:,i]=Ui_tmp;
          U_transpose[:,j]=Uj_tmp;
          index+=1;
        }
      }
    }
  }
}

The mathematics behind this orthogonal matrix parameterization is given in [1].

This is part of my program, and I have found that it runs slowly. Could this orthogonal matrix parameterization be made more efficient?

Thanks.

Reference:
[1] Matteson, D. S., & Tsay, R. S. (2017). Independent component analysis via distance covariance. Journal of the American Statistical Association, 112(518), 623–637.

It doesn’t run slowly because it takes a long time to evaluate those loops; it runs slowly because the posterior geometry has a lot of curvature. It probably would be fair to say that we do not know a good transformation into an orthogonal matrix. There are more options (but not necessarily good ones) if you can restrict yourself to rotation matrices.

Thanks Ben.

When I run with warmup=0, it takes a very short time.
However, when warmup >= 1, the time consumption increases super-linearly:

warmup=10, iter=20, adapt_delta=0.99, 4 chains in parallel: 11.587 s
warmup=100, iter=200, adapt_delta=0.99, 4 chains in parallel: 1022.823 s. Many parameters have Rhat > 10.
warmup=1000, iter=2000, adapt_delta=0.99, 4 chains in parallel: 7860.795 s (about 131 minutes). All parameters have Rhat within (0.9992, 1.0068).

Does this indicate that my code is computationally efficient, but the model’s statistical efficiency could be improved?

For the final case (warmup=1000, iter=2000, adapt_delta=0.99, 4 chains in parallel), I would like to shorten the run time while keeping Rhat within 1.05 or 1.1. Do you think it is a good idea to use fewer iterations (e.g. warmup=500, iter=1000) or a smaller adapt_delta?

Thank you.

All of that indicates that it is a difficult posterior distribution to draw from, so the first priority has to be getting that right, rather than the wall time that it takes to do it.

You have to have a positive warmup value (several hundred at least) to be able to adapt to the posterior distribution that you are drawing from, so that you can get a good enough effective sample size to reliably estimate what you are interested in (assuming the mixing is fast enough for the effective sample size estimate to be well-defined).

You have to eliminate the divergent transitions and often that implies increasing adapt_delta or finding a better parameterization. And we don’t know a universally better parameterization, so you might very well be stuck with waiting two hours.

It is hard to figure out a statistically more efficient parameterization. I just parameterized the matrix directly, and this parameterization is at least correct.

In addition, the effective sample sizes of all parameters are above 1000 (1000 × 4 draws are kept), and Rhat < 1.01, as attached.

Thanks.

GMMBLCA_unconstrained_fixC_noZ_StanResult20190501.txt (11.5 KB)

There is no parameterization of the space of orthogonal matrices into real numbers of the same dimension. The Givens rotations define a coordinate system that covers most, but not all, of the space of orthogonal matrices, as there is a singular line where the coordinates are undefined.

The same thing happens with a circle – you can parameterize all but one point of a circle with angles in (-pi, pi), but to parameterize the whole thing you need to consider another, overlapping parameterization, say with angles in (0, 2pi).

This is a basic feature of trying to work with a space whose topology isn’t the same as that of the real numbers, for example spheres, tori, and Stiefel manifolds (which are the spaces of orthogonal matrices). In other words, you can’t really build a model over orthogonal matrices the way Stan is currently configured.

@betanalpha I think [-pi, pi), (-pi, pi], (0, 2pi], and [0, 2pi) each cover the whole circle, but they cannot reflect the fact that -pi+eps and pi-eps are very close on the circle for small eps > 0. Thank you for your enlightenment.

\pi - \epsilon and -\pi + \epsilon are far apart in the parameterization but close on the circle. That inability of the parameterization to faithfully characterize “closeness” in the original space is one indication of the problem.

To understand the issue more deeply you’ll have to dip your toes into some differential geometry. Wikipedia has some okay introductions; see for example https://en.wikipedia.org/wiki/Manifold.
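For the circle itself, Stan sidesteps the angle problem by embedding the point in R^2 rather than using an angle coordinate. A minimal sketch (the variable names here are only illustrative):

parameters {
  unit_vector[2] xy;   // a point on the circle, embedded in R^2 instead of coded as an angle
}
transformed parameters {
  real theta = atan2(xy[2], xy[1]);   // an angle can still be recovered when needed
}

Because nearby points on the circle stay nearby in the xy coordinates, the -pi/pi wraparound never arises; the price is one redundant dimension.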

We have recently explored an alternative parameterization based on Householder reflections. It is similar to the way that Stan represents a unit circle: instead of mapping it to angles (which has the problem that Betancourt explained), an over-parameterized full vector is used and then normalized. In our case, a set of such vectors is used and then transformed onto the space of orthogonal matrices via a sequence of Householder reflections. The corresponding Stan code for Bayesian PCA using this method can be found here.
Feel free to try if that works any better in your case.
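Roughly, the idea looks like this in Stan. This is only a simplified sketch, not the linked code itself: the actual implementation restricts the k-th reflection to the trailing coordinates and handles signs more carefully, and the variable names and the normal prior below are just one possible choice.

functions {
  // Householder reflection H = I - 2 u u' / (u'u); orthogonal for any nonzero u
  matrix householder_reflection(vector u) {
    int D = num_elements(u);
    return diag_matrix(rep_vector(1, D)) - (2 / dot_self(u)) * (u * u');
  }
}
data {
  int<lower=1> D;
}
parameters {
  matrix[D, D] V;   // unconstrained raw vectors; only their directions matter
}
transformed parameters {
  matrix[D, D] U = diag_matrix(rep_vector(1, D));
  for (k in 1:D) {
    U = householder_reflection(V[, k]) * U;   // product of reflections is orthogonal
  }
}
model {
  to_vector(V) ~ normal(0, 1);   // proper prior on the redundant radial directions
}

The resulting U is orthogonal by construction for any nonzero columns of V, but the implied distribution over U depends on the construction details, which is exactly what the linked code pins down.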


@bertschi Great.
Let me try.
Thanks a lot.

I arXived a paper on this topic recently: https://arxiv.org/pdf/1906.07684.pdf

Essentially, to simulate from the distribution of a random orthogonal matrix Q, we define a distribution for a real random matrix X such that the orthogonal component of the polar decomposition of X is equal in distribution to Q. This can be viewed as a form of parameter expansion. For a particular choice of distribution for the expanded parameters, this approach looks a lot like what Stan does to simulate from the unit sphere. See Equation 4.

*I should add that there is a typo directly following Equation 4. The argument of the density should be Q, not Q_X.
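To illustrate the general shape of this parameter expansion (not the paper’s exact construction; the names and the normal distribution below are my own choices), a Stan sketch that maps an unconstrained matrix to its orthogonal polar factor might look like this. Which distribution to put on X, and what density adjustment that implies for Q, is exactly what the paper works out.

data {
  int<lower=1> D;
}
parameters {
  matrix[D, D] X;   // expanded, unconstrained parameter
}
transformed parameters {
  matrix[D, D] Q;   // orthogonal polar factor Q = X (X'X)^{-1/2}
  {
    matrix[D, D] XtX = crossprod(X);          // X'X, almost surely positive definite
    vector[D] lambda = eigenvalues_sym(XtX);
    matrix[D, D] E = eigenvectors_sym(XtX);
    Q = X * E * diag_matrix(inv_sqrt(lambda)) * E';
  }
}
model {
  to_vector(X) ~ normal(0, 1);   // one possible choice for the expanded distribution
}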


Thank you @mjauch.

My model includes an orthogonal matrix and also other parts, and the total run time is long. How can I figure out where the bottleneck lies: in the orthogonal matrix or in the other parts, and whether it is statistical or computational?

Thank you.

Thanks @mjauch for bringing this to our attention. I’ll have to read it more closely, but it looks interesting. I have a branch that implements the matrix square root for the PSD case, so maybe I can add some other ways to do a polar decomposition.

@bertschi Are you sure the Householder approach does not have the issue that Givens rotations have? Does Householder work efficiently in your research?
Thank you.

@bertschi, we are trying to use your implementation to create a space of orthogonal matrices for building a covariance matrix for hierarchical modeling, but we get initialization errors. Do you have any suggestions for what to try to overcome this? We also tried a multivariate reparameterization, but it didn’t help either.

I am not following the Stan forum very regularly at the moment, but I just saw your question. Could you provide some more details or a small example reproducing the problem?

Hi, @bertschi. The parameterization as coded in your GitHub repo isn’t ideal for Stan because of the autodiff. For example, consider this function.

matrix Householder (int k, matrix V) {
  // Householder transformation corresponding to kth column of V
  int D = rows(V);
  vector[D] v = V[, k];
  matrix[D, D] H;
  real sgn = sign(v[k]);
  v[k] +=  sgn;
  H = diag_matrix(rep_vector(1, D)) - (2.0 / dot_self(v)) * (v * v');
  H[k:, k:] = -sgn*H[k:, k:];
  return H;
}

This winds up generating a lot of zeroes in H when (D - k)^2 + D \ll D^2, and it looks like k loops over large values. These aren’t so bad for straight-up linear algebra, but they are a lot worse with autodiff, as we wind up looping over all these values again in interpreted mode during the reverse pass.

It’s much more efficient to loop and add 1 to the diagonal rather than to create a diag_matrix, which in autodiff gets filled with a quadratic number of entries that all create overhead.
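For example (a sketch of the idea, not tested against the rest of the repo), the diag_matrix line in the function above could be replaced by something like:

// equivalent to H = diag_matrix(rep_vector(1, D)) - (2.0 / dot_self(v)) * (v * v'),
// but only the diagonal entries touch the identity part
H = -(2.0 / dot_self(v)) * (v * v');
for (d in 1:D)
  H[d, d] += 1;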

It also looks like it’s going to be inefficient to multiply the chain of full Householder matrices.

I thought there were more efficient ways to do just the low-rank Householder update, but the Anderson, Olkin, and Underhill (1985) paper, based on Stewart (1980), that lays out this transform doesn’t say much before going on to Givens rotations.