# Generating orthogonal latents

I’m playing with SEMs and have one latent, B, that needs to be orthogonal to another latent, A, where A has already been “generated” by earlier non-trivial structures in the model, so I need a way to express something like

A = … ; // complicated dependency on other latents
B_helper ~ std_normal() ;
B = f(A, B_helper) ;


Where the f function is whatever makes B optimally orthogonal to A. I’ve actually already implemented a candidate for f where I simply do a regression of B_helper predicted by A and obtain the residuals, but that feels very ad-hoc so I thought I’d query if there’s a more standard was to do this. I feel like the kinds of rotation common in PCA or maybe the QR transform might be relevant? Thoughts?

2 Likes

Given a vector in \mathbb{R}^n, the vectors orthogonal to it will all be confined to an n - 1 dimensional hyperplane embedded in \mathbb{R}^n. For example, given a two-dimensional vector like (1, 0), the orthogonal vectors will all be confined to a one-dimensional line along the (positive or negative) direction of (0, 1). Thus, one way to sample an orthogonal vector is to

1. rotate the coordinate space so that the first vector points directly along the x-axis,
2. then sample under the constraint that the first element of the new vector is zero, and
3. then rotate back to the original coordinates.

Thus, you could sample a vector with the first element zero, and then transform it via the rotation matrix that corresponds to the “rotate it back” operation, i.e. the inverse of the rotation matrix needed in step 1. Note that the rotation matrix that achieves step 1 is not unique, because after rotating to align with the x-axis you can “twist” space around the x axis without messing up the alignment. But it should be fine to pick one particular rotation.

1 Like

Ha, I was just checking out videos on rotations as you posted. Unfortunately I’m very naive in the matrix algebra realm; any chance you’d be able to provide the corresponding Stan code? I think it would start with:

data{
int n ;
}
parameters{
vector[n] a ;
vector[n-1] b_helper_1 ;
}
model{
a ~ std_normal() ;
b_helper ~ std_normal() ;
vector[n] b_helper_2 = append_row([0]',b_helper1) ;
matrix[n,n] rotation_matrix = g(a) ;
vector[n] b = b_helper_2 * rotation_matrix ;
}


But I’m unsure what function g would be precisely; given your response, presumably it first finds the rotation matrix X wherein the first element of the output of a*X is zero, then returns the inverse of X? If I have that right, I’m still not sure what operations would find X.

1 Like

I don’t have the chops to confidently write it down quickly, and I don’t have the time to write it down carefully. This answer, which discusses the connection to the QR decompostion, might be useful. matrices - How can I calculate a $4\times 4$ rotation matrix to match a 4d direction vector? - Mathematics Stack Exchange

1 Like

I think an alternative to @jsocolar’s approach would be to use that for a \in \mathbb{R}^N \setminus 0 and b \sim \mathsf{normal}(0, \mathbf{I}_N) then c = b - (b^{\mathsf{T}} a) / (a^{\mathsf{T}} a) a is almost surely orthogonal to a (this can be seen from c^{\mathsf{T}}a = b^{\mathsf{T}} a - (b^{\mathsf{T}} a) / (a^{\mathsf{T}} a) (a^{\mathsf{T}} a) = 0 ).

Alternatively if A is the 1 \times N matrix with a \in \mathbb{R}^N \setminus 0 as it single row, then if we perform a singular value decomposition U, S, V^{\mathsf{T}} = \mathsf{svd}(a) such that V is a N \times N orthogonal matrix and S a 1 \times N matrix with the single non-zero singular value in its leftmost column and all other entries zero, then if we set W as the N \times (N-1) matrix corresponding to all columns of V except the leftmost, then the columns of W are a basis for the N - 1 dimensional linear subspace orthogonal to a and if b \sim \mathsf{normal}(0, \mathbb{I}_{N-1}) then c = W b will be orthgonal to a.

1 Like

The modified Gram-Schmidt process produces an orthogonal matrix. Maybe this helps?

data {
int<lower=1> p; // number of parameters
}
transformed data {
int q = (p * (p - 1)) %/% 2;
}

parameters {
vector<lower=0>[p] diag_ele;
vector[q] off_diag;
}

transformed parameters {
matrix[p, p] Q; // orthogonal matrix
array[p] vector[p] u; // columns of Q
matrix[p, p] L;

L[1, 1] = diag_ele[1];
{
int count = 0;
for (j in 2:p) {
L[j, j] = diag_ele[j];
for (k in 1:j - 1) {
count += 1;
L[j, k] = off_diag[count];
L[k, j] = 0;
}
}
}

// modified Gram-Schmidt process
for (j in 1:p) {
u[j] = L[j]';
for (i in 1:(j-1)) {
u[j] = u[j] - (Q[i] * L[j, i])';
}
Q[j] = u[j]' / sqrt(dot_self(u[j]));
}
}

model {
// prior on L
off_diag ~ std_normal();
diag_ele ~ lognormal(0, 1);
}
generated quantities {
matrix[p, p] P1 = tcrossprod(Q);
matrix[p, p] P2 = crossprod(Q);
}


If I understand correctly, your code produces two matrices whose respective vectors are orthogonal to one another? I don’t think this helps for my scenario as described above given that one of my vectors is the result of other model structures.

Would this be implemented as:

data{
int n ;
}
parameters{
vector[n] a ;
vector[n] b ;
}
model{
a ~ std_normal() ;
b ~ std_normal() ;
vector[n] c = b - (b'*a)/(a'*a)*a ;
}


?

And,

would be implemented as:

data{
int n ;
}
parameters{
vector[n] a ;
vector[n-1] b ;
}
model{
a ~ std_normal() ;
b ~ std_normal() ;
vector[n] v = svd_V(a) ;
matrix[n,n-1] w = v[,2:n] ;
vector[n] c = w*b ;
}


?

Presumably they’re formally equivalent but the latter should be faster than the former computationally?

Just one matrix that where each column is orthogonal.

@matt-graham is it obvious that these constructions yield vectors with uniformly distributed directions within the orthogonal (hyper)plane? Asking out of genuine curiosity, definitely not out of a presumption that they might not!

Edit: part of what I like about “my” construction is that it becomes super transparent to set a prior on the orthogonal vector in a space where we don’t have to worry about how the orthogonality constraint interacts with the prior in potentially nonintuitive ways (i.e. in the rotated space where we are setting a prior on an n-1 dimensional vector). Perhaps your methods are equally transparent to somebody who can grok them a bit faster than I.

1 Like

Yes that looks right to me.

I am not super familiar with Stan’s notation for indexing matrices - I would have probably though something like matrix[n, n-1] w = v[:, 2:n]; instead of matrix[n, n-1] w = v[, 2:n]; based on my prior from using Python mostly. I think svd_V also only accepts matrix arguments so you would possibly need something like

  matrix[1, n] A;
A[1, :] = a;
vector[n] v = svd_V(A);


In terms of the ‘forward’ computation involved in simulating from the model or evaluating the (log) joint density and its gradient, then actually the former I think should be the faster as it only involves \mathcal{O}(N) vector operations compared to the latter in which the SVD should be \mathcal{O}(N^2) for a N \times 1 matrix. The main benefit of the latter is that it avoids using an overparametrized representation, with the former approach projecting a N dimensional vector on to a N - 1 dimensional subspace, which by introducing a non-identifiability may lead to a more challenging posterior geometry.

For the question of equivalence - yes I believe the distributions of c in both the cases I described should be the same. In both cases they are linear transformations of Gaussian distributed vectors and so will be Gaussian (if we allow the definition of Gaussian distributions to be extended to those supported on linear subspaces / with singular covariance matrices).

For the case c = b^\mathsf{T} - (b^\mathsf{T} a) / (a^\mathsf{T} a) a, using index notation with repeated indices indicating summation and that \mathbb{E}[b_i] = 0, \mathbb{E}[b_i b_j] = \delta_{ij}

\mathbb{E}[c_i] = \mathbb{E}[b_i - (b_j a_j)/(a_k a_k) a_i] = 0
\mathbb{E}[c_i c_j] = \mathbb{E}[(b_i - (b_k a_k) / (a_l a_l) a_i)(b_j - (b_m a_m) / (a_n a_n) a_j)] = \delta_{ij} - a_i a_j / (a_k a_k)

so \mathbb{E}[c] = 0 and \mathbb{V}[c] = \mathbf{I}_N - (a a^\mathsf{T}) / (a^\mathsf{T} a).

For the case c = W b, as W is only a function of a we immediately have \mathbb{E}[c] = 0 and \mathbb{V}[c] = \mathbb{E}[W b b^\mathsf{T} W^\mathsf{T}] = W W^\mathsf{T}.

We know that V = [\hat{a} ~W] where \hat{a} = a / \sqrt(a^\mathsf{T} a) and V V^\mathsf{T} = \mathbf{I}_N. We also have that V V^\mathsf{T} = \hat{a}\hat{a}^\mathsf{T} + W W^\mathsf{T} and so that W W^\mathsf{T} = \mathbf{I}_N - \hat{a}\hat{a}^\mathsf{T} = \mathbf{I}_N - aa^\mathsf{T} / (a^\mathsf{T} a).

Therefore in both cases c is Gaussian distribution with zero-mean and covariance matrix \mathbf{I}_N - (a a^\mathsf{T}) / (a^\mathsf{T} a).

2 Likes

Good question, I believe so, but I could be wrong! As explained above I think both constructions should give samples from the same Gaussian distribution on the linear subspace. From the second construction c = W b with the columns of W a set of N - 1 orthogonal unit vectors providing a basis for the linear subspace orthogonal to a and b \sim \mathsf{normal}(0, \mathbf{I}_{N-1}) then we have standard normal distribution along each of the axes of the orthogonal basis of the subspace defined by W, which should correspond to standard normal distribution along any other orthogonal basis W R where R is an (N-1) \times (N-1) orthogonal matrix.

1 Like

Mike, if you are implementing a relatively traditional SEM:

It sounds like you are looking at what latents are downstream from other latents, then conditioning on the relevant upstream latents while trying to ensure that residuals are uncorrelated. If so, I think it could be easier to consider the joint distribution of all latents, as opposed to something like the distribution of A and the distribution of B given A. I could add more detail in a second post if helpful, but this is a different direction from some of the other responses and I might be off base.

Do you have the modified Gram-Schmidt process as part of your “helpful_stan_functions”? Have you made much use of it in practice?

It’s not in that repo but I should put it in. However, I’d update the code so that L and the Gram-Schmidt process are all done together.

I haven’t tried modeling with it though I would be curious if you do.

Thanks.

I haven’t tried this version yet, though I might give it a go. There was another Bayesian PCA that had been posted on this group somewhere that I’ve used. When coding my own versions of these problems, I often have difficulties because you have to make sure you identify it properly and avoid multi-modalities.

I think this is better. It’s basically the QR decomposition. The gram_schmidt function returns the Q matrix and gram_schmidt_R returns the R matrix. Once tuples are in I can combine this into one function.

  // Computes the Gram-Schmidt process to orthogonalize a matrix
matrix gram_schmidt(matrix A) {
int J = rows(A);
int K = cols(A);
if(J < K) reject("number of rows must be >= the number of columns!");
matrix[J, K] Q;
for (k in 1:K) {
vector[J] v = A[:, k];
for (j in 1:k - 1) {
vector[J] u = Q[:, j];
v -= dot_product(v, u) * u;
}
Q[:, k] = v / sqrt(dot_self(v));
}
return Q;
}

matrix gram_schmidt_R(matrix A) {
int J = rows(A);
int K = cols(A);
if(J < K) reject("number of rows must be >= the number of columns!");
matrix[K, K] R;
matrix[J, K] Q;

for (k in 1:K) {
vector[J] v = A[:, k];
for (j in 1:k - 1) {
vector[J] u = Q[:, j];
R[j, k] = dot_product(v, u);
R[k, j] = 0;
v -= R[j, k] * u;
}
R[k, k] = sqrt(dot_self(v));
Q[:, k] = v / R[k, k];
}
return R;
}


Yes, the issue is that there’s no one-to-one mapping between the Euclidean line and the space of orthogonal matrices. Also, the identifiability issue of the the determinant switching between -1 and 1.

@spinkney Tried to use the Gram-Schmidt with a probabilistic PCA approach. Chains did not mix well and the largest Rhat was above 1.55. (some effective samples were really low). However, it did seem to recover the covariance matrix well in my case (though the data I simulated were dominated by a single factor). I think I might stick with other approaches over this.

functions {
// Computes the Gram-Schmidt process to orthogonalize a matrix
matrix gram_schmidt(matrix A) {
int J = rows(A);
int K = cols(A);
matrix[J, K] Q;
if(J < K) reject("number of rows must be >= the number of columns!");
for (k in 1:K) {
vector[J] v = A[:, k];
for (j in 1:k - 1) {
vector[J] u = Q[:, j];
v -= dot_product(v, u) * u;
}
Q[:, k] = v / sqrt(dot_self(v));
}
return Q;
}
}
data {
int<lower=1> T;
int<lower=2> N;
vector[N] X[T];
int<lower=1> N_K;
}
transformed data {
int N_nonzero_loadings = (N_K * (N_K - 1)) / 2 + N_K * (N - N_K);
vector[N] zeros = rep_vector(0.0, N);
}
parameters {
vector<lower=0>[N_K] diag_ele;
real<lower=0> sd_e;
vector<lower=0>[N_K] sd_F;
}
transformed parameters {
matrix[N, N_K] Q; // orthogonal matrix
cov_matrix[N] Sigma;
{
matrix[N, N_K] L;
L[1, 1] = diag_ele[1];
{
int count = 0;
for (j in 2:N_K) {
L[j, j] = diag_ele[j];
for (k in 1:j - 1) {
count += 1;
L[j, k] = off_diag[count];
L[k, j] = 0;
}
}
for (j in (N_K + 1):N) {
for (k in 1:N_K) {
count += 1;
L[j, k] = off_diag[count];
}
}
}
Q = gram_schmidt(L);
}
Sigma = tcrossprod(diag_post_multiply(Q, sd_F));
for (i in 1:N) {
Sigma[i, i] += square(sd_e);
}
}
model {
// prior on L
off_diag ~ std_normal();
diag_ele ~ lognormal(0, 1);
// other priors
sd_e ~ cauchy(0, 1);
sd_F ~ cauchy(0, 1);
// model
X ~ multi_normal(zeros, Sigma);
}

2 Likes