Hello,
I’m working with a ‘Probabilistic Factor Analysis’ model very similar to (and based on) this one: https://www.cs.helsinki.fi/u/sakaya/tutorial. That factor analysis model has no orthogonality constraint or prior, but I think one would be very helpful in my case: without it, I tend to see a couple of very important factors repeated over and over across all the dimensions.
There have been a number of conversations on here about sampling orthogonal matrices. Givens rotations and a couple of other options seem to give people trouble (for reasons well explained by @betanalpha). I tried a Householder solution a while back, and I remember it being unworkably slow. The approach of over-parameterizing and taking the polar decomposition, though, seems to have found some success (Efficient orthogonal matrix parameterization - #11 by mjauch), and I was inspired by that to do something just slightly different.

I would like to sample matrices with orthogonal columns, not orthonormal ones (i.e., columns that are mutually orthogonal but of any length). It seems like a polar decomposition followed by rescaling each column to the length of its original would retain a link between the individual parameters and their counterparts in the transformed matrix, in a way that’s a little more like a ‘centered’ parameterization (compared to having a separate scale parameter for each column). I have some strong data, which seems to like centered parameterizations better.
My question is: would such a transformation require a Jacobian adjustment if I wanted to then place priors directly on the transformed, orthogonal matrix?
parameters {
  matrix[K, N] Z_raw; // PCA axis scores
  matrix[V, K] W_raw; // PCA variable loadings
}
transformed parameters {
  // polar factor (U * V') has orthonormal columns; rescaling by the
  // original column/row norms restores the original lengths
  matrix[V, K] W = svd_U(W_raw) * diag_post_multiply(svd_V(W_raw)', sqrt(columns_dot_self(W_raw)));
  matrix[K, N] Z = diag_pre_multiply(sqrt(rows_dot_self(Z_raw)), svd_V(Z_raw')) * svd_U(Z_raw')';
}
model {
  to_vector(W) ~ student_t(3, 0, 1);
  to_vector(Z) ~ std_normal();
  to_vector(W_raw) ~ normal(to_vector(W), 1); // keeps input close to output, and away from discontinuities?
  to_vector(Z_raw) ~ normal(to_vector(Z), 1);
}
P.S. I am aware of issues with multimodality and such. Empirically, this method does actually seem to give me meaningful results with ADVI (a few known clusters of samples always come up accurately in the latent space).
P.P.S. The priors placed directly on the ‘raw’ parameters were inspired by this thread: Divergence /Treedepth issues with unit_vector - #3 by Raoul-Kima. This seems novel for orthogonal matrices and I’m curious if people have thoughts on whether it helps with anything.
P.P.P.S. I am also curious about the generative models implied here: in particular, maybe it would be more realistic to put the student_t() priors on the ‘raw’ matrix anyway, assuming the ‘raw’ matrix is actually the more realistic one (there’s no reason to expect different factors to be orthogonal in nature; the orthogonality just helps with interpretation?).