Orthogonal (not orthonormal) matrix parameters

Hello,

I’m working with a ‘Probabilistic Factor Analysis’ model very similar to (and based on) this one: https://www.cs.helsinki.fi/u/sakaya/tutorial. That factor analysis model has no orthogonality constraint or prior, but I think such a constraint would be very helpful in my case (without it, I tend to see a couple of very important factors repeated over and over on all the dimensions).

There have been a number of conversations on here about sampling orthogonal matrices. Givens rotations and a couple of other options seem to give people trouble (and the reasons seem to be well explained by @betanalpha). I tried a Householder solution a while back and I remember it being unworkably slow. The approach of over-parameterizing and taking the polar decomposition, though, seems to have had some success (Efficient orthogonal matrix parameterization - #11 by mjauch), and I was inspired by that to do something just slightly different. I would like to sample orthogonal matrices, not orthonormal matrices (i.e., orthogonal columns of any length). It seems like a polar decomposition followed by rescaling the columns to the lengths of the originals would retain a link between the individual parameters and their counterparts in the transformed matrix, in a way that’s a little more like a ‘centered’ parameterization (compared to having additional scaling parameters for each column). I have some strong data, which seems to like centered parameterizations better.
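
Concretely, writing the transform out (mostly to check my own understanding): if the thin SVD of the raw matrix is $W_{\text{raw}} = U \Sigma V^T$, then the polar (orthonormal) factor is $Q = U V^T$, and I rescale its columns back to their original lengths,

$$
W = U V^T \, \mathrm{diag}\!\left(\lVert w_1 \rVert, \dots, \lVert w_K \rVert\right),
$$

where $\lVert w_k \rVert$ is the norm of the $k$-th column of $W_{\text{raw}}$. The columns of $W$ are then mutually orthogonal, but each keeps the length of its raw counterpart.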

My question is: would such a transformation require a Jacobian adjustment if I wanted to then place priors directly on the transformed, orthogonal matrix?

    parameters {
          matrix[K,N] Z_raw; // PCA axis scores
          matrix[V,K] W_raw; // PCA variable loadings
    }
    transformed parameters {
          // polar (orthonormal) factor of W_raw, with each column rescaled back to its original norm
          matrix[V,K] W = svd_U(W_raw) * diag_post_multiply(svd_V(W_raw)', sqrt(columns_dot_self(W_raw)));
          // same idea for Z_raw, but orthogonalizing (and rescaling) its rows
          matrix[K, N] Z = diag_pre_multiply(sqrt(rows_dot_self(Z_raw)), svd_V(Z_raw')) * svd_U(Z_raw')';
    }
    model {
          to_vector(W) ~  student_t(3,0,1);
          to_vector(Z) ~ std_normal();
          to_vector(W_raw) ~ normal(to_vector(W), 1); // keeps input closer to output, and away from discontinuities?
          to_vector(Z_raw)  ~ normal(to_vector(Z), 1);
    }

P.S. I am aware of issues with multimodality and such. Empirically, this method does actually seem to give me meaningful results with ADVI (a few known clusters of samples always come up accurately in the latent space).

P.P.S. The priors placed directly on the ‘raw’ parameters were inspired by this thread: Divergence /Treedepth issues with unit_vector - #3 by Raoul-Kima. This seems novel for orthogonal matrices and I’m curious if people have thoughts on whether it helps with anything.

P.P.P.S. I am also curious about the generative models implied here: in particular, maybe it would be more realistic to put the student_t() priors on the ‘raw’ matrix anyway, assuming the ‘raw’ matrix is the more realistic one (there’s no reason to expect different factors to be orthogonal in nature; the orthogonality just helps with interpretation?).

1 Like

Cool! Polar decomposition can be really useful. I believe you will need a Jacobian adjustment.

There was a post where the polar decomposition helped with sampling a monotonic spline, but that was applied to data: see Identifiability issues with monotone splines?! - #16 by spinkney

1 Like

Thanks!

I’ve seen the tutorials on using the QR decomposition for data, too, and see that an adjustment isn’t needed for essentially any matrix multiplication because it’s linear? But I wasn’t sure if the same holds for parameters.
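
Writing out my (possibly shaky) understanding of the linear case: if $y = A x$ for a fixed matrix $A$, then

$$
\left\lvert \det \frac{\partial y}{\partial x} \right\rvert = \lvert \det A \rvert,
$$

which is constant, so dropping it only shifts the log density by a constant. What gives me pause is that here $U$, $V$, and the column norms are themselves functions of the raw matrix, so the map from W_raw to W isn’t multiplication by a fixed matrix.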

I think the other difference between my method and, say, the one you linked is that I rescale the polar decomposition using the original scales. So my intuition says the otherwise-necessary Jacobian is really there because the standard polar decomposition changes scales, and my transform doesn’t, so…

Intuition is only getting me so far, though!

To be certain you’d have to do the math. It would be neat to report back with a pdf of the math if you do figure it out. I’m sure more people are interested now that svd is in the language.

2 Likes

Unfortunately I’m not sure I can handle the math. Learning as I go here and mostly building intuition, but still not the best math chops… at least not yet. An acceptable answer might just be a nudge on how to do that! Maybe I can set a master’s student to work on it…

Maybe this helps: the derivative of the polar decomposition, https://www.julianpanetta.com/pdf/notes/PolarDecompositionDerivative.pdf

3 Likes

Excellent, yes, that is helpful! Thanks!

For now, I think it’s actually a better model anyway if I sidestep the Jacobian issue and only use the orthogonalized matrix as a prior center, imposing a soft orthogonality constraint instead of a hard one.

    parameters {
          matrix[K,N] Z; // PCA axis scores
          matrix[V,K] W; // PCA variable loadings
    }
    transformed parameters {
          matrix[V,K] W_ortho = svd_U(W) * diag_post_multiply(svd_V(W)', sqrt(columns_dot_self(W)));
          matrix[K,N] Z_ortho = diag_pre_multiply(sqrt(rows_dot_self(Z)), svd_V(Z')) * svd_U(Z')';
    }
    model {
          to_vector(W) ~ student_t(3,0,1);
          to_vector(Z) ~ std_normal();
          to_vector(W) ~ normal(to_vector(W_ortho), 0.25); // regularize toward orthogonality
          to_vector(Z) ~ normal(to_vector(Z_ortho), 0.25);
    }
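
As a side note, if it’s useful to see how tightly that soft constraint actually binds, one option (just a sketch, not part of the model above, and the quantity name is made up) would be to track the off-diagonal mass of W' * W in generated quantities:

    generated quantities {
          // Frobenius norm of the off-diagonal part of W'W;
          // this is zero exactly when the columns of W are orthogonal
          real W_ortho_gap;
          {
                matrix[K,K] C = W' * W;
                W_ortho_gap = sqrt(sum(square(C)) - sum(square(diagonal(C))));
          }
    }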

I had actually started with a similar concept a while ago, without the polar decomposition but by putting an extra prior on the correlations formed by the matrix cross products. It actually ‘worked’ fairly well but the choice of prior seemed pretty arbitrary (empirically derived), so if this version works I think it’ll be preferable? Easier to adjust the strength of the regularization, I think.

1 Like

There’s also this way to construct an orthogonal factor loading matrix: Bayesian Factor Analysis. There are a few posts on the forum that use this method. I’m curious to see whether the method you describe and the latent factor method give similar results, and how their performance compares. Do you happen to have a reproducible example?

My original intro to Stan years ago was actually via a JAGS model much like the one you just linked. I’ve just been stubborn because I wanted the parameters to be less influenced by the arbitrary choice of order and structure. I’ve been thinking the multimodality is mostly taken care of by ADVI getting stuck in local modes (and maybe even HMC would get stuck there if we used the ADVI results as inits and the data are strong?). I have also imagined an iterative procedure that uses my model’s ADVI results to help guide the choice of ordering in a model like the one you linked…

I could potentially get a reproducible model and dataset if others are really interested. Right now, I have a very complicated factor analysis model that mixes microbiome data, transcriptomics, and associated metadata, with 1000s of samples and 1000s of variables (I tend to put the cart before the horse). The data I’m experimenting on is not releasable, but I might have a new, smaller, releasable dataset soon (or I can pretty easily find a public one to try this on if we want to do a 1:1 comparison).

Definitely post if you get the data.

There’s also glmpca, which you may be interested in if you haven’t already seen it. It is not a Stan program, though; there are both Python and R packages. It’s used for non-normal data and was originally developed for single-cell RNA data. I don’t know much more about the method, but I’ve filed it away as something to try to implement in Stan if I ever get the time.

1 Like

Haven’t seen that package, thanks! It looks like a couple of others I’ve seen, though, in that it seems to work essentially on a single dataset, not a multi-omics collection like the one I’m working with? That’s the part that I think is really missing in the world of nice, pre-made packages: finding a latent space that explains multiple datasets at once, with simultaneous zero- (or low-value-) inflated multinomial, normal, and categorical error models. I guess that’s a tangent from this post’s actual question, but it’s the reason I’ve been trying to do a custom model in Stan for this project.

edit: Since this thread has expanded a bit, I’ll also note that I thought about doing away with the orthogonality issue entirely and allowing all the resulting dims to be correlated. Since the latent variables are then normal, it’s simple enough to do standard PCA on the result in order to see how they differ from one another. This definitely ‘worked’ at showing important clusters in my data, but from a theoretical standpoint it didn’t sit perfectly with me, given non-normal priors on loadings and a little more difficulty understanding the posteriors of individual parameters. But maybe someone could convince me that this simpler solution is actually best, and then I could move on, haha.

With almost certainty of embarrassment, I’ll go ahead and link the model I am currently working with. But no data of course. There are… extra complications in this model because… cart and horse. stan_models/BGPFA.stan at master · rmcminds/stan_models · GitHub

1 Like

I opened an issue in glmpca to see if he’d be interested in adding the heterogeneous case.

1 Like

The implementation of a prior model will always depend on the structure of the model. The real question here is whether the compatibility of the prior model with one’s domain expertise depends on the structure of the model.

If you can evaluate a heuristic prior model and it seems reasonably consistent with your domain expertise (for example, generating distributions of orthogonal matrices or induced factor loadings as a prior pushforward check), and the resulting posterior distributions can be accurately quantified with a given tool like Stan, then you can have some mild confidence in your results.
The problem arises when the prior is never checked for that consistency with domain expertise; without that principled regularization the resulting posterior can vary widely depending on arbitrary implementation choices like ordering and scaling.

The posterior distribution expands to all model configurations consistent with the observed data and the stated prior model. If the likelihood function is extremely degenerate, as is the case for these kinds of models, then there will be many model configurations consistent with any observed data set. The only way to compensate for that degeneracy (without collecting more data, changing the observation model, etc.) is to add more information to the prior model.

But what information can be added to the prior model? Accidental degeneracies (also known as over-parameterizations) like label switching and rotation don’t affect the predictions of the model and hence those should be restricted. But non-accidental degeneracies communicate important information about the model fit and those shouldn’t be ignored. The challenge here is engineering prior models that restrict the accidental degeneracies but not the non-accidental degeneracies.

What problem does this solve?

Restricting the posterior fit to a “local” mode just defines an implicit prior model informed by a particular observation and the details of the algorithm! If that implicit prior just cuts off accidental degeneracies then, sure, no problem, but if it cuts off non-accidental degeneracies then you’re overfitting to that mode and sabotaging your predictive performance.

A useful prior model captures all reasonable model configurations, not just some. Checking that the configurations in the included mode seem reasonable, without checking that the configurations in the excluded modes are unreasonable, isn’t a proper prior validation and won’t tell you whether you’ve solved a problem or not.

Unfortunately in these classes of model the different modes are almost always non-accidental degeneracies. They each define different but equally good stories for how the data could be generated, and importantly they usually inform different decisions, predictions, and the like. Finding a mode might be useful in some cases, but it’s rarely the whole story.

1 Like

I’ll probably have to chew on your response for a while before digesting it all, but for now…

I suppose I realize this, and it’s specifically arbitrary structure that I’m concerned about. I suspect the fully identified model using a lower triangular loading matrix does indeed conflict with my domain expertise. Each column of the matrix has arbitrary variables constrained to be zero, so the latent variables that correspond to those columns are biased arbitrarily. I suspect this is less of a concern when assuming everything is normally distributed, because the columns could potentially be rotated post-hoc and the priors would be the same (I think, since additive combinations of normals are still normal?). But if I want to use sparsity-inducing priors for variable loadings (which I believe are very important for my dataset), then arbitrary rotation of the columns produces weird implicit priors on the latent variables, consisting of additive combinations of student-t or other distributions that depend on the specific rotation.
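
To spell out the rotation-invariance argument I have in mind: if a vector of latent scores (or a row of loadings) is iid standard normal, $z \sim \mathcal{N}(0, I)$, and $R$ is any rotation ($R R^T = I$), then

$$
R z \sim \mathcal{N}(0, R R^T) = \mathcal{N}(0, I),
$$

so the prior doesn’t care which rotation was picked. But if the entries are iid Student-t, the entries of $R z$ are weighted sums of Student-t variables, which are in general no longer Student-t, so the implied prior does depend on the specific rotation.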

Of course, for practical purposes, those weird implicit priors might work. Like you said, I should be explicitly checking whether such a model makes sense given my expectations. But I was exploring ways to be more theoretically consistent in at least that regard.

Agree that this is a major challenge.

Well… the feel-good problem of getting a point estimate with at least some local confidence intervals. Incrementally improving our understanding of the system while we struggle to figure out the purely theoretically consistent model. In purely practical terms, I’ve tried NUTS on some of these models, and the geometry seems to force step sizes so small as to be completely impractical. Plus, if NUTS did succeed in jumping between modes, even a modal point estimate would be more difficult to retrieve from the samples. The current standard in the field is to use a pipeline of reductionist tools that don’t attempt to address any of the weird error models and have extremely suspect methods of quantifying uncertainty. shudder. I think my ADVI-based solution is better than that!

I see your point here, and I guess ultimately I have to accept that these models simply don’t solve problems. They pull out some reasonable-looking hypotheses from a giant mess of data, which better experimental techniques will have to verify afterwards. Trust me, I prefer simpler, more controlled studies. But I think there’s a place for this.

I agree with @spinkney that I need to do a 1:1 comparison with other options like the fully identified one, though. If each model samples a different space of reasonable-looking hypotheses, then they can both be useful.

So I did spend more time thinking on this. I suppose I understand the problems you’ve pointed out, but I don’t really understand what the recommended solution is for my case. If label switching and reflection are the only degeneracies that should be restricted, is one of the methods discussed particularly good at doing that without unduly restricting the non-accidental degeneracies? The solutions I’ve seen, such as the ‘identified’ model, seem to wind up with implicit priors that exclude reasonable hypotheses, right? I thought what I was attempting, though clearly imperfect, might at least have a slightly larger set of reasonable hypotheses than what currently exists.

The one thing I’m feeling more confident about after this discussion is that, at the very least, the restriction to orthogonal matrices is itself pretty arbitrary. So I could see entirely abandoning that and focusing on a better interpretation of the highly correlated latent variables that result from the original model. I can imagine using Procrustes rotations to ‘fix’ label switching and reflections in a post-hoc summary, and maybe the implicit priors there are more reasonable?
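
For reference, the post-hoc alignment I’m picturing is the standard orthogonal Procrustes solution: to rotate a posterior draw $W^{(s)}$ onto some reference $W^{\text{ref}}$ (say, the loadings from one chosen draw or a point estimate), take

$$
R^\star = \underset{R \,:\, R^T R = I}{\arg\min} \left\lVert W^{(s)} R - W^{\text{ref}} \right\rVert_F = \tilde{U} \tilde{V}^T,
\qquad \text{where } \tilde{U} \tilde{\Sigma} \tilde{V}^T \text{ is the SVD of } W^{(s)\,T} W^{\text{ref}}.
$$

Since reflections and column permutations are themselves orthogonal matrices, this covers sign flips and label switching as special cases.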

I very much appreciate the input I’ve gotten from both of you. Thanks!

Just a word of warning – sparsity of factor loadings almost always doesn’t really make sense because of these degeneracy issues. In order to even define sparsity in a well-defined way one has to first find an appropriate, interpretable basis where some collection of latent variables can be shrunk towards zero independently of each other. No surprise this is intimately related to being able to build an interpretable prior model even before considering sparsity.

In my opinion, no. I know a colleague working on a really exciting exchangeable prior model for latent factor models that I think will be really, really important but unfortunately there is nothing yet that I can share.

I think so.

It might very well capture a larger set of reasonable hypotheses, but so long as you’re not capturing all of the hypotheses supported by your model you are never going to be able to adequately critique that model. In particular you’ll never be able to discriminate between problems with the model, problems with the assumed experimental design, and problems with just happening to have found a poor mode. Sure you can still throw the fit at heuristic predictive metrics, and maybe you’ll even do better on those metrics, but are you actually understanding your model and how it generalizes?

In order to build a more principled analysis it’s almost always (as in 1 - \epsilon) better to start with a simpler, less degenerate model and build from there. You might not get to include all of the parameters – or even all of the data – you want to, but at least you’ll be starting at a place where you can properly critique the limitations of that model.

This includes, for example, implicit prior models. If you can accurately fit an implicit prior model and any resulting posterior distributions, then you can accurately examine the consequences of that model and how consistent they might be with your domain expertise. There may very well be some set of rotations and positivity constraints that restricts the latent factor model to some well-defined subspace that can be fit and critiqued, in which case you can actually determine whether it’s good enough for your analysis or not. But if you can’t fit the full model then you’ll never be able to fully characterize the consequences of that implicit prior model.

1 Like

Thanks for all of this. I suppose I will stop chasing this rabbit. There are many individual questions that I was trying to address in a single PCA-type model, and they can surely be asked within a more targeted framework.