PC1,PC2 are principal components (they account for >.95 of the variance). I’d like to try to model a potential data generating process for it.

I’m presently looking at modelling it as a mixture of heteroscedastic linear regressions because of all sharp lines and gradients with gamma marginals for PC1. I ran some pilots and I can fit a K=5 mixture to it with Stan although I can’t make it identifiable yet.

My approach feels a bit arbitrary so I wondered what your first thoughts would be regarding how to approach modelling data that looks a bit like this?

Did you center and scale (maybe also log strictly positive data etc.) the data that went into PCA?

How does the plot look like when you transform PC1 and PC2 with something like asinh()?

It’s a bit hard for me to think about the (or a) data generating process when there is PCA involved… Or maybe I don’t full understand what your goal is… Are the principal components inputs for another model?

Maybe you can identify the mixture model by having the slopes as ordered vector? You’d have to think of what happens to the intercept then, though – maybe reverse order it… idk

Sorry if that’s a bit confusing… Hope you’ll find a good approach!

In essence yes :) The two PCA components account for so much variance and reduce a 5D dataset to 2D. I think that makes it more suitable for mixtures. I’ve never been able to fit mixtures where the components are high dimensional. So I’m basically considering PC1 and PC2 as the dataset to model.

I’ll give it a go; I haven’t used this transform before, would you mind saying a bit regarding when it’s useful?

I’ve got all of Betancourt’s tricks and a few more already baked in :-)

Centering the data means I lose relative magnitudes which is problematic for me in this case. Log scaling produces a similar picture but the “explosion” just looks more curved.

I think of it as log transform with a bit more flexibility (allowing <= 0 data). Some use it to model data (mainly to to avoid log(y + 1) situations, but these should be modeled differently anyways imo), but I mainly use it to plot skewed data that is positive and negative (like your PC2; PC1 looks strictly positive from glancing at the plot, right?).

I see, I guess the thinking is, the more symmetrical the data the easier to model, so try some transforms to achieve symmetry if possible. Is that accurate?

That’s right. Interestingly, if I take every 5D point and do a linear regression on it, and then take the intercept and slope as it’s 2D counterpart; I get a very similar plot. The slope is always positive, but the intercept varies.

I mean, if you do a regression such as y ~ b0 + b1 * PC1 + b2 * PC2 + e, what matters is the functional form with respect to y. Strictly positive covariates usually point at a multiplicative effect of that covariate on the outcome (not necessarily though) - thus the log-transform you see usually, but if the covariate has 0’s or some negative values but is still skewed you can try other transformations… Heteroskedasticity is often just a misspecification problem (wrong functional form). You could probably also just model it as a GAM with splines or GPs if your really not sure about the functional form.*

Also, sometimes the transformation just makes plotting nicer, so you can maybe see a bit more structure in the data.

*) This is probably not really helpful, if you want to estimate a mixture of simpler models eventually? I think I would almost alwasy prefer to put all covariates in (maybe with QR decomp, if they’re highly correlated) and work from there… idk