Modelling linear looking mixtures in data

emiruz · January 21, 2020, 11:03am

I have some data that looks like this:

PC1,PC2 are principal components (they account for >.95 of the variance). I’d like to try to model a potential data generating process for it.

I’m presently looking at modelling it as a mixture of heteroscedastic linear regressions because of all sharp lines and gradients with gamma marginals for PC1. I ran some pilots and I can fit a K=5 mixture to it with Stan although I can’t make it identifiable yet.

My approach feels a bit arbitrary so I wondered what your first thoughts would be regarding how to approach modelling data that looks a bit like this?

Max_Mantei · January 21, 2020, 12:16pm

Hey! :)

Just some quick thoughts…

Did you center and scale (maybe also log strictly positive data etc.) the data that went into PCA?
How does the plot look like when you transform PC1 and PC2 with something like asinh()?
It’s a bit hard for me to think about the (or a) data generating process when there is PCA involved… Or maybe I don’t full understand what your goal is… Are the principal components inputs for another model?
Maybe you can identify the mixture model by having the slopes as ordered vector? You’d have to think of what happens to the intercept then, though – maybe reverse order it… idk

Sorry if that’s a bit confusing… Hope you’ll find a good approach!

emiruz · January 21, 2020, 12:39pm

In essence yes :) The two PCA components account for so much variance and reduce a 5D dataset to 2D. I think that makes it more suitable for mixtures. I’ve never been able to fit mixtures where the components are high dimensional. So I’m basically considering PC1 and PC2 as the dataset to model.

I’ll give it a go; I haven’t used this transform before, would you mind saying a bit regarding when it’s useful?

I’ve got all of Betancourt’s tricks and a few more already baked in :-)

Centering the data means I lose relative magnitudes which is problematic for me in this case. Log scaling produces a similar picture but the “explosion” just looks more curved.

Max_Mantei · January 21, 2020, 12:53pm

I think of it as log transform with a bit more flexibility (allowing <= 0 data). Some use it to model data (mainly to to avoid log(y + 1) situations, but these should be modeled differently anyways imo), but I mainly use it to plot skewed data that is positive and negative (like your PC2; PC1 looks strictly positive from glancing at the plot, right?).

emiruz · January 21, 2020, 1:05pm

I see, I guess the thinking is, the more symmetrical the data the easier to model, so try some transforms to achieve symmetry if possible. Is that accurate?

That’s right. Interestingly, if I take every 5D point and do a linear regression on it, and then take the intercept and slope as it’s 2D counterpart; I get a very similar plot. The slope is always positive, but the intercept varies.

Max_Mantei · January 21, 2020, 2:19pm

I mean, if you do a regression such as y ~ b0 + b1 * PC1 + b2 * PC2 + e, what matters is the functional form with respect to y. Strictly positive covariates usually point at a multiplicative effect of that covariate on the outcome (not necessarily though) - thus the log-transform you see usually, but if the covariate has 0’s or some negative values but is still skewed you can try other transformations… Heteroskedasticity is often just a misspecification problem (wrong functional form). You could probably also just model it as a GAM with splines or GPs if your really not sure about the functional form.*

Also, sometimes the transformation just makes plotting nicer, so you can maybe see a bit more structure in the data.

*) This is probably not really helpful, if you want to estimate a mixture of simpler models eventually? I think I would almost alwasy prefer to put all covariates in (maybe with QR decomp, if they’re highly correlated) and work from there… idk

emiruz · January 21, 2020, 2:36pm

This is useful food for thought. Thank you!

Topic		Replies	Views
Mixture Models Modeling fitting-issues	9	2172	September 16, 2020
Mixture of incrementally nested factor models Modeling specification	6	624	April 20, 2018
Mixture Model Fitting Modeling mixture	6	785	May 31, 2020
Working out an SEM-style alternative to the multivariate normal Modeling techniques , specification	3	763	May 28, 2021
Trouble with Gaussian Mixture Model Modeling	10	3958	December 7, 2018

Modelling linear looking mixtures in data

Related Topics