Context:
I have 7 time-series variables that represent compositional data - they are proportions that sum to 1 at each point in time. I want to use these as predictors in a regression model to explain a target variable.
The challenge is that these predictors exhibit perfect multicollinearity due to their compositional nature. At any time t, if we know the values of 6 of the predictors, the 7th is deterministic (it must be 1 minus the sum of the other 6).
```r
# Example model formula
target ~ regressor1 + regressor2 + regressor3 + regressor4 + regressor5 + regressor6 + regressor7
```
My question:
What approaches would you recommend for handling compositional predictors in a regression context using brms?
For now, I’ve been dropping one component in my regressions (but am concerned about interpretation).
My understanding is there might be some compositional regression methods out there, but I’m not sure where to start.
Has anyone implemented some solutions successfully with brms in a similar context? Any guidance on the pros/cons of different approaches would be greatly appreciated.
Thanks for posting, @amynang. That’s a really cool approach. @spinkney just applied the isometric log ratio transform to our sum-to-zero vectors, which pose a similar issue, and we’ll be using it to parameterize simplexes. The simple idea is that it’s a cleverer way of turning a simplex with N entries that sum to one into N - 1 values that are independent on the unconstrained scale.
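To make the N to N - 1 idea concrete, here is a minimal base R sketch of an isometric log ratio transform and its inverse. The function names (`ilr_basis`, `ilr`, `ilr_inv`), the Helmert-based choice of basis, and the simulated data are assumptions for illustration only; this is not necessarily the basis or implementation used inside Stan.

```r
# Orthonormal basis for the ILR transform, built from base R's Helmert contrasts.
# The choice of basis is an assumption: any orthonormal basis whose columns are
# orthogonal to the vector of ones gives a valid ILR.
ilr_basis <- function(D) {
  H <- contr.helmert(D)                  # D x (D - 1), each column sums to zero
  sweep(H, 2, sqrt(colSums(H^2)), "/")   # scale each column to unit length
}

# Map each row of X (a composition summing to 1) to D - 1 unconstrained values.
ilr <- function(X) {
  X <- as.matrix(X)
  logX <- log(X)
  clr <- logX - rowMeans(logX)           # centered log ratio, rows sum to zero
  clr %*% ilr_basis(ncol(X))
}

# Inverse map: from D - 1 unconstrained values back to a composition.
ilr_inv <- function(Z) {
  Z <- as.matrix(Z)
  E <- exp(Z %*% t(ilr_basis(ncol(Z) + 1)))
  E / rowSums(E)                         # renormalise so each row sums to 1
}

# Simulated 7-part compositions (hypothetical data, strictly positive parts).
set.seed(1)
raw <- matrix(rexp(5 * 7), nrow = 5)
X <- raw / rowSums(raw)

Z <- ilr(X)            # 5 rows, 6 unconstrained columns
X_back <- ilr_inv(Z)   # round trip recovers X up to floating point error
V <- ilr_basis(7)
```

The six columns of `Z` could then be used as predictors in place of the seven proportions. Note that the log requires strictly positive components, so zeros in the composition would need separate handling.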
The isometric log ratio method, what I like to call the Aitchison method (after John Aitchison, father of compositional data analysis), is a valid way to approach a problem like this.
Aitchison developed his approach for the case where the dependent variable is compositional. Having the independent variables be compositional does not invalidate this approach in the least, but it does suggest treating your problem as a mixture model (in the mixture-experiment sense), which is designed for exactly this situation.
The standard reference in this area is J. Cornell’s “Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data”. If I recall correctly, these methods were developed for cases where the mixture (here, your 7 time series components) was set as part of the experimental design. Think food science recipes or pharmaceutical formulations. Again, the fact that your components are observational shouldn’t make the approach unworkable.
The key idea in modeling is to avoid using an intercept term: because the components sum to 1, the intercept column is exactly collinear with the sum of your main effects. Care must be taken when extending your model to, e.g., interactions, because the sum constraint limits how this can be done (though it can still be done).
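The collinearity with the intercept can be checked directly. The sketch below uses simulated data and ordinary least squares via `lm()` as a stand-in; the coefficient values, sample size, and noise level are all hypothetical. In brms, the analogous no-intercept formula would presumably be `target ~ 0 + regressor1 + ... + regressor7`.

```r
set.seed(42)
n <- 200
K <- 7

# Simulated 7-part compositions; each row sums to 1 (hypothetical data).
raw <- matrix(rexp(n * K), nrow = n)
X <- raw / rowSums(raw)
colnames(X) <- paste0("regressor", 1:K)

# Hypothetical "true" contributions of each component to the response.
beta <- c(2, -1, 0.5, 3, 0, 1, -2)
dat <- data.frame(X, target = as.vector(X %*% beta) + rnorm(n, sd = 0.1))

# With an intercept column, the design matrix has K + 1 columns but only rank K,
# because the column of ones equals the sum of the K components.
rank_with_intercept <- qr(cbind(1, X))$rank

# lm() copes by reporting NA for one aliased coefficient.
fit_int <- lm(target ~ ., data = dat)

# Dropping the intercept keeps all K components and makes the model identifiable.
fit <- lm(target ~ 0 + ., data = dat)
coef(fit)
```

Each coefficient in the no-intercept fit can be read as the expected response for a (hypothetical) pure mixture consisting only of that component.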
While the interpretation of main effects is not as simple as “how does the response change if I change one factor by one unit, keeping all others the same” (because of course you can’t do that), they can be interpreted as contributions to the response, and they can be compared to each other. Note that when using the Aitchison approach, there is no easy way to interpret the main effects, only ratios of them.
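One way to see how the no-intercept main effects relate to the drop-one-component fit mentioned in the question: under least squares the two are the same model in different coordinates. The sketch below uses a hypothetical 3-part composition (`a`, `b`, `c`) with made-up coefficients to keep the output readable. The intercept of the drop-one fit equals the main effect of the dropped component, and each slope is a difference between two main effects.

```r
set.seed(7)
n <- 150

# Simulated 3-part compositions (hypothetical data and coefficients).
raw <- matrix(rexp(n * 3), nrow = n)
X <- raw / rowSums(raw)
colnames(X) <- c("a", "b", "c")
dat <- data.frame(X, target = as.vector(X %*% c(1, 2, 4)) + rnorm(n, sd = 0.1))

# Mixture-style fit: no intercept, all components kept.
fit_mix <- lm(target ~ 0 + a + b + c, data = dat)

# Drop-one fit: intercept kept, component c removed.
fit_drop <- lm(target ~ a + b, data = dat)

# Coefficients of the drop-one fit implied by the mixture fit:
# intercept = effect of c, slope of a = effect of a minus effect of c, etc.
b_mix <- coef(fit_mix)
implied <- c(b_mix["c"], b_mix["a"] - b_mix["c"], b_mix["b"] - b_mix["c"])

cbind(drop_one = coef(fit_drop), implied_from_mixture = implied)
```

So a drop-one slope describes replacing some of the dropped component with the component in question, which is why its interpretation depends on which component was dropped.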