Hi everyone,
I’m new to compositional data analysis and would appreciate guidance on modeling biopsy data to predict disease state. Each biopsy has 3 cell types (type1, type2, type3), and their absolute areas sum to a total_area (which varies between biopsies, as clinicians who take the biopsy decide the size). The goal is to:
- Determine how cell type composition relates to disease (diseased vs. healthy).
- Advise clinicians on the minimum biopsy size needed for reliable predictions (i.e., quantify uncertainty from biopsy size on prediction).
Data Example:
library(tibble)
data <- tribble(
~diseased, ~type1_area, ~type2_area, ~type3_area, ~total_area,
1L, 20L, 10L, 5L, 35L,
1L, 30L, 20L, 10L, 60L,
1L, 40L, 30L, 15L, 85L,
0L, 10L, 5L, 2L, 17L,
0L, 20L, 10L, 5L, 35L,
0L, 30L, 15L, 10L, 55L,
0L, 15L, 15L, 4L, 34L
)
Some notes on the data:
- Outcome: Binary (diseased = 1/0).
- Predictors: Cell type areas or, better, proportions of cell types (e.g., prop1 = type1_area / total_area). In the full dataset, I also have information on other confounders like age, gender and biopsy location, but I’m focusing on cell types for now.
- Imbalance: In my full dataset, there are 3 times more diseased samples than healthy ones.
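To make the predictors concrete, here is a minimal sketch of how the proportions could be computed from the example rows. It rebuilds the toy data as a base data.frame (an assumption, so the block runs without tibble), and the names prop2 and prop3 simply follow the prop1 pattern.

```r
# Toy data from the example, rebuilt as a base data.frame so this runs on its own
data <- data.frame(
  diseased   = c(1L, 1L, 1L, 0L, 0L, 0L, 0L),
  type1_area = c(20L, 30L, 40L, 10L, 20L, 30L, 15L),
  type2_area = c(10L, 20L, 30L, 5L, 10L, 15L, 15L),
  type3_area = c(5L, 10L, 15L, 2L, 5L, 10L, 4L),
  total_area = c(35L, 60L, 85L, 17L, 35L, 55L, 34L)
)

# Proportion of each cell type within a biopsy
data$prop1 <- data$type1_area / data$total_area
data$prop2 <- data$type2_area / data$total_area
data$prop3 <- data$type3_area / data$total_area

# Sanity check: the three parts of every biopsy sum to 1
stopifnot(all(abs(data$prop1 + data$prop2 + data$prop3 - 1) < 1e-12))
```

Because the three proportions always sum to 1, any one of them is fully determined by the other two, which is the source of the collinearity concern in the questions.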
Questions:
- Modeling Compositional Predictors:
  - Is diseased ~ prop1 + prop2 (omitting prop3) sufficient, or should I use something like a CLR transformation to avoid multicollinearity? Does CLR add value here?
  - I’ve seen warnings about including all proportions (e.g., diseased ~ 0 + prop1 + prop2 + prop3). Is this ever valid?
  - I assume that working with proportions is better than absolute areas. Or is there a better way to model this data directly from the areas?
- Unbalanced Data:
  - Could the class imbalance (more diseased samples) bias results? How should I handle this in brms?
- Uncertainty vs. Biopsy Size:
  - How can I model prediction uncertainty as a function of total_area?
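For reference on the CLR question, this is a minimal base-R sketch of the transform as I understand it (each log part minus the row mean of the log parts). It assumes no cell type has zero area, since the log is undefined there, and the column names clr1 to clr3 are placeholders of mine. The toy rows are rebuilt as a plain data.frame so the block runs on its own.

```r
# Toy data from the example, rebuilt as a base data.frame
data <- data.frame(
  diseased   = c(1L, 1L, 1L, 0L, 0L, 0L, 0L),
  type1_area = c(20L, 30L, 40L, 10L, 20L, 30L, 15L),
  type2_area = c(10L, 20L, 30L, 5L, 10L, 15L, 15L),
  type3_area = c(5L, 10L, 15L, 2L, 5L, 10L, 4L),
  total_area = c(35L, 60L, 85L, 17L, 35L, 55L, 34L)
)

areas <- as.matrix(data[, c("type1_area", "type2_area", "type3_area")])

# Row-wise proportions (the vector is recycled down columns, so row i is divided by total_area[i])
props <- areas / data$total_area

# Centered log-ratio: log of each part minus the row mean of the logs
# (assumes all parts are strictly positive)
clr_from_props <- log(props) - rowMeans(log(props))
colnames(clr_from_props) <- c("clr1", "clr2", "clr3")

# The same transform applied to the raw areas, for comparison
clr_from_areas <- log(areas) - rowMeans(log(areas))

data <- cbind(data, clr_from_props)
```

By construction each row of the CLR values sums to zero, and the result is identical whether it is computed from the proportions or from the raw areas, so the transform on its own discards the information in total_area.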
Any insights on model structure, transformations, or addressing imbalance would be incredibly helpful!
Thanks in advance!