Modeling Compositional Biopsy Data

Hi everyone,

I’m new to compositional data analysis and would appreciate guidance on modeling biopsy data to predict disease state. Each biopsy has 3 cell types (type1, type2, type3), and their absolute areas sum to a total_area (which varies between biopsies, as clinicians who take the biopsy decide the size). The goal is to:

  1. Determine how cell type composition relates to disease (diseased vs. healthy).
  2. Advise clinicians on the minimum biopsy size needed for reliable predictions (i.e., quantify uncertainty from biopsy size on prediction).

Data Example:

library(tibble)  
data <- tribble(  
  ~diseased, ~type1_area, ~type2_area, ~type3_area, ~total_area,  
         1L,         20L,         10L,          5L,         35L,  
         1L,         30L,         20L,         10L,         60L,  
         1L,         40L,         30L,         15L,         85L,  
         0L,         10L,          5L,          2L,         17L,  
         0L,         20L,         10L,          5L,         35L,  
         0L,         30L,         15L,         10L,         55L,  
         0L,         15L,         15L,          4L,         34L    
)  

Some notes on the data:

  • Outcome: Binary (diseased = 1/0).
  • Predictors: Cell type areas or, better, proportions of cell types (e.g., prop1 = type1_area / total_area). In the full dataset, I also have information on potential confounders such as age, gender and biopsy location, but I’m focusing on cell types for now.
  • Imbalance: In my full dataset, there are three times as many diseased samples as healthy ones.

Questions:

  1. Modeling Compositional Predictors:

    • Is diseased ~ prop1 + prop2 (omitting prop3) sufficient, or should I use something like a CLR transformation to avoid multi-collinearity? Does CLR add value here?
    • I’ve seen warnings about including all proportions (e.g., diseased ~ 0 + prop1 + prop2 + prop3). Is this ever valid?
    • I assume that working with proportions is better than absolute areas. Or is there a better way to model this data directly from the areas?
  2. Unbalanced Data:

    • Could the class imbalance (more diseased samples) bias results? How to handle this in brms?
  3. Uncertainty vs. Biopsy Size:

    • How to model prediction uncertainty as a function of total_area?

Any insights on model structure, transformations, or addressing imbalance would be incredibly helpful!
Thanks in advance!

Your modeling choices here should (among other things) be driven by how you’d like to interpret your results.

Using a log-ratio transformation is a standard way of analyzing such data. Note, however, that your predictors are then log-ratios; in other words, they carry only relative information about the types. That may well make sense in your case.
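To make the transformation concrete, here is a minimal base-R sketch using the example data from the question. The object names (areas, props, clr, alr) are my own, and it assumes there are no zero areas, since zeros would need some treatment before taking logs.

```r
# Example areas from the question, one row per biopsy.
areas <- data.frame(
  diseased   = c(1, 1, 1, 0, 0, 0, 0),
  type1_area = c(20, 30, 40, 10, 20, 30, 15),
  type2_area = c(10, 20, 30, 5, 10, 15, 15),
  type3_area = c(5, 10, 15, 2, 5, 10, 4)
)

comp <- as.matrix(areas[, c("type1_area", "type2_area", "type3_area")])

# Proportions: each row is divided by its total area, so rows sum to 1.
props <- comp / rowSums(comp)

# Centred log-ratio (CLR): log of each part minus the row mean of the logs.
# ASSUMPTION: no zero areas (log(0) is -Inf).
log_props <- log(props)
clr <- log_props - rowMeans(log_props)
colnames(clr) <- c("clr1", "clr2", "clr3")

# Additive log-ratio (ALR) with type3 as the reference part.
alr <- log(props[, 1:2] / props[, 3])
colnames(alr) <- c("alr1", "alr2")
```

Note that the three CLR coordinates sum to zero within each biopsy, so they are still linearly dependent; CLR on its own does not remove the collinearity. The ALR gives two free coordinates, at the cost of having to pick a reference type.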

I would not fit a straight linear model while dropping one of the types. Instead, drop the intercept and use all types. The issue here is that the interpretation of your regression parameters cannot be the standard “what is the effect of increasing by one unit, keeping everything else the same?” because, of course, you cannot do that! They can be interpreted as contributions to the response, though. I am not sure what warnings you have seen about this method, but know that it is a standard technique when the proportions are part of an experimental design.
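A quick way to see what is and is not identifiable is to look at the rank of the design matrices. The sketch below uses the example data from the question; the helper rank_of and the object names are my own, and it checks only the linear predictor, not any particular fitting routine.

```r
# Proportions computed from the example areas in the question.
total <- c(35, 60, 85, 17, 35, 55, 34)
props <- data.frame(
  prop1 = c(20, 30, 40, 10, 20, 30, 15) / total,
  prop2 = c(10, 20, 30, 5, 10, 15, 15) / total,
  prop3 = c(5, 10, 15, 2, 5, 10, 4) / total
)

# Three candidate linear predictors.
X_all   <- model.matrix(~ prop1 + prop2 + prop3, data = props)      # intercept + all parts
X_drop  <- model.matrix(~ prop1 + prop2, data = props)              # intercept, prop3 omitted
X_noint <- model.matrix(~ 0 + prop1 + prop2 + prop3, data = props)  # no intercept, all parts

# Hypothetical helper: numerical rank of a design matrix via QR.
rank_of <- function(X) qr(X)$rank

rank_of(X_all)                  # smaller than ncol(X_all): not identifiable
rank_of(X_noint)                # equals ncol(X_noint): identifiable
rank_of(cbind(X_drop, X_noint)) # same rank: both span the same column space
```

Because the proportions sum to 1, the intercept column is exactly prop1 + prop2 + prop3, so the drop-one and no-intercept designs describe the same set of fitted values. What differs is the meaning of the coefficients and, in a Bayesian fit, what the priors on them imply.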

I would stick to your idea of analyzing proportions: working with areas adds variability to the predictors that is due solely to the sample collection. If you wish to use total area to model uncertainty, you could use the areas as weights. I have never implemented this in a Bayesian context, however, so maybe others can help.
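To illustrate why larger biopsies deserve more weight, here is a rough base-R simulation. It rests on a strong assumption that is mine, not something in your data: each unit of area is treated as an independent multinomial draw from a fixed "true" composition. The composition true_props is hypothetical, and real tissue is spatially correlated, so actual noise is likely larger than this suggests.

```r
set.seed(1)

# Hypothetical "true" tissue composition (not estimated from the data).
true_props <- c(type1 = 0.55, type2 = 0.30, type3 = 0.15)

# Biopsy sizes to compare, spanning the range in the example data.
sizes <- c(17, 35, 85)

# Simulate many biopsies of a given size and return the standard deviation
# of the observed type1 proportion.
# ASSUMPTION: area units are independent multinomial draws.
sd_prop1 <- function(total_area, n_sim = 2000) {
  draws <- rmultinom(n_sim, size = total_area, prob = true_props)
  sd(draws[1, ] / total_area)
}

sd_by_size <- sapply(sizes, sd_prop1)
names(sd_by_size) <- sizes
print(round(sd_by_size, 3))
```

Under this assumption the spread of an observed proportion shrinks roughly like 1 / sqrt(total_area), which is the sense in which area could act as a weight. Pushing such simulated proportions through a fitted model would be one way to turn this into a statement about prediction uncertainty at a given biopsy size.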

Class imbalance mostly affects the variability of your estimates. The bias I can see here would be selection: your results are not likely to be applicable to biopsy studies of a different disease with a different incidence. If this happens with the same disease, the concern is that you may be missing another factor that contributes to the incidence.

Good luck.

Hey,
Thanks for your reply!
The warning I saw about adding all proportions is that all proportions sum to 1, making the model unidentifiable. I guess removing the intercept would solve that.
With regard to the uncertainty: I was now thinking of trying to use a beta distribution to model the disease state. In brms, I think I can then just model the precision component (phi) as a function of total_area. However, I am not sure this is a valid approach, especially since the disease state is binary (and not a proportion).

Best regards