Hi everyone,
Following up on my previous post about compositional biopsy data, I have a more specific question about modeling heteroscedasticity in data with a binary outcome (diseased vs healthy). I’m working with patient biopsy data where I want to predict disease state from the proportions of three cell types present in the tissue sample.
I expect predictions to be more uncertain when the total biopsy area is small.
The goal is to advise clinicians on the minimum biopsy size needed for reliable predictions (the smaller the biopsy, the better for the patient).
The data looks like this:
library(tibble)

data <- tribble(
  ~diseased, ~type1_area, ~type2_area, ~type3_area, ~total_area,
  1L, 20L, 10L,  5L, 35L,
  1L, 30L, 20L, 10L, 60L,
  1L, 40L, 30L, 15L, 85L,
  0L, 10L,  5L,  2L, 17L,
  0L, 20L, 10L,  5L, 35L,
  0L, 30L, 15L, 10L, 55L,
  0L, 15L, 15L,  4L, 34L
)
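The models below use type1_ratio, type2_ratio and type3_ratio, which are not columns in the table above. Here is a minimal base R sketch of how they can be derived, assuming each ratio is simply the cell-type area divided by total_area. The rows are re-declared so the snippet runs on its own.

```r
# Same rows as the table above, re-declared so this snippet is self-contained
data <- data.frame(
  diseased   = c(1L, 1L, 1L, 0L, 0L, 0L, 0L),
  type1_area = c(20L, 30L, 40L, 10L, 20L, 30L, 15L),
  type2_area = c(10L, 20L, 30L, 5L, 10L, 15L, 15L),
  type3_area = c(5L, 10L, 15L, 2L, 5L, 10L, 4L),
  total_area = c(35L, 60L, 85L, 17L, 35L, 55L, 34L)
)

# Sanity check: the three cell-type areas should add up to the total area
area_cols <- c("type1_area", "type2_area", "type3_area")
stopifnot(all(rowSums(data[area_cols]) == data$total_area))

# Assumption: ratio = cell-type area / total biopsy area
data$type1_ratio <- data$type1_area / data$total_area
data$type2_ratio <- data$type2_area / data$total_area
data$type3_ratio <- data$type3_area / data$total_area
```

By construction the three ratios sum to 1 in every row, which is why the model formulas below drop the intercept.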
I tried two approaches.
First, I tried a beta regression with the precision parameter phi varying by total_area:
library(brms)

fit <- brm(
  bf(diseased ~ 0 + type1_ratio + type2_ratio + type3_ratio,
     phi ~ total_area),
  family = "beta",
  data = data
)
This failed because the beta distribution is only defined on the open interval (0, 1), so it can't handle outcomes that are exactly 0 or 1.
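To illustrate why a 0/1 outcome is a problem for a beta likelihood, here is a small base R check of the log-density at the boundaries. It only uses dbeta from the stats package; the shape parameters 2 and 3 are arbitrary values picked for illustration, not anything estimated from my data.

```r
# Log-density of a Beta(2, 3) at the two boundaries and at an interior point.
# The shapes 2 and 3 are arbitrary illustrative choices.
y <- c(0, 0.5, 1)
log_dens <- dbeta(y, shape1 = 2, shape2 = 3, log = TRUE)

# The boundary values have a non-finite log-density, the interior value is finite
print(data.frame(y = y, log_density = log_dens, finite = is.finite(log_dens)))
```

With every observation sitting at 0 or 1, no observation contributes a usable log-likelihood term.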
Then I tried a beta-binomial model, setting trials to 1:
fit2 <- brm(
  bf(diseased | trials(1) ~ 0 + type1_ratio + type2_ratio + type3_ratio,
     phi ~ total_area),
  family = "beta_binomial",
  data = data
)
This runs but produces many divergent transitions, so maybe I need more informative priors. I'm also unsure whether a beta-binomial distribution with trials(1) makes theoretical sense here at all.
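On the theoretical question, one thing that can be checked numerically is whether phi affects the likelihood at all when there is a single trial. Below is a minimal sketch in base R, not brms internals. beta_binomial_pmf is a hypothetical helper written for illustration, and it assumes the mean/precision form alpha = mu * phi, beta = (1 - mu) * phi, which is how I understand brms parameterizes this family. The values of mu and phi are arbitrary.

```r
# Hypothetical helper: beta-binomial pmf in a mean/precision parameterization.
# Assumption: alpha = mu * phi and beta = (1 - mu) * phi.
beta_binomial_pmf <- function(y, size, mu, phi) {
  a <- mu * phi
  b <- (1 - mu) * phi
  choose(size, y) * beta(y + a, size - y + b) / beta(a, b)
}

mu   <- 0.3                   # arbitrary mean
phis <- c(0.5, 2, 10, 100)    # arbitrary precision values

# Single trial: P(y = 1) for each phi, compared with a plain Bernoulli
p_one_trial <- beta_binomial_pmf(1, size = 1, mu = mu, phi = phis)
p_bernoulli <- dbinom(1, size = 1, prob = mu)

# Ten trials, for contrast: P(y = 5) for each phi
p_ten_trials <- beta_binomial_pmf(5, size = 10, mu = mu, phi = phis)

print(rbind(phi = phis, one_trial = p_one_trial, ten_trials = p_ten_trials))
print(p_bernoulli)
```

In this sketch P(y = 1) equals mu for every phi when there is one trial, so the pmf coincides with a Bernoulli and the data would carry no information about phi. With 10 trials the probabilities do change with phi. If that reading is right, it may be part of why the sampler struggles.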
Is there a better way to model heteroscedasticity in binary outcomes, specifically to capture the uncertainty associated with total_area?
Thanks in advance!