Modeling Heteroscedasticity in Binary Outcome data

Hi everyone,

Following up on my previous post about compositional biopsy data, I have a more specific question about modeling heteroscedasticity in data with a binary outcome (diseased vs healthy). I’m working with patient biopsy data where I want to predict disease state from the proportions of three cell types present in the tissue sample.

I expect more uncertainty in predictions when the total biopsy area is small.
The goal is to be able to advise clinicians on the minimum biopsy size needed for reliable predictions. (the smaller, the better for the patient)

The data looks like this:

data <- tribble(
  ~diseased, ~type1_area, ~type2_area, ~type3_area, ~total_area,
  1L, 20L, 10L, 5L, 35L,
  1L, 30L, 20L, 10L, 60L,
  1L, 40L, 30L, 15L, 85L,
  0L, 10L, 5L, 2L, 17L,
  0L, 20L, 10L, 5L, 35L,
  0L, 30L, 15L, 10L, 55L,
  0L, 15L, 15L, 4L, 34L
)

I tried two approaches.
First, I tried modeling it as a beta distribution with phi varying by total_area:

fit <- brm(
  bf(diseased ~ 0 + type1_ratio + type2_ratio + type3_ratio,
     phi ~ total_area),
  family = "beta",
  data = data
)

This failed because the beta distribution can’t handle 0s and 1s in the outcome.

Then I tried a beta-binomial model, setting trials to 1:

fit2 <- brm(
  bf(diseased | trials(1) ~ 0 + type1_ratio + type2_ratio + type3_ratio,
     phi ~ total_area),
  family = "beta_binomial",
  data = data
)

This runs but gives many divergent transitions, so maybe I should put better priors. Also, I’m unsure if using the beta_binomial distribution with trials(1) even makes theoretical sense here?

Is there a better way to model heteroscedasticity in binary outcomes, specifically to capture uncertainty associated to the total_area?

Thanks in advance!

Assuming that these ratios are the proportions of the biopsy consisting of each type, and noting that based on the example data these are the only possible types (i.e. the ratios should always sum to one), the first thing I’d try is removing type3_ratio as a predictor and re-including the intercept. i.e. bf(diseased | trials(1) ~ type1_ratio + type2_ratio, phi ~ total_area)

Thanks for your reply! You are totally correct with your assumptions. I’ve discussed the issues of different ratios and multicollinearity a bit in the other post I linked above. It seems the opinion of including or excluding the intercept or the third ratio diverge. I guess I can easily test the effect of both by doing model comparison.
Anyways, the most pressing issue for me at this point is how to do model the effect of total area on the uncertainty of the outcome. If your think bf(diseased | trials(1) ~ type1_ratio + type2_ratio, phi ~ total_area) is reasonable, I’ll continue with checking this route.
Thanks again

I don’t promise this will work or solve your problems, but when you have the model
ax + by + c(1-x-y) (no intercept, all three ratios), this is equal to (a-c)x + (b-c)y + c. It seems to me that this formulation has the potential to introduce unnecessarily large correlations between the intercept c and the coefficients a and b. It might be worth at least checking whether switching to the other parameterization makes things nicer.

Thanks @jsocolar for the input. When modelling the ratios, both parameterization seem to work fine. However when doing a CLR transform first and model these transforms. The model only converges with an intercept + 2 CLR component.

With respect to modelling heteroscedastiscity. I am exploring right now the use of obeservation level random effects to model these. I am following allong with The right way to do predictive checks with observation-level random effects – Erik J. Ringen, PhD
It looks like should be able to model the heteroscedasticity in function of biopsy size with the following model

fit <- brm(
  bf(diseased ~ type1_ratio + type2_ratio + (1 + total_area | obs)),
  family = "bernoulli",
  data = data
)

I am new to this, so if someone sees glaring issues with this approach, pleas let me know :)

Thanks

EDIT: fix copy past error in model formula, removed phi parameter

The bernoulli family doesn’t have a distributional parameter phi, so this won’t work as written. Instead, you need to find a way to model the random effect sd as varying by total_area. Hopefully, doing so doesn’t reintroduce any of the problems you were seeing with the beta-binomial model.

You are right. It was a copy past error.
I was actually fitting the model without the phi parameter (see corrected formula in my edit above)